CN102214356A - NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method - Google Patents


Info

Publication number
CN102214356A
Authority
CN
China
Prior art keywords
gpu
bnm
image
thread
algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011101503911A
Other languages
Chinese (zh)
Inventor
何立强
张广勇
张艳燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inner Mongolia University
Original Assignee
Inner Mongolia University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inner Mongolia University filed Critical Inner Mongolia University
Priority to CN2011101503911A priority Critical patent/CN102214356A/en
Publication of CN102214356A publication Critical patent/CN102214356A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The invention discloses a best neighborhood matching (BNM) parallel image restoration method based on the NVIDIA GPU platform, comprising the following steps: step 100, scanning an image containing bad blocks, locating the bad blocks and marking them; step 110, designing a BNM parallel algorithm for the GPU platform using CUDA programming techniques; step 120, improving the BNM parallel algorithm designed in step 110 according to CUDA optimization techniques, in combination with the parameters of the GPU model used, the image size and the bad block rate; step 130, according to the optimization results of step 120, selecting the algorithm with the best performance to restore the image. Based on the NVIDIA GPU platform and CUDA parallel programming, the invention realizes a parallel BNM image restoration algorithm.

Description

Best Neighborhood Matching Parallel Image Restoration Method Based on the NVIDIA GPU Platform

Technical Field

The invention relates to a best neighborhood matching parallel image restoration method based on the NVIDIA GPU platform, and belongs to the field of graphics and image processing, in particular to image restoration.

Background Art

In recent decades, research on and applications of digital image processing and networked media have developed rapidly, and the quality demanded of digital images and video keeps rising. Transmission errors are unavoidable when images are transmitted, and many methods have therefore been proposed to improve channel reliability in the hope of obtaining images and video of guaranteed quality. Another way to address the problem is to start from the image itself: using the received image and exploiting its internal correlation, damaged pixels are restored from correctly received pixels. This is the main idea behind error concealment. The best neighborhood matching algorithm, inspired by fractal theory, exploits the similarity between image blocks to restore and reconstruct an image. It restores images damaged during transmission with good visual quality and yields high-quality restored images, but the algorithm itself is computationally expensive and its execution efficiency is low.

In the field of graphics and image processing, error concealment techniques can be used to restore bad blocks. Among the many image restoration methods, the Best Neighborhood Matching (BNM) algorithm performs well. The algorithm uses the pixel information not only of neighboring blocks but also of remote blocks, and therefore achieves better restoration quality.

Figure 1 illustrates how the BNM algorithm operates. The black block denotes a bad block (of size x1*x2 pixels); the pixels surrounding the bad block ((x1+2)*(x2+2) pixels) form the boundary window. Centered on the boundary window, a search area of size r1*r2 (the search window) is formed. Within the search area, remote windows are generated repeatedly with a step of one pixel; a remote window has the same shape and size as the local (boundary) window. Within the search window of the bad block, a remote window of the same size as the bad block that matches it best is found and used to replace the bad block. The search proceeds step by step from the upper-left corner, moving one pixel at a time, and the best-matching remote window is selected when the search finishes. In this way, all bad blocks in an image can be restored one by one.
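To make the search concrete, the following is a minimal serial sketch of the procedure just described, assuming an 8-bit grayscale image stored row-major; the function names, boundary handling and parameterization are illustrative assumptions rather than the patent's reference code:

```cuda
// Minimal serial sketch of the BNM search described above (host code).
// Assumptions not taken from the patent: 8-bit grayscale image stored
// row-major in img, bad block of x1*x2 pixels with top-left corner (bx, by),
// a one-pixel boundary ring, and an r1*r2 search window centered on the block.
#include <cfloat>
#include <cstdint>

// MSE between the boundary ring of the bad block and the corresponding ring
// of a candidate remote window whose top-left corner is (cx, cy).
static float ringMSE(const uint8_t* img, int w, int h,
                     int bx, int by, int cx, int cy, int x1, int x2) {
    float sum = 0.0f;
    int count = 0;
    for (int dy = -1; dy <= x2; ++dy) {
        for (int dx = -1; dx <= x1; ++dx) {
            bool onRing = (dy == -1 || dy == x2 || dx == -1 || dx == x1);
            if (!onRing) continue;                  // interior pixels are lost
            int gx = bx + dx, gy = by + dy;         // pixel of the boundary window
            int rx = cx + dx, ry = cy + dy;         // pixel of the remote window
            if (gx < 0 || gy < 0 || gx >= w || gy >= h) continue;
            if (rx < 0 || ry < 0 || rx >= w || ry >= h) continue;
            float d = (float)img[gy * w + gx] - (float)img[ry * w + rx];
            sum += d * d;
            ++count;
        }
    }
    return count ? sum / count : FLT_MAX;
}

// Exhaustive search over the r1*r2 window, one pixel at a time; returns the
// top-left corner of the best-matching remote window in (*bestX, *bestY).
static void bnmSearch(const uint8_t* img, int w, int h,
                      int bx, int by, int x1, int x2, int r1, int r2,
                      int* bestX, int* bestY) {
    float bestMSE = FLT_MAX;
    *bestX = bx; *bestY = by;
    for (int cy = by - r2 / 2; cy < by + r2 / 2; ++cy)
        for (int cx = bx - r1 / 2; cx < bx + r1 / 2; ++cx) {
            float mse = ringMSE(img, w, h, bx, by, cx, cy, x1, x2);
            if (mse < bestMSE) { bestMSE = mse; *bestX = cx; *bestY = cy; }
        }
    // The bad block is then overwritten with the interior of the window
    // whose corner is (*bestX, *bestY).
}
```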

Since NVIDIA introduced the G80 graphics processor in 2006 (containing 128 stream processors (SPs); the later G200 contains 240 SPs), GPUs have delivered performance gains of more than 100x over CPUs in some massively parallel computing applications. In particular, since NVIDIA released CUDA SDK 1.1, its GPU development platform, in May 2008, GPU-based parallel computing has been widely adopted. CUDA provides a unified compute device architecture for GPU computing, allowing users to integrate GPU programming easily into traditional tools (e.g. Visual Studio, gcc) and languages (e.g. C, C++, FORTRAN). Within just a year, CUDA was applied to accelerate many problems in large-scale parallel computing, such as image processing, physical simulation (e.g. computational fluid dynamics), engineering and financial simulation and analysis, biomedical engineering, databases, data mining, search, and sorting, achieving speedups of one to two orders of magnitude in many applications.

A GPU devotes more transistors to data processing than to data caches and instruction control, which gives it enormous parallel computing capability. In a GPU, the basic data processing unit is the stream processor (SP); eight SPs form a streaming multiprocessor (SM), and a GPU contains multiple SMs. Besides its eight SPs, each SM has several caches (texture memory, constant memory, shared memory) and two special function units (SFUs). Off-chip global memory stores data and carries data transfers between the CPU and the GPU.

CUDA (Compute Unified Device Architecture) serves as the parallel programming language of the GPU. In CUDA terminology the CPU is called the host, and the GPU, acting as a coprocessor, is called the device. In CUDA programming, many threads execute on the GPU simultaneously; threads are grouped into thread blocks, blocks are organized into a grid, and every 32 threads form a warp. Optimization techniques commonly used in CUDA programming include a sensible grid configuration, keeping enough warps on each SM to hide memory access latency, coalesced access to global memory, use of shared memory, use of texture and constant memory, and careful use of registers.
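As a point of reference for this terminology, a minimal CUDA example of the grid/block/thread organization and a kernel launch is given below; the kernel and variable names are illustrative and unrelated to the BNM algorithm:

```cuda
// Minimal CUDA illustration of the grid/block/thread/warp organization above.
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= factor;                    // one element per thread
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    int threadsPerBlock = 128;                       // a multiple of the 32-thread warp
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n, 2.0f);   // grid of blocks of threads
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```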

Among all image restoration algorithms, the Best Neighborhood Matching (BNM) algorithm gives comparatively good restoration quality, but it is computationally expensive and its efficiency needs to be improved. There has been substantial research on BNM both in China and abroad, mostly aimed at improving the full-search version of BNM, producing variants such as the Two-Step Best Neighborhood Matching (TSBNM) algorithm, the evolution-strategy-based BNM algorithm (ES_BNM), the Jump and Look Best Neighborhood Matching (JLBNM) algorithm, and the Improved JLBNM (IJLBNM) algorithm. These improvements are all CPU-side algorithmic refinements; although they yield severalfold speedups, further performance gains are becoming increasingly difficult.

Summary of the Invention

Addressing the shortcomings of the prior art, the present invention provides a BNM parallel image restoration method based on the GPU platform. Images containing bad blocks are restored with a GPU-based BNM parallel error concealment algorithm.

A best neighborhood matching parallel image restoration method based on the NVIDIA GPU platform comprises the following steps:

Step 100: scan the image containing bad blocks, locate the bad blocks and mark them.

Step 110: design a BNM parallel algorithm for the GPU platform using CUDA programming techniques.

Step 120: improve the BNM parallel algorithm designed in step 110 according to CUDA optimization techniques, in combination with the parameters of the GPU model used, the image size and the bad block rate.

Step 130: according to the optimization results of step 120, select the algorithm with the best performance to restore the image.

In the image restoration method, in step 120, one of the following methods or a combination thereof is used for optimization:

A1: according to the resources of the selected GPU, configure the grid, block and thread structure of the parallel program so that the access latency of the global memory can be hidden.

A2: ensure that the global-memory data accessed by the threads of a warp satisfies coalesced access.

A3: use shared memory to hold data that each thread of a thread block uses repeatedly, so that accesses to global memory are reduced.

A4: each streaming multiprocessor (SM) of the GPU has a large number of registers; make full use of them to hold data that a thread uses repeatedly, again reducing accesses to global memory.

A5: the search radius of the BNM algorithm determines the restoration quality, i.e. the PSNR value; set the search radius so that high quality is kept while performance improves. Setting a reasonable search threshold can also speed up restoration.

In the image restoration method, in step A1 the GPU resources refer to the total number of stream processors (SPs) in the selected GPU, the capacity and access latency of the global memory, the compute capability, the maximum number of blocks allowed per grid, the maximum number of threads allowed per block, and the maximum number of threads allowed per SM.
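For illustration, these device limits can be queried at run time through the CUDA runtime API; the sketch below uses cudaGetDeviceProperties, and the particular fields printed are chosen here as examples:

```cuda
// Querying the GPU resources listed above through the CUDA runtime API.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Global memory: %zu MB\n", prop.totalGlobalMem >> 20);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max grid size (x): %d\n", prop.maxGridSize[0]);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}
```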

In the image restoration method, in step A2 a warp refers to a group of 32 threads of the same thread block that run together; the warp is the unit in which threads are scheduled.

In the image restoration method, in step A3 the shared memory refers to a cache structure on the GPU; it resides on the GPU chip and is fast to access, so making full use of it improves the parallel computing speed.

In the image restoration method, in step A5 the threshold is a specific value given for the BNM search; once the computed MSE value reaches the threshold, the search is stopped.

Based on the NVIDIA GPU platform, the invention uses CUDA parallel programming to realize a parallel BNM image restoration algorithm. The traditional serial BNM algorithm runs too slowly to support real-time restoration during video playback, whereas the present invention can restore high-definition images in real time.

Brief Description of the Drawings

Fig. 1: operation of the BNM algorithm;

Fig. 2: structure of the parallel BNM algorithm using coalesced memory access;

Fig. 3: structure of the parallel BNM algorithm using shared memory;

Fig. 4: structure of the parallel BNM algorithm using shared memory;

Fig. 5: structure of a specific embodiment of the invention.

Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings and preferred embodiments. The following embodiments are only intended to illustrate and explain the invention and do not limit its technical solution.

The present invention mainly implements a GPU-based parallel image restoration algorithm. The implementation chiefly consists in applying the various CUDA optimization techniques sensibly and taking the specific GPU model into account.

For example, we choose an NVIDIA GeForce GTX 275 GPU and an Intel i7 920 CPU; the detailed parameters are listed in Table 1.

Table 1: Test environment

Referring to Fig. 5, the GPU-based parallel BNM image restoration process includes the following steps:

Step 100: scan the image containing bad blocks, locate the bad blocks and mark them.

For example, we choose a 1024*1024-pixel image for testing. Assuming a bad block size of 8*8 pixels (x1*x2) and a bad block rate of 15%, the total number of bad blocks is n = (1024*1024)/(8*8)*15% ≈ 2458, the boundary window is (x1+2)*(x2+2) = (8+2)*(8+2) = 10*10 pixels, and the search window is set to r1*r2 = 80*80 pixels.
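A short host-side check of this arithmetic (illustrative only):

```cuda
// Quick host-side check of the example's numbers.
#include <cstdio>

int main() {
    int width = 1024, height = 1024;
    int x1 = 8, x2 = 8;                 // bad block size
    double badRate = 0.15;              // 15% bad block rate
    int n = (int)(width * height / (x1 * x2) * badRate + 0.5);   // ≈ 2458 bad blocks
    printf("bad blocks n = %d\n", n);
    printf("boundary window = %d*%d pixels\n", x1 + 2, x2 + 2);  // 10*10
    printf("search window = 80*80 pixels\n");
    return 0;
}
```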

Step 110: design a BNM parallel algorithm for the GPU platform using CUDA programming techniques.

For the example above, the P_BNM algorithm is designed so that each thread handles the restoration of one bad block, i.e. it finds the best-matching block (the block with the smallest MSE) within the 80*80 search window and performs the replacement. The search advances one pixel at a time, so each thread compares 80*80 candidate blocks against the bad block (computing the MSE for each). With m = 128 threads per thread block, the total number of thread blocks is (n+m-1)/m = (2458+128-1)/128 = 20. The P_BNM algorithm yields only a 1.42x speedup (Intel i7 vs. GTX 275), mainly because the data accessed by the threads of a warp does not satisfy coalesced access, and because only 20 thread blocks are launched while the GPU has 30 streaming multiprocessors (SMs), so the hardware resources are not used effectively.

With this design there is one thread per bad block. Let the number of bad blocks be n and the number of threads per thread block be m (the admissible range of m depends on the chosen GPU model, and different GPU models allow different maximum numbers of threads per block); the number of thread blocks per grid is then (n+m-1)/m (integer division). That is, the thread block is configured as (m, 1, 1) and the grid as ((n+m-1)/m, 1, 1).
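The following sketch shows only this thread-to-bad-block mapping and launch configuration; the per-thread search itself is omitted here (see the serial sketch above for the boundary MSE computation each thread would perform over its 80*80 window). The kernel and array names are illustrative assumptions:

```cuda
// Thread-to-work mapping and launch configuration of the P_BNM design:
// one thread per bad block, m = 128 threads per block, (n + m - 1) / m blocks.
#include <cuda_runtime.h>

__global__ void p_bnm_map(int numBadBlocks, int* owner) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (t >= numBadBlocks) return;                   // threads beyond n do nothing
    owner[t] = t;   // bad block t is restored by thread t (its 80*80 search goes here)
}

void launch_p_bnm(int n, int* d_owner) {
    int m = 128;                        // threads per block, as in the example
    int blocks = (n + m - 1) / m;       // (2458 + 127) / 128 = 20 blocks
    p_bnm_map<<<blocks, m>>>(n, d_owner);
    cudaDeviceSynchronize();
}
```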

Step 120: improve the BNM parallel algorithm designed in step 110 according to CUDA optimization techniques, in combination with the parameters of the GPU model used, the image size and the bad block rate.

In the GPU-based parallel BNM algorithm proposed by the invention, the optimization methods of CUDA programming are applied to raise the execution performance of the serial BNM algorithm. The optimization process is illustrated as follows:

P_BNM_CA algorithm:

The P_BNM algorithm does not use any CUDA optimization techniques and is only a straightforward parallelization. Because P_BNM cannot satisfy coalesced access, its performance is poor; coalesced access can be achieved by subdividing the work of each thread further, which gives the P_BNM_CA algorithm.

When the P_BNM_CA algorithm is used to restore an image containing bad blocks (each bad block being an n1*n2 block), the bad blocks are distributed randomly, so there is no data dependence between bad blocks, which matches the characteristics of GPU parallel processing. At the same time, to satisfy coalesced global memory access, as shown in Fig. 2, the invention sets the number of thread blocks in the grid equal to the number of bad blocks, so that each thread block handles one bad block, and sets the number of threads per thread block equal to the search radius of the BNM algorithm (if the search radius is R, each thread block also has R threads). Each thread processes one column of the search range, so the data accessed by the 32 threads of a warp satisfies coalesced global memory access. In the example, one thread block handles the restoration of one bad block, each thread block has 80 threads, and each thread within a block searches only one column of the 80*80 search range; each thread finds the block with the smallest MSE in its column, and the 80 threads of the block then take the block with the overall smallest MSE as the best match. The total number of thread blocks is n = 2458.

This scheme makes the search-window data accessed by the threads of a warp contiguous, so coalesced access is well satisfied; at the same time the total of 2458 thread blocks makes good use of the computing power of the SMs. The performance of this algorithm reaches 18.38 times that of the basic parallel BNM algorithm.

During the restoration of each bad block, 80*80 candidate windows must be compared with the bad-block window within the 80*80 search window, and the boundary pixels of the bad-block window are used in every comparison; the one-pixel ring around an 8*8 window contains 36 pixels. These 36 pixel values can therefore be kept in shared memory; this gives the P_BNM_CA_SH method, whose performance improves to 22.22 times.

P_BNM_CA_SH algorithm:

The CUDA programming model recommends using shared memory to improve performance. In the serial BNM algorithm, every candidate within the R*R search range is compared against the (n1+n2+2)*2 pixels surrounding the n1*n2 bad block to compute its MSE. As shown in Fig. 3, P_BNM_CA_SH builds on the coalesced-access P_BNM_CA algorithm by placing the values of these (n1+n2+2)*2 pixels in shared memory, improving performance by reducing accesses to global memory.

In addition, as shown in Fig. 4, the pixels of the top and bottom rows are accessed multiple times (generally n1 times) in each comparison, so shared memory can also be used to hold these pixels, reducing global memory accesses and improving performance.
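The sketch below combines the two designs just described (one thread block per bad block, one thread per column of the search window for coalesced access, and the bad block's neighborhood cached in shared memory), using the sizes from the example (8*8 bad blocks, 80*80 search window). The kernel name, the simple final reduction by thread 0, and the border handling are illustrative assumptions; a full implementation would also skip candidate windows that overlap damaged regions:

```cuda
// Combined sketch of the P_BNM_CA / P_BNM_CA_SH design: one thread block per
// bad block, one thread per column of the search window (coalesced reads
// across a warp), boundary pixels of the bad block cached in shared memory.
#include <cuda_runtime.h>
#include <cfloat>

#define BLK 8        // bad block edge (x1 = x2 = 8 in the example)
#define SEARCH 80    // search window edge; also the number of threads per block

__global__ void p_bnm_ca_sh_kernel(unsigned char* img, int w, int h,
                                   const int2* badBlocks) {
    int b = blockIdx.x;        // thread block b restores bad block b
    int t = threadIdx.x;       // thread t scans column t of the search window
    int bx = badBlocks[b].x, by = badBlocks[b].y;   // top-left corner of the bad block

    // Cache the (BLK+2)*(BLK+2) neighborhood of the bad block in shared memory;
    // only its one-pixel boundary ring (36 pixels for 8*8 blocks) is compared.
    __shared__ float nbhd[(BLK + 2) * (BLK + 2)];
    for (int i = t; i < (BLK + 2) * (BLK + 2); i += blockDim.x) {
        int dx = i % (BLK + 2) - 1, dy = i / (BLK + 2) - 1;
        int gx = bx + dx, gy = by + dy;
        nbhd[i] = (gx >= 0 && gy >= 0 && gx < w && gy < h)
                      ? (float)img[gy * w + gx] : -1.0f;   // -1 marks "outside image"
    }
    __syncthreads();

    // Column search: candidate corners (cx, cy) with cx fixed per thread, so
    // adjacent threads of a warp read adjacent global-memory addresses.
    int cx = bx - SEARCH / 2 + t;
    float best = FLT_MAX;
    int bestY = by;
    for (int cy = by - SEARCH / 2; cy < by + SEARCH / 2; ++cy) {
        float sum = 0.0f; int cnt = 0;
        for (int dy = -1; dy <= BLK; ++dy)
            for (int dx = -1; dx <= BLK; ++dx) {
                if (dy != -1 && dy != BLK && dx != -1 && dx != BLK) continue;
                float ref = nbhd[(dy + 1) * (BLK + 2) + (dx + 1)];
                int rx = cx + dx, ry = cy + dy;
                if (ref < 0.0f || rx < 0 || ry < 0 || rx >= w || ry >= h) continue;
                float d = ref - (float)img[ry * w + rx];
                sum += d * d; ++cnt;
            }
        float mse = cnt ? sum / cnt : FLT_MAX;
        if (mse < best) { best = mse; bestY = cy; }
    }

    // Block-level reduction: thread 0 picks the best of the 80 column results
    // and copies the winning window over the bad block.
    __shared__ float s_mse[SEARCH];
    __shared__ int s_y[SEARCH];
    s_mse[t] = best; s_y[t] = bestY;
    __syncthreads();
    if (t == 0) {
        int bestT = 0;
        for (int i = 1; i < SEARCH; ++i)
            if (s_mse[i] < s_mse[bestT]) bestT = i;
        int wx = bx - SEARCH / 2 + bestT, wy = s_y[bestT];
        for (int dy = 0; dy < BLK; ++dy)
            for (int dx = 0; dx < BLK; ++dx) {
                int rx = wx + dx, ry = wy + dy;
                if (rx >= 0 && ry >= 0 && rx < w && ry < h)
                    img[(by + dy) * w + bx + dx] = img[ry * w + rx];
            }
    }
}

// Launch: one thread block per bad block, SEARCH (= 80) threads per block.
void launch_p_bnm_ca_sh(unsigned char* d_img, int w, int h,
                        const int2* d_badBlocks, int numBadBlocks) {
    p_bnm_ca_sh_kernel<<<numBadBlocks, SEARCH>>>(d_img, w, h, d_badBlocks);
    cudaDeviceSynchronize();
}
```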

P_BNM_CA_SH_R algorithm:

Starting from the P_BNM_CA_SH method, performance can be improved further by adjusting the search radius (setting a reasonable search threshold can also speed up restoration), giving P_BNM_CA_SH_R; reducing the search radius moderately does not noticeably degrade the quality of image restoration. For the example above, changing the search radius to 40 pixels raises the performance to 61.23 times, while the PSNR value drops only from 34.70 to 34.56, a decrease of 0.14. Restoring a 1024*1024-pixel image with P_BNM_CA_SH_R takes only 28 ms, which satisfies real-time restoration well.

Step 130: according to the above optimization steps, select the algorithm with the best performance to restore the image.

For the example above, the P_BNM_CA_SH_R method performs best, so it can be used to restore the 1024*1024-pixel image with a bad block rate of 15%.

The test results given above show that the powerful parallel computing capability of the GPU can accelerate the image restoration computation by factors of tens to a hundred.

The above description is only a preferred embodiment of the present invention, and the scope of protection of the invention is not limited to it; any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the appended claims.

Claims (8)

1. A best neighborhood matching parallel image restoration method based on the NVIDIA GPU platform, characterized in that it comprises the following steps:
Step 100, scanning an image containing bad blocks, locating the bad blocks and marking them;
Step 110, designing a BNM parallel algorithm for the GPU platform according to CUDA programming techniques;
Step 120, improving the BNM parallel algorithm designed in step 110 according to CUDA optimization techniques, in combination with the parameters of the GPU model used, the image size and the bad block rate;
Step 130, according to the optimization results of step 120, selecting the algorithm with the best performance to restore the image.
2. The image restoration method according to claim 1, characterized in that, in step 120, one of the following methods or a combination thereof is used for optimization:
A1, according to the resources of the selected GPU, configuring the grid, block and thread structure of the parallel program so that the access latency of the global memory can be hidden;
A2, ensuring that the global-memory data accessed by the threads of a warp satisfies coalesced access;
A3, using shared memory to store data used repeatedly by each thread of a thread block, thereby reducing accesses to the global memory;
A4, each streaming multiprocessor (SM) of the GPU having a large number of registers, making full use of the registers to hold data used repeatedly within a thread, thereby reducing accesses to the global memory;
A5, the search radius of the BNM algorithm determining the restoration quality, i.e. the PSNR value, setting the search radius so that both high quality and a performance gain are obtained, and reasonably setting the search threshold to speed up the restoration.
3. The image restoration method according to claim 2, characterized in that, in step A1, the resources of the GPU refer to the total number of stream processors (SPs) in the selected GPU, the capacity and access latency of the global memory, the compute capability, the maximum number of blocks allowed per grid, the maximum number of threads allowed per block, and the maximum number of threads allowed per streaming multiprocessor (SM).
4. The image restoration method according to claim 2, characterized in that, in step A2, the warp refers to a group of threads (32 threads) running simultaneously in the same thread block, and the warp is the scheduling unit of threads.
5. The image restoration method according to claim 2, characterized in that, in step A3, the shared memory refers to a cache structure on the GPU; the shared memory is located on the GPU chip and has a fast access speed, and making full use of the shared memory can increase the parallel computing speed.
6. The image restoration method according to claim 2, characterized in that, in step A5, the PSNR value refers to the peak signal-to-noise ratio of the image.
7. The image restoration method according to claim 2, characterized in that, in step A5, the threshold refers to a specific value given for the BNM search; once the computed MSE value reaches the threshold, the search is stopped.
8. The image restoration method according to claim 2, characterized in that the MSE refers to the mean squared error between two blocks of the same size in the image.
CN2011101503911A 2011-06-07 2011-06-07 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method Pending CN102214356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011101503911A CN102214356A (en) 2011-06-07 2011-06-07 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011101503911A CN102214356A (en) 2011-06-07 2011-06-07 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method

Publications (1)

Publication Number Publication Date
CN102214356A true CN102214356A (en) 2011-10-12

Family

ID=44745651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011101503911A Pending CN102214356A (en) 2011-06-07 2011-06-07 NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method

Country Status (1)

Country Link
CN (1) CN102214356A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412886A (en) * 2013-07-18 2013-11-27 北京航空航天大学 Music melody matching method based on pitch sequence
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
CN104123119A (en) * 2014-07-07 2014-10-29 北京信息科技大学 Dynamic vision measurement feature point center quick positioning method based on GPU
CN104408685A (en) * 2014-11-28 2015-03-11 华南理工大学 Collision elimination method of large-scale object group mixing CPU (central processing unit) and GPU (graphics processing unit)
CN106791852A (en) * 2016-12-14 2017-05-31 深圳市迪威码半导体有限公司 image processing method and device
CN113918356A (en) * 2021-12-13 2022-01-11 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101389037A (en) * 2008-09-28 2009-03-18 湖北科创高新网络视频股份有限公司 Time-space domain split multiple state video coding method and device
US20100150253A1 (en) * 2008-12-11 2010-06-17 Sy-Yen Kuo Efficient Adaptive Mode Selection Technique For H.264/AVC-Coded Video Delivery In Burst-Packet-Loss Networks

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101389037A (en) * 2008-09-28 2009-03-18 湖北科创高新网络视频股份有限公司 Time-space domain split multiple state video coding method and device
US20100150253A1 (en) * 2008-12-11 2010-06-17 Sy-Yen Kuo Efficient Adaptive Mode Selection Technique For H.264/AVC-Coded Video Delivery In Burst-Packet-Loss Networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张艳燕 (Zhang Yanyan): "最佳邻域匹配图像恢复算法的改进与并行化研究" [Research on the Improvement and Parallelization of the Best Neighborhood Matching Image Restoration Algorithm], 《中国优秀硕士学位论文全文数据库》 (China Master's Theses Full-text Database) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412886A (en) * 2013-07-18 2013-11-27 北京航空航天大学 Music melody matching method based on pitch sequence
CN104123119A (en) * 2014-07-07 2014-10-29 北京信息科技大学 Dynamic vision measurement feature point center quick positioning method based on GPU
CN104123119B (en) * 2014-07-07 2017-05-10 北京信息科技大学 Dynamic vision measurement feature point center quick positioning method based on GPU
CN104102513A (en) * 2014-07-18 2014-10-15 西北工业大学 Kepler-architecture based CUDA (compute unified device architecture) runtime parameter transparent-optimization method
CN104102513B (en) * 2014-07-18 2017-06-16 西北工业大学 A kind of CUDA runtime parameter transparent optimization methods based on Kepler frameworks
CN104408685A (en) * 2014-11-28 2015-03-11 华南理工大学 Collision elimination method of large-scale object group mixing CPU (central processing unit) and GPU (graphics processing unit)
CN106791852A (en) * 2016-12-14 2017-05-31 深圳市迪威码半导体有限公司 image processing method and device
CN106791852B (en) * 2016-12-14 2019-08-16 深圳市迪威码半导体有限公司 Image processing method and device
CN113918356A (en) * 2021-12-13 2022-01-11 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium
CN113918356B (en) * 2021-12-13 2022-02-18 广东睿江云计算股份有限公司 Method and device for quickly synchronizing data based on CUDA (compute unified device architecture), computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105869117A (en) Method for accelerating GPU directed at deep learning super-resolution technology
CN102214356A (en) NVIDIA graphics processing unit (GPU) platform-based best neighborhood matching (BNM) parallel image recovering method
CN103218174B (en) The mutual multinuclear disposal route of a kind of IO Double buffer for remote sensing images
CN106991011A (en) A method for processing big data tasks based on CPU multi-threading and GPU multi-granularity parallelism and collaborative optimization
CN110059793B (en) Stepwise modification of generative adversarial neural networks
CN102547289B (en) Fast motion estimation method realized based on GPU (Graphics Processing Unit) parallel
Guan et al. Leveraging the power of multi-core platforms for large-scale geospatial data processing: Exemplified by generating DEM from massive LiDAR point clouds
Song et al. A parallel Canny edge detection algorithm based on OpenCL acceleration
CN105931256A (en) CUDA (compute unified device architecture)-based large-format remote sensing image fast segmentation method
CN105550974A (en) GPU-based acceleration method of image feature extraction algorithm
Feng et al. Parallelization and characterization of SIFT on multi-core systems
CN103020620A (en) Remote-sensing image ortho-rectification method based on central processing unit (CPU) and graphics processing unit (GPU) coprocessing
CN108897616A (en) Non-down sampling contourlet transform optimization method based on concurrent operation
Wu et al. Gomvs: Geometrically consistent cost aggregation for multi-view stereo
CN106991638A (en) A Multi-granularity Parallel Optimization Method Based on Harris-DOG Feature Extraction of Sequence Images
CN116052233A (en) Neural network optimization method, device, computing equipment and storage medium
Prakash et al. Accelerating computer vision algorithms on heterogeneous edge computing platforms
KR102064581B1 (en) Apparatus and Method for Interpolating Image Autoregressive
Kim et al. Optimizing seam carving on multi-GPU systems for real-time content-aware image resizing
CN108629798A (en) Rapid Image Registration method based on GPU
Wu et al. Content-aware transformer for all-in-one image restoration
CN110866885B (en) A template-configurable N-pixel parallel grayscale morphological filter circuit and method
Oancea et al. Approximate nearest-neighbour fields via massively-parallel propagation-assisted kd trees
CN102867181A (en) Characteristic extraction module for digital image processing and traversing method
CN111612685B (en) GPU Dynamic Adaptive Acceleration Method for Remote Sensing Image

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20111012