CN105550974A - GPU-based acceleration method of image feature extraction algorithm - Google Patents


Publication number
CN105550974A
CN105550974A
Authority
CN
China
Prior art keywords
gpu
algorithm
image
cpu
data
Prior art date
Legal status
Pending
Application number
CN201510915260.6A
Other languages
Chinese (zh)
Inventor
张为华
鲁云萍
Current Assignee
Fudan University
Original Assignee
Fudan University
Priority date
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201510915260.6A priority Critical patent/CN105550974A/en
Publication of CN105550974A publication Critical patent/CN105550974A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 — General purpose image data processing
    • G06T 1/20 — Processor architectures; Processor configuration, e.g. pipelining
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2200/00 — Indexing scheme for image data processing or generation, in general
    • G06T 2200/28 — Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of parallel processors and relates to a GPU-based acceleration method for an image feature extraction algorithm. In the method, mainstream image feature extraction algorithms are implemented on the GPU with fine-grained parallelism and optimized and accelerated according to the characteristics of the GPU; a cooperative asynchronous-pipeline mechanism makes the CPU and GPU compute collaboratively. Test results show that with an Intel Q8300 CPU and a GTX 260 GPU the algorithm runs at 172.33 frames per second, 67 times the speed of the serial algorithm; with an Intel i7 CPU and a GTX 295 GPU the speed reaches 340.47 frames per second, which better satisfies real-time processing requirements.

Description

GPU-based acceleration method for an image feature extraction algorithm
Technical field
The invention belongs to the technical field of parallel algorithms, and specifically relates to an acceleration method for image feature extraction algorithms.
Background technology
As humanity enters the digital age, massive amounts of data are produced every day in different fields. Multimedia data types such as images and video have become one of the main data types. How to effectively select information from ever-growing image and video data has attracted more and more research attention. Compared with traditional text-based applications, applications centered on multimedia data, such as search engines, filtering systems, and copy detection, have increasingly broad functional requirements. Among them, the image feature extraction algorithm, as the fundamental algorithm for image/video information retrieval, can effectively extract the information of key frames in images or video for comparison between images or videos, and is widely used as the main algorithm in such systems.
In terms of the computation process, an image retrieval algorithm can be divided into a feature extraction stage and a feature matching stage. The feature extraction stage extracts the features of an image according to a feature extraction algorithm, including the image's color features, texture features, or feature points (e.g., particularly bright points). The feature matching stage compares the features of two images to judge whether they match. In terms of accuracy, algorithms based on local features use hundreds or thousands of features (such as feature points) to represent one image, so their accuracy is high and they are adopted by more and more application systems. Mainstream local feature algorithms include SIFT and SURF. However, because such algorithms must process a very large amount of data and are themselves very complex, their processing speed is severely limited and cannot meet users' real-time requirements in some applications. Therefore, how to effectively improve the processing speed of local feature extraction algorithms has become a research hotspot in the architecture and retrieval fields.
In recent years, with the development of semiconductor technology and the spread of multi-core technology, various parallel computing systems have gradually become the mainstream of application design. As the versatility and programmability of the graphics processing unit (GPU) have improved, it has become an indispensable component of such systems. A modern GPU is not only a simple image processing engine but also a highly parallel programmable processor. Compared with a CPU with the same number of transistors, a GPU devotes more transistors to computation. Its highly data-parallel nature gives the GPU more powerful arithmetic processing capability; at the same time, it also has higher memory bandwidth and a lower price, which makes it highly competitive in high-performance computing.
The local feature extraction algorithm in image retrieval exhibits multiple forms of parallelism, which makes its parallel implementation on the GPU possible. Meanwhile, the powerful computing capability of the GPU provides a strong basis for improving the performance of the local feature extraction algorithm. Therefore, the present invention is devoted to using the GPU to optimize and accelerate image retrieval algorithms based on local feature extraction.
Summary of the invention
The object of the invention is to provide a method for optimizing and accelerating current mainstream image feature extraction algorithms.
The method provided by the invention for accelerating an image feature extraction algorithm mainly makes use of GPU technology. The invention implements the image feature extraction algorithm on the GPU with fine-grained parallelism and optimizes it according to the characteristics of the GPU; meanwhile, a cooperative asynchronous-pipeline mechanism makes the CPU and GPU work together, thereby further improving processing efficiency.
The invention first implements a fine-grained parallel image feature extraction algorithm on the GPU. Fine-grained means that each stage of the local feature extraction algorithm exploits data parallelism at the smallest granularity, e.g., per feature point, which makes fuller use of the GPU's massively parallel computing capability. The invention uses the CUDA (Compute Unified Device Architecture) programming model to map this fine-grained data parallelism onto the GPU. The selected local feature retrieval algorithm is SURF (Speeded-Up Robust Features), a current mainstream algorithm. Before describing the concrete parallel implementation, we first introduce the main stages of the local feature retrieval algorithm, and then explain how each stage is implemented on the GPU.
The local feature retrieval algorithm detects image features and describes them; it mainly consists of three stages: image initialization, feature detection, and feature description, as shown in Figure 1.
Image initialization is divided into three steps: loading the image, computing the gray-scale image, and computing the integral image. The feature detection part uses the integral image to detect the feature points of the image (e.g., particularly bright points among dark ones). The feature description part then uses a specific data structure to describe the feature points found, to facilitate later processing of the image. In the local feature retrieval algorithm, the computation time is concentrated in the feature detection and feature description parts, so the invention places these two parts on the GPU.
In feature detection, the characteristic value of each point is computed first. The SURF algorithm adopts sampling to avoid computing the characteristic value of every pixel. When this stage is mapped onto the GPU, since the characteristic value of each point is computed relatively independently, each sampled pixel becomes one GPU thread (the number of threads in each GPU stage is shown in parentheses in Figure 1). Because this stage contains many conditional statements, which can hurt GPU performance, the invention uses o kernel launches in a loop to avoid branching on the GPU, where o is the number of image layers in the SURF algorithm. After the characteristic value of each point has been computed, the feature point localization stage begins. In this stage, within every 8 neighboring points (covering 2 image layers), the point with the largest characteristic value is selected as a candidate feature point; therefore every 8 points are computed by one GPU thread.
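The per-thread localization step can be sketched on the CPU as follows. This is an illustrative sketch, not the patent's CUDA code: we simplify the 2-layer neighborhood into flat groups of 8 response values, and each loop iteration over `base` plays the role of one GPU thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Within each group of 8 neighboring response values, select the index
// of the largest one as a candidate feature point, mirroring the
// "one GPU thread per 8 points" mapping described above.
std::vector<std::size_t> select_candidates(const std::vector<float>& responses) {
    std::vector<std::size_t> candidates;
    for (std::size_t base = 0; base + 8 <= responses.size(); base += 8) {
        std::size_t best = base;  // the work one GPU thread would do
        for (std::size_t j = base + 1; j < base + 8; ++j)
            if (responses[j] > responses[best]) best = j;
        candidates.push_back(best);
    }
    return candidates;
}
```

On the GPU all groups are examined concurrently; the sketch only shows the per-group logic.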
Feature description can be divided into three steps: computing Haar wavelet responses, computing the feature orientation, and building the feature window. The SURF algorithm first computes Haar wavelet responses for all points inside a circle of radius 6*scale centered on the feature point, so there are 109 points around each feature point; when this is mapped onto the GPU, each thread computes the Haar wavelet response of one point. Then, when computing the feature orientation, SURF uses an orientation sector size of 0.15 radians, yielding 2*π/0.15 ≈ 42 orientation sectors; the 109 surrounding points vote for these 42 sectors, and the sector with the most votes gives the orientation of the feature point. In this stage each sector is one GPU thread. Finally, each feature point generates a 64-dimensional feature vector, computed from the 4*4 region grid around the feature point; therefore each feature point is further divided into 16 GPU threads.
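The stage sizes quoted above can be double-checked with a short sketch. One assumption is ours: we interpret the 109 sample points as the integer lattice offsets strictly inside a circle of radius 6, measured in units of the feature scale.

```cpp
#include <cassert>
#include <cmath>

// Count integer lattice offsets strictly inside a circle of the given
// radius; with radius 6 this reproduces the 109 sample points.
int circle_sample_count(int radius) {
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx)
            if (dx * dx + dy * dy < radius * radius) ++count;
    return count;
}

// Number of 0.15-radian orientation sectors covering the full circle:
// ceil(2*pi / 0.15) = 42.
int orientation_sectors(double sector_width) {
    const double kTwoPi = 6.283185307179586;
    return static_cast<int>(std::ceil(kTwoPi / sector_width));
}

// Descriptor length: a 4x4 grid of subregions with 4 accumulated values
// each, hence the 16 GPU threads per feature point and 64 dimensions.
int descriptor_length() { return 4 * 4 * 4; }
```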
On the basis of fine-grained parallelism, the invention further optimizes and accelerates the algorithm using the characteristics of the GPU. Specifically, it exploits GPU memory and the behavior of GPU texture memory when processing 2-dimensional data to improve algorithm performance, and reduces the algorithm's repeated allocation and release of memory as much as possible.
GPU texture memory provides hardware support for two- and three-dimensional locality: for a two-dimensional array bound to texture memory, when one pixel is accessed, its neighboring pixels above, below, left, and right are placed in the texture cache at the same time. The SURF algorithm performs a large number of fast approximate integral computations; these computations access the integral-image array, and the accesses have obvious two-dimensional locality. Using the GPU texture memory for this two-dimensional data therefore yields a good performance boost. Concretely, a two-dimensional array can be bound to GPU texture memory with the CUDA cudaBindTexture2D method.
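The fast approximate integral computation mentioned above reduces any box-filter sum to four reads of the integral image. The following CPU-side sketch (ours, for illustration) shows this four-corner access pattern; it is exactly these clustered reads of nearby rows and columns that give the accesses their two-dimensional locality.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Image = std::vector<std::vector<int>>;

// I[r][c] = sum of all pixels above and to the left of (r, c), inclusive.
Image integral_image(const Image& img) {
    std::size_t h = img.size(), w = img[0].size();
    Image I(h, std::vector<int>(w, 0));
    for (std::size_t r = 0; r < h; ++r)
        for (std::size_t c = 0; c < w; ++c)
            I[r][c] = img[r][c] + (r ? I[r - 1][c] : 0) + (c ? I[r][c - 1] : 0)
                    - (r && c ? I[r - 1][c - 1] : 0);
    return I;
}

// Sum of the source image over rows r1..r2 and columns c1..c2 (inclusive)
// using only 4 reads of the integral image.
int box_sum(const Image& I, int r1, int c1, int r2, int c2) {
    int a = (r1 && c1) ? I[r1 - 1][c1 - 1] : 0;
    int b = r1 ? I[r1 - 1][c2] : 0;
    int c = c1 ? I[r2][c1 - 1] : 0;
    return I[r2][c2] - b - c + a;
}
```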
Meanwhile, when the algorithm stores the original image, the integral image, the characteristic values, and other variables, it allocates and frees memory at the beginning and end of processing each image. When processing pictures in batches, such repeated allocation and release of GPU memory is redundant and hurts performance. The invention instead allocates fixed memory in the program's startup phase, thereby reducing unnecessary memory allocation and release.
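The fixed-allocation idea can be sketched as follows; the class and its names are ours, not the patent's. A pool of image-sized buffers is allocated once at program start and handed out round-robin, so processing a new picture zeroes a recycled buffer instead of allocating and freeing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

class FrameBufferPool {
public:
    FrameBufferPool(std::size_t slots, std::size_t bytes_per_slot)
        : buffers_(slots, std::vector<unsigned char>(bytes_per_slot)),
          next_(0) {}

    // Hand out the next slot, cleared for reuse; no per-image allocation.
    std::vector<unsigned char>& acquire() {
        std::vector<unsigned char>& buf = buffers_[next_];
        next_ = (next_ + 1) % buffers_.size();
        std::fill(buf.begin(), buf.end(), 0);  // reset instead of realloc
        return buf;
    }

    std::size_t slots() const { return buffers_.size(); }

private:
    std::vector<std::vector<unsigned char>> buffers_;
    std::size_t next_;
};
```

On the GPU side the same pattern would use one cudaMalloc per buffer at startup instead of vectors.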
The invention also makes the CPU and GPU work together in an asynchronous pipeline. The traditional flow of CPU-GPU cooperation is: (1) the CPU computes the data to be provided; (2) the CPU copies the data from CPU memory to GPU memory; (3) the GPU computes. In such a flow, CPU resources are idle while the GPU computes; likewise, the GPU must always wait for the CPU to finish computing and transferring data before it can proceed. CPU computation, data transfer, and GPU computation are effectively serialized. To overlap the execution time of these three parts and further improve performance, the invention adopts an asynchronous pipeline.
The so-called pipeline divides the whole algorithm into two parts: the CPU handles the first part of the computation, the GPU handles the second part, and data is streamed between the CPU and GPU, so the two processors work concurrently. That is, while the CPU computes the next picture, the GPU can simultaneously process the previous picture, achieving parallel computation. In the concrete implementation, the CPU is responsible for initializing the data and copying it into GPU memory, and the GPU reads data such as the integral image from memory and performs the feature detection and feature description computations. Since the speed of CPU image initialization roughly matches the speed of GPU feature detection and description, the CPU and GPU stay effectively parallel. Furthermore, because the GPU supports asynchronous DMA transfers, combining DMA with the pipeline also overlaps the data transfer time, forming a three-stage pipeline of CPU computation, data transfer, and GPU computation. Figuratively, while the CPU initializes the i-th picture, the initialization data of picture i-1 is being transferred to GPU memory, and the GPU is processing the feature detection and description of picture i-2. Through this asynchronous pipeline mechanism, CPU computation, data transfer, and GPU computation proceed effectively in parallel, improving performance.
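The three-stage overlap can be written out as a schedule; this is an illustrative sketch (ours): in time step t the CPU initializes picture t, the DMA engine transfers picture t-1, and the GPU runs detection and description on picture t-2, with -1 marking an idle stage during pipeline fill and drain.

```cpp
#include <cassert>
#include <vector>

struct Step { int cpu_init; int dma_copy; int gpu_compute; };  // -1 = idle

// Schedule for `pictures` images; the pipeline needs 2 extra steps to drain.
std::vector<Step> pipeline_schedule(int pictures) {
    std::vector<Step> steps;
    for (int t = 0; t < pictures + 2; ++t) {
        Step s;
        s.cpu_init    = (t < pictures) ? t : -1;
        s.dma_copy    = (t >= 1 && t - 1 < pictures) ? t - 1 : -1;
        s.gpu_compute = (t >= 2 && t - 2 < pictures) ? t - 2 : -1;
        steps.push_back(s);
    }
    return steps;
}
```

In steady state all three columns are busy, which is where the speedup over the serialized flow comes from.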
In addition, the invention makes full use of the CPU's spare resources, letting spare cores independently run part of the algorithm to improve performance. Because multi-core processors are now common, the remaining CPU computing resources also deserve to be used. Assuming a 4-core CPU, one core controls the GPU and data transfer and another core runs the CPU computation, so the other two cores can independently perform the feature extraction computation for part of the pictures, as shown in Figure 2.
Test results show that when the hardware is an Intel Q8300 CPU and a GTX 260 GPU, the algorithm runs at 172.33 frames per second, 67 times the speed of the serial version. When the hardware is an Intel i7 CPU and a GTX 295 GPU, the speed reaches 340.47 frames per second, which meets real-time processing requirements.
Brief description of the drawings
Figure 1 is a schematic diagram of the fine-grained GPU implementation of the image feature extraction algorithm. Grey marks the parts computed on the GPU and white the parts computed on the CPU. For each stage, the number in parentheses indicates how many threads compute that stage on the GPU; if square brackets are present, the number inside indicates how many kernels the stage's computation is divided into.
Figure 2 is a schematic diagram of the GPU-CPU asynchronous pipeline cooperation mechanism.
Figure 3 shows the performance test results.
Figure 4 is the specific code of the buildDet method.
Figure 5 is partial code of the buildDetKernel method.
Figure 6 is partial code of the pipeline.
Embodiment
The technology of the invention is described in detail below with reference to the accompanying drawings and the source code of the program. The invention mainly uses the CUDA programming model to map the local feature extraction algorithm in the image feature extraction algorithm onto the GPU in a fine-grained parallel manner, and further optimizes the GPU implementation through GPU-specific characteristics and a CPU-GPU collaborative working mode. The local feature retrieval algorithm chosen by the invention is the mainstream SURF (Speeded-Up Robust Features) algorithm. The concrete realization of this technique is described in detail below, and its performance is tested.
(1) Fine-grained GPU implementation
The invention maps the feature detection and feature description stages of the image retrieval algorithm onto the GPU. As described above and shown in Figure 1, these two stages can be divided into several small steps. When computing characteristic values, the parallel granularity is one sampled pixel; during feature point localization, every 8 sampled pixels form one unit of parallel granularity. When computing Haar wavelet responses, 109 pixels around each feature point must be computed, and each surrounding pixel is one GPU thread. Then these 109 surrounding points vote for the 42 orientation sectors of the feature point, so in the orientation computation stage each orientation sector is one GPU thread. Finally, when each feature point generates its vector, the computation is done over the 4*4 region grid around the feature point, and each region is one GPU thread in this step.
In the concrete code implementation we use the CUDA programming model; here only the characteristic value computation step of feature detection is taken as an example, since the other steps are implemented similarly. Two methods are involved in the characteristic value computation:
void buildDet(float *m_det, int width, int height);
__global__ void buildDetKernel(float *m_det, int o, int border, int step);
The buildDet method internally calls the buildDetKernel method. Because buildDetKernel is declared __global__, it runs on the GPU. m_det is a pointer to an array that stores the characteristic values of the image; this array must be allocated on the GPU before the kernel is called, using the cudaMalloc method. Meanwhile, data such as the image width and height and the pixel values of the original integral image must be transferred to GPU memory for use in subsequent computation: simple parameters can be transferred with cudaMemcpyToSymbol, while for larger arrays one can first allocate a region on the GPU with cudaMalloc and then copy the data from CPU memory to GPU memory with cudaMemcpy.
The specific code of the buildDet method is shown in Figure 4, with the statements deserving particular attention underlined. For each octave of the image, the characteristic value of each sampled pixel (chosen at intervals of step) must be computed; the concrete computation happens in buildDetKernel. We divide the original image into small image blocks of BLOCK_W*BLOCK_H, with default width and height of 16*8. Each image block also spans 4 intervals, so each image block contains 16*8*4 sampled pixels. Each image block corresponds to the concept of a Block in CUDA GPU programming, and GPU threads in the same Block share GPU shared memory. The threads() method defines each of the 16*8*4 pixels in an image block as an independent thread, so the characteristic value computation is indeed parallel at the granularity of sampled pixels. In total there are ((ws+BLOCK_W-1)/BLOCK_W) * ((hs+BLOCK_H-1)/BLOCK_H) Blocks, where ws and hs are the image width and height with the border of unsampled points removed. After the data has been partitioned, buildDetKernel can be called.
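The launch geometry just described can be recomputed with a short sketch, assuming the default BLOCK_W = 16, BLOCK_H = 8, and 4 intervals per octave: each CUDA block holds 16*8*4 = 512 threads, and the grid rounds the sampled image size up to whole tiles.

```cpp
#include <cassert>

constexpr int BLOCK_W = 16;
constexpr int BLOCK_H = 8;
constexpr int INTERVALS = 4;

// Threads per CUDA block: a 16x8 tile of sampled pixels times 4 intervals.
int threads_per_block() { return BLOCK_W * BLOCK_H * INTERVALS; }

// ws, hs: image width/height with the border of unsampled points removed.
// The ceiling division rounds partial tiles up to whole blocks.
int blocks_in_grid(int ws, int hs) {
    return ((ws + BLOCK_W - 1) / BLOCK_W) * ((hs + BLOCK_H - 1) / BLOCK_H);
}
```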
The computation in buildDetKernel runs on the GPU; every GPU thread runs identical code, only with different data inputs. The partial code of the buildDetKernel method is shown in Figure 5. At the start, the method computes which pixel of the image this thread computes, identified by c (column), r (row), and i (interval).
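The index computation at the start of the kernel can be sketched like this (ours, not the patent's exact Figure 5 code), assuming the 16x8x4 block shape and a sampling interval step; blockIdx/threadIdx are passed as plain ints so the mapping can be exercised on the CPU.

```cpp
#include <cassert>

struct Work { int c; int r; int i; };  // column, row, interval

// Derive the sampled pixel and interval a thread computes from its
// block/thread coordinates.
Work decode(int blockIdxX, int blockIdxY,
            int threadIdxX, int threadIdxY, int threadIdxZ, int step) {
    Work w;
    w.c = (blockIdxX * 16 + threadIdxX) * step;  // sampled column
    w.r = (blockIdxY * 8 + threadIdxY) * step;   // sampled row
    w.i = threadIdxZ;                            // interval within octave
    return w;
}
```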
(2) Memory optimization using GPU characteristics
On the basis of the fine-grained parallel GPU version, the invention further optimizes the algorithm using the memory behavior of the GPU. The memory optimization comprises two aspects: first, GPU texture memory provides support for two-dimensional locality, which can be exploited for the two-dimensionally local array accesses in the algorithm; second, reducing the overhead of repeated memory allocation during image processing.
GPU texture memory provides hardware support for two- and three-dimensional locality, and the SURF algorithm contains a large number of fast approximate integral computations with obvious two-dimensional locality; using the GPU texture memory for this two-dimensional data yields a good performance boost. Concretely, a two-dimensional array can be bound to texture memory with the CUDA cudaBindTexture2D method. The invention binds the integral-image array to 2D texture memory, thereby exploiting the locality support of the texture cache.
Meanwhile, when the algorithm stores the original image, the integral image, the characteristic values, and other variables, it allocates and frees memory at the beginning and end of processing each image. When processing pictures in batches, such repeated allocation and release of GPU memory is redundant and hurts performance. The invention instead allocates fixed memory in the program's startup phase, thereby reducing unnecessary memory allocation and release.
(3) Cooperation mechanism of CPU and GPU
Although the GPU's computing power is strong, the GPU still needs the CPU for auxiliary control during computation; therefore, how to coordinate the CPU and GPU to work at full efficiency becomes a key factor affecting performance. The CPU-GPU cooperation mechanism of the invention comprises two aspects, as shown in Figure 2: (1) making the CPU and GPU work concurrently via an asynchronous pipeline; (2) fully exploiting the CPU's remaining computing power.
The so-called pipeline divides the whole algorithm into two parts: the CPU handles the first part of the computation, the GPU handles the second part, and data is streamed between the CPU and GPU, so the two processors work concurrently. That is, while the CPU computes the next picture, the GPU can simultaneously process the previous picture, achieving parallel computation. In our pipeline implementation, the CPU mainly does the image loading and initialization work, and the GPU does the feature detection and description computation; partial code of the pipeline is shown in Figure 6. In it:
The left part of Figure 6 is partial code of the CPU image-loading threads, and the right part is partial code of the GPU control threads. index identifies which CPU loading thread or GPU control thread this is; each CPU loading thread corresponds to one GPU control thread (each GPU control thread drives the computation of one GPU). id is the thread identifier and GPUNUM is the number of GPU threads; in our implementation, the first GPUNUM threads are CPU loading threads and the next GPUNUM threads are GPU control threads. Each thread traverses the pictures; multiple loading threads or multiple GPU control threads visit the pictures in round-robin fashion. With two loading threads, the first thread accesses the odd-numbered pictures and the second the even-numbered ones. In a loading thread, int_img is the data structure of the loaded image, containing the picture width, the height, and the value of each pixel in the picture. The pixel values form a very large array, and reallocating memory for every picture would hurt performance, so during memory optimization we create a cache with IMGBUFSIZE picture-sized arrays allocated in advance; the arrays in the cache are recycled. Each time a new image data structure is created, one of the cached arrays is assigned to its int_img and zeroed, thus reducing memory allocation time. After the data is loaded, int_img is placed into the imgs_g array and the corresponding flag_g is set to 1; both the imgs_g array and the flag_g array have size IMGSIZE and can be accessed by the GPU control threads. A GPU control thread accesses the pictures in order: if the flag of the i-th picture is not 1 it waits until the flag becomes 1, i.e., until the picture has been loaded, then calls the CUDA method detDes(), with the image data to be computed in imgs_g[i]. Each time a picture is loaded, loadnumber[index]++ is executed; each time the GPU computation of a picture completes, processed[index]++ is executed. When loadnumber[index] - processed[index] exceeds the cache size, the loading thread is running too fast and must wait for the GPU computation.
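The round-robin assignment and the backpressure test just described can be sketched as two small predicates (an illustrative sketch; the function names are ours, not identifiers from Figure 6).

```cpp
#include <cassert>

// With gpunum loading threads, the loader with the given index handles
// pictures index, index + gpunum, index + 2*gpunum, ...
bool loader_owns_picture(int index, int gpunum, int picture) {
    return picture % gpunum == index;
}

// A loading thread must wait when it has run more than one cache's worth
// of pictures ahead of the GPU consuming them, i.e. when
// loadnumber[index] - processed[index] exceeds the cache size.
bool loader_must_wait(int loaded, int processed, int cache_size) {
    return loaded - processed > cache_size;
}
```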
Because the GPU supports asynchronous DMA transfers, combining DMA with the pipeline also overlaps the data transfer time, forming a three-stage pipeline of CPU computation, data transfer, and GPU computation. Figuratively, while the CPU initializes the i-th picture, the initialization data of picture i-1 is being transferred to GPU memory, and the GPU is processing the feature detection and description of picture i-2. Concretely, as in the code above, before a GPU control thread calls the CUDA method to compute the i-th picture, both picture i and picture i+1 must already be loaded; inside the detDes() method, the cudaMemcpy2DAsync() method is used to transfer the data of picture i+1 asynchronously, thus overlapping the data transfer time with the GPU computation time.
In addition, the invention makes full use of the CPU's spare resources, letting spare cores independently run part of the algorithm to improve performance. Assuming a 4-core CPU, one core controls the GPU and data transfer and another core runs the CPU computation, so the other two cores can independently perform the feature extraction computation for part of the pictures, as shown in Figure 2.
(4) Performance test
The invention also includes detailed test results. The test host has a 4-core Intel Q8300 CPU and 2 GB of memory. The GPU used for testing is a GeForce GTX 260, which has 27 SMs, 216 cores in total, a clock rate of 1.24 GHz, and 1 GB of video memory. The host operating system is Ubuntu 8.10 (Linux kernel version 2.6.27-7-generic).
As shown in Figure 3, when the local image feature extraction algorithm executes serially on the CPU, each picture takes 393 milliseconds, i.e., about 2.56 pictures are processed per second, far below real-time processing speed. Running the SURF algorithm in parallel on the CPU (with the fastest, block-parallel scheme) raises the speed to 14.86 frames per second. The fine-grained GPU implementation raises the algorithm to 74.45 frames per second, 5 times the CPU parallel version. With the GPU memory optimization and the cooperation with the CPU, the speed rises to 84.56 and 172.33 frames per second respectively. This speed reaches real-time processing speed.
The invention was further tested with an Intel i7 CPU and a GTX 295 GPU (480 cores); test results show that under this configuration, the speed of the local feature extraction algorithm reaches 340.47 frames per second.

Claims (4)

1. A GPU-based acceleration method for an image feature extraction algorithm, characterized in that:
a fine-grained parallel image feature extraction algorithm is first implemented on the GPU, where fine-grained means that each stage of the local feature extraction algorithm exploits data parallelism at the smallest granularity, i.e., per feature point; the CUDA programming model is used to map this fine-grained data parallelism onto the GPU; the local feature retrieval algorithm adopted is the SURF retrieval algorithm;
the local feature retrieval algorithm detects image features and describes them, and is divided into three stages: image initialization, feature detection, and feature description; the feature detection and feature description parts are computed on the GPU;
the image initialization is divided into three steps: loading the image, computing the gray-scale image, and computing the integral image;
the feature detection uses the integral image to detect the feature points of the image; the characteristic value of each point is computed first; when this stage is mapped onto the GPU, since the characteristic value of each point is computed relatively independently, each sampled pixel is one GPU thread; specifically, o kernel launches in a loop are used to avoid branch computation on the GPU, where o is the number of image layers in the SURF algorithm; after the characteristic value of each point has been computed, the feature point localization stage begins, in which, within every 8 neighboring points, the point with the largest characteristic value is selected as a candidate feature point; every 8 points are computed by one GPU thread;
the feature description uses a specific data structure to describe the feature points found, to facilitate later processing of the image; first, Haar wavelet responses are computed for the 109 pixels around each feature point, each surrounding pixel being one GPU thread in this stage; then the 109 surrounding points vote for the 42 orientation sectors of the feature point, and the sector with the most votes gives the orientation of the feature point, each sector being one GPU thread in this stage; finally, each feature point generates a 64-dimensional feature vector, computed from the 4*4 region grid around the feature point, so each feature point is further divided into 16 GPU threads.
2. The GPU-based acceleration method for an image feature extraction algorithm according to claim 1, characterized in that: GPU memory and the behavior of GPU texture memory when processing 2-dimensional data are used to improve algorithm performance, and the algorithm's repeated allocation and release of memory is reduced as much as possible; specifically, the two-dimensional arrays involved in the SURF algorithm are bound to GPU texture memory with the CUDA cudaBindTexture2D method;
At the same time, when the SURF algorithm stores the original image, the integral image, the feature values and other variables, memory would otherwise be allocated and released at the start and the end of the processing of every image; when processing pictures in batch, fixed memory is instead allocated once in the start-up phase of the program, thereby reducing unnecessary memory allocation and release.
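The allocate-once pattern of claim 2 can be illustrated with a CPU-side sketch in which NumPy arrays stand in for GPU allocations (the class and method names are hypothetical):

```python
import numpy as np

class BatchBuffers:
    """Allocate working buffers once for the whole batch, instead of an
    allocate/release pair per image, as the claim describes."""
    def __init__(self, height, width):
        self.gray = np.empty((height, width))
        self.integral = np.empty((height, width))

    def process(self, rgb):
        # Reuse the same buffers for every image in the batch.
        self.gray[...] = (rgb[..., 0] * 0.299 +
                          rgb[..., 1] * 0.587 +
                          rgb[..., 2] * 0.114)
        np.cumsum(self.gray, axis=0, out=self.integral)
        np.cumsum(self.integral, axis=1, out=self.integral)
        return self.integral
```

On the GPU the analogous move is a single cudaMalloc per buffer at start-up instead of per image, since device allocation is far more expensive than reuse.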
3. The GPU-based acceleration method of an image feature extraction algorithm according to claim 2, characterized in that: the CPU and the GPU work cooperatively in an asynchronous, pipelined manner;
The so-called pipelined manner means that the whole algorithm is divided into two parts: the CPU handles the computation of the first part, the GPU handles the computation of the second part, and data is transferred between the CPU and the GPU in a streaming fashion, so that the two pieces of hardware work concurrently. Here, the CPU is responsible for initializing the data and transferring it into GPU memory, while the GPU reads data such as the integral image from that memory and performs the feature detection and feature description computations;
At the same time, because the GPU supports asynchronous DMA transfers, combining them with the pipeline further hides the data transfer time, forming a three-stage pipeline of CPU computation, data transfer, and GPU computation.
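The three-stage pipeline of claim 3 can be sketched with one worker per hardware unit, plain Python threads standing in for CUDA streams and the DMA engine (all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(images, cpu_stage, transfer, gpu_stage):
    # One single-worker pool per hardware unit: CPU, DMA engine, GPU.
    with ThreadPoolExecutor(max_workers=1) as cpu, \
         ThreadPoolExecutor(max_workers=1) as dma, \
         ThreadPoolExecutor(max_workers=1) as gpu:
        done = []
        for img in images:
            f1 = cpu.submit(cpu_stage, img)                     # init + integral image
            f2 = dma.submit(lambda f=f1: transfer(f.result()))  # async host-to-device copy
            f3 = gpu.submit(lambda f=f2: gpu_stage(f.result())) # detection + description
            done.append(f3)
        return [f.result() for f in done]
```

With single-worker pools per stage, image i+1 can be in CPU preprocessing while image i is in transfer and image i-1 is being computed on the "GPU", which is the overlap the claim describes.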
4. The GPU-based acceleration method of an image feature extraction algorithm according to claim 3, characterized in that: the surplus resources of the CPU are also utilized, letting the otherwise idle CPU cores independently complete part of the algorithm's processing.
CN201510915260.6A 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm Pending CN105550974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510915260.6A CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510915260.6A CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Publications (1)

Publication Number Publication Date
CN105550974A true CN105550974A (en) 2016-05-04

Family

ID=55830150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510915260.6A Pending CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Country Status (1)

Country Link
CN (1) CN105550974A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719275A (en) * 2009-11-23 2010-06-02 中国科学院计算技术研究所 Image feature point extracting and realizing method, image copying and detecting method and system thereof
CN103530224A (en) * 2013-06-26 2014-01-22 郑州大学 Harris corner detecting software system based on GPU
US20140225902A1 (en) * 2013-02-11 2014-08-14 Nvidia Corporation Image pyramid processor and method of multi-resolution image processing
CN105069743A (en) * 2015-07-28 2015-11-18 中国科学院长春光学精密机械与物理研究所 Detector splicing real-time image registration method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANG Shu et al.: "CUDA for High-Performance GPU Computing" (《GPU高性能运算之CUDA》), 31 December 2009, China Water & Power Press *
WANG Zhiguo: "Research and Implementation of GPU Acceleration of the Local Feature Algorithm SURF", China Master's Theses Full-text Database, Information Science and Technology Series *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067158B (en) * 2016-05-26 2019-09-06 东方网力科技股份有限公司 A kind of feature comparison method and device based on GPU
CN106067158A (en) * 2016-05-26 2016-11-02 东方网力科技股份有限公司 A kind of feature comparison method based on GPU and device
WO2018000724A1 (en) * 2016-06-28 2018-01-04 北京大学深圳研究生院 Cdvs extraction process acceleration method based on gpgpu platform
US10643299B2 (en) 2016-06-28 2020-05-05 Peking University Shenzhen Graduate School Method for accelerating a CDVS extraction process based on a GPGPU platform
CN107122244A (en) * 2017-04-25 2017-09-01 华中科技大学 A kind of diagram data processing system and method based on many GPU
CN107122244B (en) * 2017-04-25 2020-02-14 华中科技大学 Multi-GPU-based graph data processing system and method
CN107809643B (en) * 2017-11-13 2020-11-20 苏州浪潮智能科技有限公司 Image decoding method, device and medium
CN107809643A (en) * 2017-11-13 2018-03-16 郑州云海信息技术有限公司 A kind of coding/decoding method of image, device and medium
WO2020000383A1 (en) * 2018-06-29 2020-01-02 Baidu.Com Times Technology (Beijing) Co., Ltd. Systems and methods for low-power, real-time object detection
CN111066058B (en) * 2018-06-29 2024-04-16 百度时代网络技术(北京)有限公司 System and method for low power real-time object detection
CN111066058A (en) * 2018-06-29 2020-04-24 百度时代网络技术(北京)有限公司 System and method for low power real-time object detection
US11741568B2 (en) 2018-06-29 2023-08-29 Baidu Usa Llc Systems and methods for low-power, real-time object detection
CN110414534A (en) * 2019-07-01 2019-11-05 深圳前海达闼云端智能科技有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN110414534B (en) * 2019-07-01 2021-12-03 达闼机器人有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN111024078B (en) * 2019-11-05 2021-03-16 广东工业大学 Unmanned aerial vehicle vision SLAM method based on GPU acceleration
CN111024078A (en) * 2019-11-05 2020-04-17 广东工业大学 Unmanned aerial vehicle vision SLAM method based on GPU acceleration
CN111124920A (en) * 2019-12-24 2020-05-08 北京金山安全软件有限公司 Equipment performance testing method and device and electronic equipment
CN111462060A (en) * 2020-03-24 2020-07-28 湖南大学 Method and device for detecting standard section image in fetal ultrasonic image
CN116739884A (en) * 2023-08-16 2023-09-12 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU
CN116739884B (en) * 2023-08-16 2023-11-03 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU

Similar Documents

Publication Publication Date Title
CN105550974A (en) GPU-based acceleration method of image feature extraction algorithm
US11120304B2 (en) On-the-fly deep learning in machine learning at autonomous machines
US11727246B2 (en) Convolutional neural network optimization mechanism
CN110288509B (en) Computing optimization mechanism
TWI625697B (en) Block operations for an image processor having a two-dimensional execution lane array and a two-dimensional shift register
Du et al. Interactive ray tracing on reconfigurable SIMD MorphoSys
CN106095588B (en) CDVS extraction process accelerated method based on GPGPU platform
TW201921314A (en) Image processor, method performed by the same, and non-transitory machine readable storage medium
US10748238B2 (en) Frequent data value compression for graphics processing units
Daga et al. Implementation of parallel image processing using NVIDIA GPU framework
WO2018063480A1 (en) Graphics processor register renaming mechanism
Clemons et al. A patch memory system for image processing and computer vision
DE112020000902T5 (en) PRE-CALL DATA FOR GRAPHIC DATA PROCESSING
Cui et al. Real-time stereo vision implementation on Nvidia Jetson TX2
CN110634106A (en) Apparatus and method for conservative morphological antialiasing with multisampling
Wang et al. A CUDA-enabled parallel algorithm for accelerating retinex
JP2021077342A (en) Dynamically dividing activation and kernels for improving memory efficiency
CN109844802B (en) Mechanism for improving thread parallelism in a graphics processor
JP2021099779A (en) Page table mapping mechanism
CN105701760A (en) Histogram real-time generation method of geographic raster data optional polygon area
CN101783021A (en) Method for speeding up DR image processing by using operation of GPU
Liu et al. Parallel program design for JPEG compression encoding
Zhou et al. Gpu-based sar image lee filtering
US11972518B2 (en) Hybrid binning
Cheng et al. Performance optimization of vision apps on mobile application processor

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160504

WD01 Invention patent application deemed withdrawn after publication