CN105550974A - GPU-based acceleration method of image feature extraction algorithm - Google Patents

GPU-based acceleration method of image feature extraction algorithm

Info

Publication number
CN105550974A
CN105550974A
Authority
CN
China
Prior art keywords
gpu
feature
image
algorithm
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510915260.6A
Other languages
Chinese (zh)
Inventor
张为华 (Zhang Weihua)
鲁云萍 (Lu Yunping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201510915260.6A
Publication of CN105550974A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00: Indexing scheme for image data processing or generation, in general
    • G06T2200/28: Indexing scheme for image data processing or generation, in general involving image processing hardware

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of parallel processors and relates to a GPU-based acceleration method for an image feature extraction algorithm. According to the method, the current mainstream image feature extraction algorithms are implemented on the GPU with fine-grained parallelism and optimized according to the characteristics of the GPU; a cooperative work mechanism based on asynchronous pipelines makes the CPU and GPU compute cooperatively. Test results show that with hardware configured as an Intel Q8300 CPU and a GTX 260 GPU, the algorithm runs at 172.33 frames per second, 67 times the speed of the serial algorithm; with an Intel i7 CPU and a GTX 295 GPU, the speed reaches 340.47 frames per second, which satisfies real-time processing requirements.

Description

Image feature extraction algorithm acceleration method based on GPU
Technical Field
The invention belongs to the technical field of parallel processors, and particularly relates to an acceleration method for an image feature extraction algorithm.
Background
As humans enter the digital age, vast amounts of data are produced every day across different fields. Multimedia data types, such as images and videos, have become one of the main data types. How to effectively screen the information in this ever-growing volume of image/video data has attracted increasing research interest. Compared with traditional text applications, applications centered on multimedia data, such as retrieval engines, filtering systems, and copy detection, face increasingly broad practical demand. The image feature extraction algorithm, an important basic algorithm for image/video information retrieval and screening, can effectively extract the information of key frames in images or videos for comparison between them, and is widely used as a main algorithm in such application systems.
In terms of computation, an image retrieval algorithm can be divided into a feature extraction stage and a feature matching stage. The feature extraction stage extracts the features of an image according to a feature extraction algorithm; the features include color features and texture features of the image, or feature points in the image (such as particularly bright points). The feature matching stage judges whether two images match by comparing their features. In terms of accuracy, algorithms based on local features use hundreds of features (e.g., feature points) to represent an image and therefore achieve high accuracy, so they are adopted by more and more application systems. Currently the dominant local feature algorithms are SIFT and SURF. However, because these algorithms must process huge amounts of data and are computationally complex, their processing speed is severely limited, and in some applications the real-time requirements of users cannot be met. How to effectively improve the processing speed of local feature extraction algorithms has therefore become a research hotspot in the architecture and retrieval fields.
In recent years, with the development of semiconductor technology and the popularization of multi-core technology, various parallel computing systems are becoming the mainstream of application design. With the growing versatility and programmability of the graphics processing unit (GPU), it too has become an indispensable component. Modern GPUs are no longer simple image processing engines but highly parallel programmable processors. For the same transistor budget, a GPU devotes more transistors to computation than a CPU does. Its highly data-parallel nature gives the GPU more powerful arithmetic processing capability, and it also offers higher memory bandwidth at a lower price, making it very competitive in the high-performance computing field.
The local feature extraction algorithms used in image retrieval exhibit multiple forms of parallelism, which makes parallel implementation on a GPU possible. Meanwhile, the strong computing power of the GPU provides a solid basis for improving the performance of local feature extraction. The invention therefore aims to optimize and accelerate image retrieval algorithms based on local feature extraction by using the GPU.
Disclosure of Invention
The invention aims to provide an optimization and acceleration method for the current mainstream image feature extraction algorithm.
The method for optimizing and accelerating the image feature extraction algorithm mainly utilizes the GPU technology. The image feature extraction algorithm is implemented on the GPU in a fine-grained parallel mode, and optimization is carried out according to the characteristics of the GPU; meanwhile, a cooperative work mechanism of an asynchronous pipeline is adopted to enable the CPU and the GPU to work cooperatively, so that the processing efficiency is further improved.
The invention first implements the parallel image feature extraction algorithm on the GPU at fine granularity, where fine granularity means that data parallelism is exploited at the smallest granularity available in each stage of the local feature extraction algorithm, for example per feature point. Fine-grained parallelism exercises the parallel computing capability of the GPU more fully. The invention uses the CUDA (Compute Unified Device Architecture) programming model to map the fine-grained data onto the GPU for parallel computation. The selected local feature retrieval algorithm is SURF (Speeded-Up Robust Features), currently the mainstream retrieval algorithm. Before describing the specific parallel implementation, we first introduce the main stages of the local feature retrieval algorithm, and then describe how each stage is implemented on the GPU.
The local feature retrieval algorithm detects image features and describes the features, and mainly comprises three stages: image initialization, feature detection, and feature description, as shown in fig. 1.
Image initialization comprises three steps: loading the image, computing the gray image, and computing the integral image. Feature detection uses the integral image to detect the feature points of the image (for example, particularly bright points among dark surroundings). Feature description then describes the found feature points with a specific data structure to facilitate later processing of the image. In the local feature retrieval algorithm, the main computation time is concentrated in feature detection and feature description, so the invention places these two parts on the GPU for calculation.
In feature detection, the feature value of each point is calculated first. The SURF algorithm adopts sampling to avoid calculating a feature value for every pixel. When this stage is mapped onto the GPU, each sampled pixel becomes one GPU thread, because the feature value calculations of individual points are mutually independent (the number of threads for each stage on the GPU is shown in parentheses in fig. 1). Because this stage contains many conditional statements, which hurt GPU performance, o kernel launches are used in a loop to avoid branch computation on the GPU, where o is the number of image layers in the SURF algorithm. Once the feature value of each point has been calculated, the feature point localization stage begins. This stage selects the point with the largest feature value among every 8 neighboring points (spanning 2 image layers) as a candidate feature point; accordingly, each group of 8 points is handled by one GPU thread.
Feature description is divided into three steps: computing Haar wavelet values, computing the feature direction, and creating the feature window. First, by definition in the SURF algorithm, the Haar wavelet transform is computed for all points within a circle centered on the feature point with a radius of 6 scales, so each feature point has 109 surrounding points; when this step is mapped onto the GPU, each thread computes the Haar wavelet transform of one point. Then, when computing the feature direction, the SURF algorithm uses 0.15 radians as the size of a direction region, giving 2π/0.15 ≈ 42 feature direction regions; the 109 surrounding points vote among the 42 regions, and the direction with the most votes becomes the direction of the feature point. Each feature region is one GPU thread in this stage. Finally, each feature point generates a 64-dimensional feature vector, computed from the 4 × 4 region around the feature point, so each feature point is divided among 16 GPU threads for calculation.
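As an illustration only (the kernel name, array layout, and launch configuration below are assumptions, not the patent's code), the voting step just described could map onto the GPU as one block per feature point and one thread per direction region:

// Hedged sketch: one thread per direction region, one block per feature point.
// Launch: orientationKernel<<<numPts, 42>>>(d_angles, d_dir);
__global__ void orientationKernel(const float *angles, float *dominantDir)
{
    const int NUM_BINS = 42;        // 2*pi / 0.15 radians, rounded
    const int NUM_NEIGHBORS = 109;  // sampled points in the radius-6-scale circle

    int pt  = blockIdx.x;           // the feature point this block handles
    int bin = threadIdx.x;          // the direction region this thread handles

    // Assumed layout: angles[pt * 109 + k] is the Haar response angle of the
    // k-th sampled point around feature point pt.
    __shared__ int votes[NUM_BINS];
    float lo = bin * 0.15f;         // this region covers [lo, lo + 0.15) radians
    int n = 0;
    for (int k = 0; k < NUM_NEIGHBORS; ++k) {
        float a = angles[pt * NUM_NEIGHBORS + k];
        if (a >= lo && a < lo + 0.15f) ++n;   // this point votes for the region
    }
    votes[bin] = n;
    __syncthreads();

    if (bin == 0) {                 // thread 0 picks the winning region
        int best = 0;
        for (int b = 1; b < NUM_BINS; ++b)
            if (votes[b] > votes[best]) best = b;
        dominantDir[pt] = best * 0.15f + 0.075f;   // centre of the winning region
    }
}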
On the basis of fine-grained parallelism, the method uses the characteristics of the GPU for further optimization and acceleration. Specifically, it exploits the GPU texture memory, a GPU memory suited to 2-dimensional data, to improve algorithm performance, and it reduces the algorithm's repeated allocation and release of memory as far as possible.
GPU texture memory provides hardware support for two-dimensional and three-dimensional locality. That is, for a two-dimensional array in texture memory, when one pixel is accessed, the pixels above, below, left, and right of it are placed into the texture cache at the same time. The SURF algorithm contains many fast approximate integral computations that access the integral array and exhibit significant two-dimensional locality, so using the GPU texture memory yields a good performance improvement for these two-dimensional memory accesses. In CUDA, the cudaBindTexture2D method can be used to bind a specific two-dimensional array to GPU texture memory.
Meanwhile, when the algorithm stores variables such as the original image, the integral image, and the feature values, memory is allocated and released at the start and end of processing each image. When pictures are processed in batches, such repeated allocation and release of GPU memory is redundant and hurts performance. The invention allocates fixed memory at the initial stage of the program, thereby eliminating the redundant allocation and release.
The invention also makes the CPU and GPU work cooperatively in an asynchronous pipeline mode. Traditionally, the flow of cooperative CPU-GPU computing is: (1) the CPU computes the data to be provided; (2) the CPU transfers the data from CPU memory to GPU memory; (3) the GPU performs the calculation. In such a flow, CPU resources sit idle while the GPU computes; likewise, the GPU must always wait for the CPU to finish computing and transferring data before it can proceed. In effect, CPU computation, data transfer, and GPU computation are performed serially. To overlap these three execution times and thereby further improve performance, the invention adopts an asynchronous pipeline.
In the pipeline mode, the whole algorithm is divided into two parts: the CPU handles the computation of the first part, the GPU handles the computation of the second part, and data is transmitted between them in a stream, so that the two pieces of hardware work in parallel. That is, while the CPU computes the next picture, the GPU can simultaneously process the previous picture, achieving parallel computation. In the specific implementation of this technique, the CPU is dedicated to computing the initialization data and transmitting it into GPU memory, while the GPU reads the data, such as the integral image, out of memory and performs the feature detection and feature description computations. Since the speed at which the CPU initializes an image is roughly equal to the speed of GPU feature detection and description, the CPU and GPU can effectively stay in parallel. Moreover, because the GPU supports DMA asynchronous transfer, combining it with the pipeline allows the data transfer time to be overlapped as well, forming a three-stage pipeline of CPU computation, data transfer, and GPU computation. Concretely, while the CPU performs initialization computation for the i-th picture, the initialization data of the (i-1)-th picture is simultaneously transferred to GPU memory, and the GPU is at that moment processing feature detection and description for the (i-2)-th picture. Through this asynchronous pipeline cooperation mechanism, CPU computation, data transfer, and GPU computation proceed effectively in parallel, improving performance.
In addition, the invention makes full use of the remaining resources of the CPU, letting the spare cores independently complete part of the algorithm and thereby improve performance. Given the popularity of multi-core processors, the remaining computing resources of the CPU should be utilized as well. Assuming a 4-core CPU where one core controls the GPU and data transmission and another performs the CPU computation, the other two cores can independently complete the feature extraction computation for a portion of the pictures, as shown in FIG. 2.
Test results show that when the hardware is configured as an Intel Q8300 CPU and a GTX 260 GPU, the algorithm of the present invention runs at 172.33 frames per second, 67 times the speed of the serial version. When the hardware is configured as an Intel i7 CPU and a GTX 295 GPU, the speed reaches 340.47 frames per second, which satisfies real-time processing requirements.
Drawings
Fig. 1 is a schematic diagram of the fine-grained GPU implementation of the image feature extraction algorithm. Gray marks the parts computed on the GPU, and white the parts computed on the CPU. For each stage, the number in parentheses indicates with how many threads that stage is computed on the GPU; where square brackets appear, the number inside indicates into how many kernels the stage's computation is divided.
FIG. 2 is a diagram illustrating an asynchronous pipeline cooperation mechanism between a GPU and a CPU.
Fig. 3 is a performance test chart.
FIG. 4 shows the specific code of the buildDet method.
FIG. 5 shows part of the code of the buildDetKernel method.
FIG. 6 shows part of the code of the pipeline.
Detailed Description
The techniques of the present invention are described in detail below with reference to the figures and to source code from a program embodying the invention. The method uses the CUDA programming model to map the local feature extraction algorithm onto the GPU for fine-grained parallel computation, and further optimizes the GPU implementation through the characteristics of the GPU and the cooperative working mode of the GPU and CPU. The local feature retrieval algorithm selected by the invention is SURF (Speeded-Up Robust Features), currently the mainstream retrieval algorithm. We now describe the specific implementation of this technique in detail and test its performance.
(1) Fine-grained GPU implementation
The invention maps the two stages of feature detection and feature description in the image retrieval algorithm onto the GPU. As described above and shown in fig. 1, these two stages divide into several small steps. When computing feature values, each sampled pixel is the parallel granularity. When localizing feature points, every 8 sampled pixels form the parallel granularity. When computing the Haar wavelet transform, the 109 pixels around each feature point must be processed, and each of these surrounding pixels is one GPU thread. Next, the 109 surrounding points vote among the 42 feature direction regions of the feature point, so the feature direction stage takes each direction region as one GPU thread. Finally, when each feature point generates its vector, the vector is computed from the 4 × 4 region around the feature point, and each region is one GPU thread.
The specific code implementation uses the CUDA programming model; here only the feature value computation step of feature detection is taken as an example, the other steps being implemented similarly. Two main methods are involved in the feature value computation:
void buildDet(float *m_det, int width, int height);
__global__ void buildDetKernel(float *m_det, int o, int border, int step);
The buildDet method internally calls the buildDetKernel method. Because buildDetKernel carries the __global__ qualifier, it runs on the GPU. m_det is an array pointer used to store the feature values of the image; its array space must be allocated on the GPU, using the cudaMalloc method, before the method is called. Meanwhile, data such as the width and height of the image and the pixel values of the original integral image must also be transmitted to GPU memory for use in subsequent computation: simple parameters can be transferred with cudaMemcpyToSymbol, while larger arrays are first given GPU space with cudaMalloc and then copied from CPU memory to GPU memory with cudaMemcpy.
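A minimal sketch of this setup follows; the function and variable names (setupSurfGpu, h_integral, d_integral, d_width, d_height) are illustrative assumptions, not taken from the patent's figures:

#include <cuda_runtime.h>

__constant__ int d_width, d_height;   // simple parameters, set with cudaMemcpyToSymbol

float *m_det;        // feature values, filled by buildDetKernel on the GPU
float *d_integral;   // integral image pixels in GPU memory

void setupSurfGpu(const float *h_integral, int width, int height, int layers)
{
    // The feature-value array must exist on the GPU before buildDet is called.
    cudaMalloc((void **)&m_det, (size_t)layers * width * height * sizeof(float));

    // Simple parameters such as the image width and height go through cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(d_width, &width, sizeof(int));
    cudaMemcpyToSymbol(d_height, &height, sizeof(int));

    // Larger arrays such as the integral image: cudaMalloc first, then cudaMemcpy.
    size_t bytes = (size_t)width * height * sizeof(float);
    cudaMalloc((void **)&d_integral, bytes);
    cudaMemcpy(d_integral, h_integral, bytes, cudaMemcpyHostToDevice);
}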
The specific code of the buildDet method is shown in FIG. 4, where the statements needing special attention are underlined. For each octave of the image, the feature values of the sampled pixels (selected at interval step) must be computed; the actual computation of the feature values is in the buildDetKernel method. We divide the original image into small image blocks of width × height BLOCK_W × BLOCK_H; our default width and height parameters are 16 × 8. Each image block has 4 intervals, so each block contains 16 × 8 × 4 sampling pixels. Each image block corresponds to the concept of a Block in CUDA GPU programming, and GPU threads in the same Block share GPU shared memory. The threads() dimensions define each of the 16 × 8 × 4 pixels in an image block as an independent thread; that is, the feature value computation really does use sampled pixels as its parallel granularity. In total there are ((ws + BLOCK_W - 1) / BLOCK_W) * ((hs + BLOCK_H - 1) / BLOCK_H) Blocks, where ws and hs are the image width and height with the border and unsampled points removed. With the data divided this way, the buildDetKernel method can be called.
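Since FIG. 4 itself is a figure, the following is a hedged reconstruction of the data division it describes; the octave count and the per-octave step and border formulas are assumptions:

#define BLOCK_W   16   // default image block width
#define BLOCK_H   8    // default image block height
#define INTERVALS 4    // intervals per image block
#define OCTAVES   4    // o octaves; value assumed

__global__ void buildDetKernel(float *m_det, int o, int border, int step);

void buildDet(float *m_det, int width, int height)
{
    // One kernel launch per octave, so the kernel body needs no octave branches.
    for (int o = 0; o < OCTAVES; ++o) {
        int step   = 1 << (o + 1);   // sampling interval for this octave (assumed)
        int border = 16 << o;        // border excluded from sampling (assumed)
        int ws = (width  - 2 * border) / step;   // sampled width, border and skipped points removed
        int hs = (height - 2 * border) / step;   // sampled height, border and skipped points removed

        // 16 x 8 x 4 sampling pixels per image block, one GPU thread each.
        dim3 threads(BLOCK_W, BLOCK_H, INTERVALS);
        dim3 blocks((ws + BLOCK_W - 1) / BLOCK_W,
                    (hs + BLOCK_H - 1) / BLOCK_H);
        buildDetKernel<<<blocks, threads>>>(m_det, o, border, step);
    }
}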
The computation in the buildDetKernel method runs on the GPU, and every GPU thread runs the same code; only the input data of each computation differ. Part of the code of the buildDetKernel method is shown in FIG. 5. At the start, the method must determine which pixel of the image this thread computes: the column c, the row r, and the interval identifier i.
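Likewise, a hedged sketch of the index computation just described (the Hessian box-filter body is omitted and the feature-value layout is an assumption; d_width and d_height are the constant-memory parameters from the setup sketch above):

__device__ float hessianDet(int x, int y, int o, int i)
{
    // The box-filter responses Dxx, Dyy, Dxy over the integral image would be
    // computed here; omitted in this sketch.
    return 0.0f;
}

__global__ void buildDetKernel(float *m_det, int o, int border, int step)
{
    // Sampled image size for this octave.
    int ws = (d_width  - 2 * border) / step;
    int hs = (d_height - 2 * border) / step;

    // Each thread first works out which sampled pixel it computes.
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // column of the sampled pixel
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // row of the sampled pixel
    int i = threadIdx.z;                            // interval identifier in the octave

    if (c >= ws || r >= hs) return;                 // guard the ragged edge of the grid

    int x = border + c * step;   // back to original image coordinates
    int y = border + r * step;

    // Store this pixel's feature value; (octave, interval, row, column) layout assumed.
    m_det[((o * blockDim.z + i) * hs + r) * ws + c] = hessianDet(x, y, o, i);
}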
(2) Memory optimization using GPU characteristics
On top of the fine-grained parallel GPU version, the algorithm is further optimized using the memory characteristics of the GPU. The GPU memory optimization has two main aspects: first, GPU texture memory provides support for two-dimensional locality, so memory accesses can be optimized according to the two-dimensional locality present in the algorithm; second, the overhead of memory allocation and release during image processing is reduced.
GPU texture memory provides hardware support for two-dimensional and three-dimensional locality. The SURF algorithm contains a large number of fast approximate integral computations with obvious two-dimensional locality, so using the GPU texture memory brings a good performance improvement for these two-dimensional memory accesses. In CUDA, the cudaBindTexture2D method binds a specific two-dimensional array to texture memory. The invention binds the integral image array to 2D texture memory, thereby exploiting the locality support of the 2D texture memory.
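A minimal sketch of the binding, assuming a pitched allocation for the integral image (the texture reference name and helper are illustrative; cudaBindTexture2D is the CUDA runtime call named in the text):

// Texture reference for the integral image; CUDA requires file scope.
texture<float, cudaTextureType2D, cudaReadModeElementType> texIntegral;

void bindIntegralTexture(int width, int height, float **d_integral, size_t *pitch)
{
    // A pitched allocation keeps rows aligned for 2D texture fetches.
    cudaMallocPitch((void **)d_integral, pitch, width * sizeof(float), height);

    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaBindTexture2D(NULL, texIntegral, *d_integral, desc, width, height, *pitch);
}

// Inside a kernel, tex2D(texIntegral, x, y) then replaces the plain global-memory
// load, and the neighbouring pixels land in the texture cache along with it.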
Meanwhile, when the algorithm stores variables such as the original image, the integral image, and the feature values, memory is allocated and released at the start and end of processing each image. When pictures are processed in batches, such repeated allocation and release of GPU memory is redundant and hurts performance. The invention allocates fixed memory at the initial stage of the program, thereby eliminating the redundant allocation and release.
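The fixed-allocation idea can be sketched as follows (the structure and sizes are assumptions): every per-image buffer is allocated once at program start, sized for the largest expected picture, and reused for the whole batch.

struct SurfBuffers {
    float *d_integral;   // integral image
    float *d_det;        // feature values
};

void initBuffers(SurfBuffers *b, int maxW, int maxH, int layers)
{
    // Called once at program start; per-picture cudaMalloc/cudaFree disappears.
    cudaMalloc((void **)&b->d_integral, (size_t)maxW * maxH * sizeof(float));
    cudaMalloc((void **)&b->d_det, (size_t)layers * maxW * maxH * sizeof(float));
}

void releaseBuffers(SurfBuffers *b)
{
    // Called once at program exit.
    cudaFree(b->d_integral);
    cudaFree(b->d_det);
}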
(3) CPU and GPU cooperation mechanism
Although the GPU has strong computing capability, it still needs auxiliary control from the CPU during computation, so how to make the CPU and GPU cooperate most efficiently becomes an important factor affecting performance. The CPU-GPU cooperation mechanism of the invention has two main aspects, as shown in FIG. 2: (1) making the CPU and GPU work in parallel through an asynchronous pipeline; (2) fully exploiting the remaining computing capacity of the CPU.
In the pipeline mode, the whole algorithm is divided into two parts: the CPU handles the computation of the first part, the GPU handles the computation of the second part, and data is transmitted between them in a stream, so that the two pieces of hardware work in parallel. That is, while the CPU computes the next picture, the GPU can simultaneously process the previous picture, achieving parallel computation. In the pipeline implementation, the CPU mainly performs the initialization and loading work for an image, while the GPU performs the feature detection and description computations. Part of the pipeline code is shown in FIG. 6, where:
The left figure shows part of the code of the CPU picture-loading thread, and the right figure part of the code of the GPU control thread. index indicates which CPU loading thread or GPU control thread this is; each CPU loading thread corresponds to one GPU control thread (each GPU control thread drives the computation of one GPU). id is the identification number of the thread, and GPUNUM is the number of GPU threads; in our implementation, the first GPUNUM threads are CPU loading threads and the following GPUNUM threads are GPU control threads. Each thread traverses the pictures, and multiple loading threads or multiple GPU control threads visit them in round-robin fashion: taking two loading threads as an example, the first thread accesses the odd-numbered pictures and the second the even-numbered ones. In the loading thread, int_img is the image data structure after loading completes, containing the width and height of the image and the value of each pixel. The pixel values form a large array, and reallocating memory for every picture would hurt performance, so during memory optimization a cache of IMGBUFSIZE picture-sized arrays is allocated in advance and recycled; whenever a new picture data structure is created, one array from the cache is assigned to its int_img and cleared, reducing memory-allocation time. After loading finishes, the pointer to int_img is placed into the imgs_g array and the corresponding flag_g is set to 1; the imgs_g and flag_g arrays both have size IMGSIZE and can be accessed by the GPU control thread. The GPU control thread visits the pictures in order: while the flag of the i-th picture is not yet 1 it waits, and once the flag becomes 1, meaning the picture has finished loading, it calls the CUDA method DetDes(), the picture data to be computed already being in imgs_g[i]. Each time a picture is loaded, loadnumber[index]++ is executed; each time the GPU finishes a picture, processed[index]++. When loadnumber[index] - processed[index] exceeds the cache size, the loading thread is running ahead and must wait for the GPU computation.
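Since FIG. 6 itself is a figure, the following is a hedged sketch of the two thread bodies just described, showing the round-robin traversal, the flag_g handshake, and the loadnumber/processed counters; the helper functions, the image structure, and all sizes are assumptions, not the patent's code.

#include <pthread.h>
#include <stddef.h>

#define IMGSIZE    4096   /* capacity of imgs_g and flag_g (assumed) */
#define IMGBUFSIZE 8      /* pre-allocated picture buffers (assumed) */
#define GPUNUM     1      /* number of loader/GPU-control pairs */

typedef struct { int width, height; float *pixels; } IntImg;   /* stands in for int_img */

IntImg       *imgs_g[IMGSIZE];     /* loaded pictures, indexed by picture number */
volatile int  flag_g[IMGSIZE];     /* 1 once imgs_g[i] is ready for the GPU */
volatile int  loadnumber[GPUNUM], processed[GPUNUM];
int           totalPics;           /* set during start-up */

extern IntImg *loadAndIntegrate(int picture);   /* assumed: load + gray + integral image */
extern void    DetDes(IntImg *img);             /* CUDA feature detection + description */

void *loadThread(void *arg)        /* CPU loading thread */
{
    int index = (int)(size_t)arg;
    for (int i = index; i < totalPics; i += GPUNUM) {   /* round-robin over pictures */
        while (loadnumber[index] - processed[index] >= IMGBUFSIZE)
            ;                      /* loader is ahead: wait for the GPU to catch up */
        imgs_g[i] = loadAndIntegrate(i);
        flag_g[i] = 1;             /* publish: picture i has finished loading */
        loadnumber[index]++;
    }
    return NULL;
}

void *gpuControlThread(void *arg)  /* drives the computation of one GPU */
{
    int index = (int)(size_t)arg;
    for (int i = index; i < totalPics; i += GPUNUM) {
        while (!flag_g[i])
            ;                      /* picture i not loaded yet: wait */
        DetDes(imgs_g[i]);         /* the data to compute is already in imgs_g[i] */
        processed[index]++;
    }
    return NULL;
}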
Because the GPU supports DMA asynchronous transfer, combining it with the pipeline allows the data transfer time to be overlapped as well, forming the three-stage pipeline of CPU computation, data transfer, and GPU computation. Concretely, while the CPU performs initialization computation for the i-th picture, the initialization data of the (i-1)-th picture is simultaneously transferred to GPU memory, and the GPU is at that moment processing feature detection and description for the (i-2)-th picture. In the specific embodiment, as in the code above, the GPU control thread may invoke the CUDA method to compute the i-th picture only when both the i-th and (i+1)-th pictures have been loaded. Inside the DetDes() method, the cudaMemcpy2DAsync() method transfers the data of the (i+1)-th picture asynchronously, so that the data transfer time overlaps with the GPU computation time.
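A sketch of how the asynchronous staging inside DetDes() might be issued (stream handling, buffer names, and pitches are assumptions; for the copy to be truly asynchronous the host buffer must be page-locked, e.g. allocated with cudaMallocHost):

void stageNextPicture(const IntImg *next, float *d_integralNext,
                      size_t devPitch, cudaStream_t copyStream)
{
    // Upload the (i+1)-th picture without blocking: the DMA engine runs this
    // copy while the default stream executes kernels for the i-th picture.
    size_t rowBytes = next->width * sizeof(float);
    cudaMemcpy2DAsync(d_integralNext, devPitch,
                      next->pixels, rowBytes,   /* source pitch: packed rows */
                      rowBytes, next->height,
                      cudaMemcpyHostToDevice, copyStream);

    /* ... feature detection/description kernels for the current picture are
       launched here in the default stream, overlapping with the copy ... */

    cudaStreamSynchronize(copyStream);   /* picture i+1 must be resident before its turn */
}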
In addition, the invention makes full use of the remaining resources of the CPU, letting the spare cores independently complete part of the algorithm and thereby improve performance. Assuming a 4-core CPU where one core controls the GPU and data transmission and another performs the CPU computation, the other two cores can independently complete the feature extraction computation for a portion of the pictures, as shown in FIG. 2.
(4) Performance testing
The invention also includes detailed test results. The test host has an Intel Q8300 4-core CPU and 2 GB of memory. The GPU used for testing is a GeForce GTX 260 with 27 SMs (216 cores in total), a clock rate of 1.24 GHz, and 1 GB of video memory. The host operating system is Ubuntu 8.10 (Linux kernel version 2.6.27-7-generic).
As shown in fig. 3, when the CPU executes serially, the local image feature extraction algorithm needs 393 milliseconds per picture, i.e., 2.56 images are processed per second, far below real-time processing speed. Running the SURF algorithm in parallel on the CPU (parallelized at the fastest block granularity) raises the speed to 14.86 frames per second. The fine-grained GPU implementation improves the algorithm to 74.45 frames per second, 5 times the speed of the parallel CPU version. With GPU memory optimization and cooperative work with the CPU, the speed rises to 84.56 and then 172.33 frames per second, which reaches real-time processing speed.
Further testing with a hardware configuration of an Intel i7 CPU and a GTX 295 GPU (480 cores) showed the local feature extraction algorithm reaching up to 340.47 frames per second.

Claims (4)

1. A GPU-based image feature extraction algorithm acceleration method is characterized by comprising the following steps:
firstly, implementing the parallel image feature extraction algorithm on a GPU at fine granularity, where fine granularity means that data parallelism is exploited at the smallest granularity in each stage of the local feature extraction algorithm, i.e., per feature point; using the CUDA programming model to map the fine-grained data onto the GPU for parallel computation; the local feature retrieval algorithm adopted is the retrieval algorithm SURF;
the local feature retrieval algorithm detects image features and describes the features, and is divided into three stages: image initialization, feature detection and feature description; two parts of feature detection and feature description are put on a GPU for calculation;
the image initialization comprises three steps of loading an image, calculating a gray image and calculating an integral image;
the feature detection detects feature points of the image by using the integral image; first, the feature value of each point is calculated; when this stage is mapped onto the GPU, the feature value calculations of individual points are mutually independent, and each sampled pixel is one GPU thread; specifically, o kernel loop calculations are used to avoid branch calculations on the GPU, where o is the number of image layers in the SURF algorithm; when the feature value of each point has been calculated, the feature point localization stage begins; in this stage, the point with the largest feature value among every 8 neighboring points is selected as a candidate feature point, and each group of 8 points is calculated by one GPU thread;
the feature description uses a specific data structure to describe the found feature points so as to facilitate subsequent processing of the image; first, the Haar wavelet transform is performed on the 109 pixels around each feature point, each surrounding pixel being one GPU thread at this stage; then, the 109 surrounding points vote for the 42 feature direction regions of the feature point, the direction with the most votes being the direction of the feature point, each feature region being one GPU thread at this stage; finally, each feature point generates a 64-dimensional feature vector, computed from the 4 × 4 region around the feature point, so each feature point is divided among 16 GPU threads for calculation.
2. The GPU-based image feature extraction algorithm acceleration method according to claim 1, characterized in that: the characteristics of the GPU memory for 2-dimensional data, namely the GPU texture memory, are used to improve algorithm performance, and the algorithm's repeated allocation and release of memory are reduced as much as possible; in particular, the two-dimensional array involved in the SURF algorithm is bound to GPU texture memory using the cudaBindTexture2D method in CUDA;
meanwhile, when the original image, the integral image, the characteristic value and other variables are stored in the SURF algorithm, memory allocation and release are carried out at the beginning and the end of processing each image; when processing pictures in batch, a fixed memory is allocated at the initial stage of a program, thereby reducing redundant memory allocation and release.
3. The method for accelerating GPU-based image feature extraction algorithm according to claim 2, characterized in that: enabling the CPU and the GPU to work cooperatively in an asynchronous pipeline mode;
the pipeline mode is that the whole algorithm is divided into two parts, the CPU processes the calculation of the first part, the GPU processes the calculation of the second part, and data is transmitted between the CPU and the GPU in a stream mode to realize the parallel work of the two pieces of hardware; the CPU is specially used for calculating initialization data and transmitting the data into a GPU memory, and the GPU reads out data such as integral images and the like from the memory and carries out feature detection and feature description calculation;
meanwhile, due to the characteristic that the GPU supports DMA asynchronous transmission and is combined with a pipeline, the time of a data transmission part is further overlapped, and the pipeline of three stages of CPU calculation, data transmission and GPU calculation is formed.
4. A method for accelerating GPU-based image feature extraction algorithms according to claim 3, characterized by: and residual resources of the CPU are also utilized, and redundant cores of the CPU are enabled to independently complete the processing of part of algorithms.
CN201510915260.6A 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm Pending CN105550974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510915260.6A CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510915260.6A CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Publications (1)

Publication Number Publication Date
CN105550974A true CN105550974A (en) 2016-05-04

Family

ID=55830150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510915260.6A Pending CN105550974A (en) 2015-12-13 2015-12-13 GPU-based acceleration method of image feature extraction algorithm

Country Status (1)

Country Link
CN (1) CN105550974A (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719275A (en) * 2009-11-23 2010-06-02 中国科学院计算技术研究所 Image feature point extracting and realizing method, image copying and detecting method and system thereof
US20140225902A1 (en) * 2013-02-11 2014-08-14 Nvidia Corporation Image pyramid processor and method of multi-resolution image processing
CN103530224A (en) * 2013-06-26 2014-01-22 郑州大学 Harris corner detecting software system based on GPU
CN105069743A (en) * 2015-07-28 2015-11-18 中国科学院长春光学精密机械与物理研究所 Detector splicing real-time image registration method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张舒 et al.: "GPU High-Performance Computing with CUDA" (《GPU高性能运算之CUDA》), China Water & Power Press, 31 December 2009 *
王志国: "Research and Implementation of GPU Acceleration of the Local Feature Algorithm SURF" (《局部特征算法SURF的GPU加速研究与实现》), China Master's Theses Full-text Database, Information Science and Technology *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106067158B (en) * 2016-05-26 2019-09-06 东方网力科技股份有限公司 A kind of feature comparison method and device based on GPU
CN106067158A (en) * 2016-05-26 2016-11-02 东方网力科技股份有限公司 A kind of feature comparison method based on GPU and device
WO2018000724A1 (en) * 2016-06-28 2018-01-04 北京大学深圳研究生院 Cdvs extraction process acceleration method based on gpgpu platform
US10643299B2 (en) 2016-06-28 2020-05-05 Peking University Shenzhen Graduate School Method for accelerating a CDVS extraction process based on a GPGPU platform
CN107122244A (en) * 2017-04-25 2017-09-01 华中科技大学 A kind of diagram data processing system and method based on many GPU
CN107122244B (en) * 2017-04-25 2020-02-14 华中科技大学 Multi-GPU-based graph data processing system and method
CN107809643B (en) * 2017-11-13 2020-11-20 苏州浪潮智能科技有限公司 Image decoding method, device and medium
CN107809643A (en) * 2017-11-13 2018-03-16 郑州云海信息技术有限公司 A kind of coding/decoding method of image, device and medium
WO2020000383A1 (en) * 2018-06-29 2020-01-02 Baidu.Com Times Technology (Beijing) Co., Ltd. Systems and methods for low-power, real-time object detection
CN111066058B (en) * 2018-06-29 2024-04-16 百度时代网络技术(北京)有限公司 System and method for low power real-time object detection
CN111066058A (en) * 2018-06-29 2020-04-24 百度时代网络技术(北京)有限公司 System and method for low power real-time object detection
US11741568B2 (en) 2018-06-29 2023-08-29 Baidu Usa Llc Systems and methods for low-power, real-time object detection
CN110414534A (en) * 2019-07-01 2019-11-05 深圳前海达闼云端智能科技有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN110414534B (en) * 2019-07-01 2021-12-03 达闼机器人有限公司 Image feature extraction method and device, storage medium and electronic equipment
CN111024078B (en) * 2019-11-05 2021-03-16 广东工业大学 Unmanned aerial vehicle vision SLAM method based on GPU acceleration
CN111024078A (en) * 2019-11-05 2020-04-17 广东工业大学 Unmanned aerial vehicle vision SLAM method based on GPU acceleration
CN111124920A (en) * 2019-12-24 2020-05-08 北京金山安全软件有限公司 Equipment performance testing method and device and electronic equipment
CN111462060A (en) * 2020-03-24 2020-07-28 湖南大学 Method and device for detecting standard section image in fetal ultrasonic image
CN116739884A (en) * 2023-08-16 2023-09-12 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU
CN116739884B (en) * 2023-08-16 2023-11-03 北京蓝耘科技股份有限公司 Calculation method based on cooperation of CPU and GPU

Similar Documents

Publication Publication Date Title
CN105550974A (en) GPU-based acceleration method of image feature extraction algorithm
CN111932435B (en) Optimized computing hardware for machine learning operations
US20240086693A1 (en) Methods and systems for budgeted and simplified training of deep neural networks
US11763140B2 (en) Smart memory handling and data management for machine learning networks
Possa et al. A multi-resolution FPGA-based architecture for real-time edge and corner detection
CN105404889B (en) Method and apparatus for implementing nearest neighbor search on a Graphics Processing Unit (GPU)
JP2022523760A (en) Sithrick decomposition within the matrix accelerator architecture
US20130243329A1 (en) Parallel object detection method for heterogeneous multithreaded microarchitectures
EP3399414A2 (en) Intelligent thread dispatch and vectorization of atomic operations
US20190066256A1 (en) Specialized code paths in gpu processing
US12094048B2 (en) Multi-tile graphics processor rendering
CN106095588B (en) CDVS extraction process accelerated method based on GPGPU platform
Du et al. Interactive ray tracing on reconfigurable SIMD MorphoSys
CN113892116A (en) Adaptive deep learning model for noise image super-resolution
Acharya et al. A real-time implementation of SIFT using GPU
Takizawa et al. Hierarchical parallel processing of large scale data clustering on a PC cluster with GPU co-processing
DE112020000854T5 (en) THREAD GROUP PLANNING FOR GRAPHICS PROCESSING
WO2017107118A1 (en) Facilitating efficient communication and data processing across clusters of computing machines in heterogeneous computing environment
CN107077833A (en) The apparatus and method that efficient frame for the framework that finally sorts is utilized to frame coherence
JP2021077342A (en) Dynamically dividing activation and kernels for improving memory efficiency
DE112020000902T5 (en) PRE-CALL DATA FOR GRAPHIC DATA PROCESSING
Wang et al. A CUDA-enabled parallel algorithm for accelerating retinex
CN112258378A (en) Real-time three-dimensional measurement system and method based on GPU acceleration
Fresse et al. GPU architecture evaluation for multispectral and hyperspectral image analysis
WO2017164965A1 (en) System characterization and configuration distribution for facilitating improved performance at computing devices

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160504

WD01 Invention patent application deemed withdrawn after publication