CN115641264A - Image preprocessing method and image preprocessing system based on CUDA


Info

Publication number
CN115641264A
CN115641264A (application CN202211386248.7A)
Authority
CN
China
Prior art keywords
image
pixel
pixel value
cpu
cuda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211386248.7A
Other languages
Chinese (zh)
Inventor
于诗梦
徐高伟
王逸平
董树才
王鑫琛
吴建康
邢少杰
陈大宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Smart Motor Shanghai Robot Technology Co ltd
Zhejiang Zhima Intelligent Technology Co Ltd
Original Assignee
Smart Motor Shanghai Robot Technology Co ltd
Zhejiang Zhima Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smart Motor Shanghai Robot Technology Co ltd, Zhejiang Zhima Intelligent Technology Co Ltd filed Critical Smart Motor Shanghai Robot Technology Co ltd
Priority to CN202211386248.7A priority Critical patent/CN115641264A/en
Publication of CN115641264A publication Critical patent/CN115641264A/en
Pending legal-status Critical Current

Landscapes

  • Image Processing (AREA)

Abstract

The invention provides a CUDA-based image preprocessing method and system, relating to the technical field of image processing. First, a single-frame image and the pixel value data of all pixels in the single-frame image, read by a CPU (central processing unit), are acquired; the single-frame image is then scaled to obtain a scaled image; finally, a pre-written CUDA kernel function is called to compute the pixel value of each pixel in the scaled image in parallel, based on the pixel value data of each pixel read by the CPU. Because every step of the image preprocessing stage calls the CUDA kernel function to compute all pixel values in parallel on the GPU, the computation speed of the preprocessing stage is greatly increased compared with the traditional approach of computing pixel values serially, one pixel at a time, and the overall computation time is greatly reduced.

Description

Image preprocessing method and image preprocessing system based on CUDA
Technical Field
The invention relates to the technical field of image processing, in particular to a CUDA-based image preprocessing method and system.
Background
The traditional lane line detection image preprocessing method can meet the input requirements of a neural network in deep learning, but its image processing speed is low and it is time-consuming.
The image processing is slow because, for a single input image, the image-size scaling step of preprocessing requires the CPU to compute the new pixel value of every pixel serially, so the calculation is slow. In model training the training data set is often large, which greatly increases the total training time; the trained model may also lag during real-time lane line detection.
Disclosure of Invention
An object of the first aspect of the present invention is to provide an image preprocessing method based on CUDA (compute unified device architecture), applied to the GPU (graphics processing unit) side, which solves the technical problems of low image processing speed and high time consumption in the prior art.
An object of the second aspect of the present invention is to provide a CUDA-based image preprocessing method applied to the CPU side.
An object of a third aspect of the present invention is to provide a CUDA-based image preprocessing system.
According to the object of the first aspect of the present invention, the present invention provides a CUDA-based image preprocessing method, applied to the GPU side, the image preprocessing method comprising the steps of:
acquiring a single-frame image and pixel value data of all pixels in the single-frame image read by a CPU (central processing unit);
performing a scaling operation on the single-frame image to obtain a scaled image;
calling a pre-written CUDA kernel function so as to calculate, in parallel, the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU.
Optionally, after calling the CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU, the method further comprises the following steps:
performing normalization and centering calculations on the pixel value of each pixel in the scaled image, respectively, to obtain preprocessed image data.
Optionally, after the normalizing and the centering calculation are performed on the pixel value of each pixel in the scaled image to obtain the pre-processed image data, the method further includes the following steps:
and transmitting the preprocessed image data to a CPU.
Optionally, the method further comprises the following steps: allocating memory on the GPU according to the single-frame image;
after the preprocessed image data is transmitted to the CPU, the method further comprises the following steps:
and releasing the memory allocated on the GPU.
Optionally, the kernel function starts all thread blocks simultaneously when executing on the GPU; each thread block is composed of a plurality of threads, and each thread performs the pixel value calculation of one pixel in the scaled image.
Optionally, the number of all the thread blocks is m × n, where m represents a quotient of a width of the single frame image and a number of columns of threads included in the thread block, and n represents a quotient of a height of the single frame image and a number of rows of threads included in the thread block.
According to the object of the second aspect of the present invention, the present invention further provides a CUDA-based image preprocessing method applied to the CPU, the image preprocessing method comprising the steps of:
acquiring a single-frame image;
reading pixel value data of all pixels in the single-frame image;
receiving the preprocessed image data obtained by the image preprocessing method.
Optionally, the method further comprises the following steps: allocating memory on the CPU according to the single-frame image;
after receiving the preprocessed image data, the method further comprises the following steps: and releasing the memory allocated on the CPU.
According to an object of a third aspect of the present invention, the present invention further provides a CUDA-based image preprocessing system, comprising:
a CPU configured to acquire a single frame image, read pixel value data of all pixels in the single frame image, and receive preprocessed image data obtained by the above image preprocessing method;
the GPU is connected with the CPU and has a pre-written CUDA kernel function, and the GPU is configured to acquire a single-frame image and the pixel value data of all pixels in the single-frame image read by the CPU; then perform a scaling operation on the single-frame image to obtain a scaled image; and then call the CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU.
Optionally, the GPU is further configured to perform normalization and centering calculation on the pixel values of the pixels in the scaled image, respectively, to obtain preprocessed image data.
In the invention, a CUDA kernel function is written in advance. The image preprocessing method first acquires a single-frame image and the pixel value data of all pixels in the single-frame image read by a CPU; then performs a scaling operation on the single-frame image to obtain a scaled image; and finally calls the pre-written CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU. Because each step of the image preprocessing stage calls the CUDA kernel function to compute all pixel values in parallel on the GPU, the computation speed of the preprocessing stage is greatly increased compared with the traditional approach of computing pixel values serially, one pixel at a time, and the overall computation time is greatly reduced.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the invention will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow diagram of a CUDA-based image pre-processing method according to one embodiment of the invention;
FIG. 2 is a schematic flow diagram of a CUDA-based image pre-processing method according to another embodiment of the present invention;
FIG. 3 is a schematic diagram of a bilinear interpolation algorithm in a CUDA-based image preprocessing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of thread blocks in a CUDA-based image processing method according to one embodiment of the invention;
fig. 5 is a connection block diagram of a CUDA-based image preprocessing system according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
Fig. 1 is a schematic flow diagram of a CUDA-based image preprocessing method according to an embodiment of the present invention. As shown in fig. 1, in a specific embodiment, the CUDA-based image preprocessing method is applied to a GPU side, the GPU has a CUDA kernel programmed in advance, and the image preprocessing method includes the following steps:
step S100, acquiring a single frame image and pixel value data of all pixels in the single frame image read by a CPU;
s200, carrying out zooming operation on the single-frame image to obtain a zoomed image;
and step S300, calling a pre-programmed CUDA kernel function to calculate the pixel value size of each pixel in the zoomed image in parallel according to the pixel value data of each pixel read by the CPU by using the CUDA kernel function. The CUDA kernel function is written in advance, and the CUDA kernel function can realize preprocessing calculation of a single pixel. When the GPU executes the image preprocessing method, the CUDA kernel may be directly called in response to an operation by the user.
In this embodiment, each step of the image preprocessing stage calls the CUDA kernel function to compute all pixel values in parallel on the GPU, so that, compared with the traditional approach of computing pixel values serially, one pixel at a time, the computation speed of the image preprocessing stage is greatly increased and the overall computation time is greatly reduced.
Fig. 2 is a schematic flow chart of a CUDA-based image preprocessing method according to another embodiment of the present invention. As shown in fig. 2, in this embodiment, after step S300, the following steps are further included:
step S400, normalizing and centering the pixel values of the pixels in the scaled image to obtain the preprocessed image data.
After step S400, the following steps are also included:
step S500, the preprocessed image data is transmitted to the CPU.
In particular, when the deep learning method is used for lane line detection, a preprocessing operation is usually required for a lane line picture. The step of lane line detection image preprocessing includes size scaling, normalization and centering of the image.
Scaling of the image refers to reducing or enlarging the unprocessed image to be detected to a uniform size. This is necessary because a neural network requires the input data in each batch to have a uniform size, so when lane line detection is performed with a deep learning method, all images to be trained on and detected must be adjusted to a uniform size. After the image size changes, the number and positions of the pixels change, so the value of each pixel must be recalculated; that is, the values of the three RGB channels of each pixel must be computed with an interpolation algorithm. Interpolation algorithms commonly used in image scaling include nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation.
Normalization of the image refers to dividing all pixel values in the image by 255. The original pixel values of the image are uint8 data in the range 0-255; converting them to double-precision data and dividing by 255 maps the original 0-255 values to doubles between 0.0 and 1.0. This is done because, if the original 0-255 values were kept and passed as input to the neural network, the large input values would produce large gradients at the input layer during backpropagation; with large gradients the learning rate of the neural network must be very small, otherwise the local optimum is overshot. Converting the input to decimals between 0.0 and 1.0 reduces the magnitude of the model input, so the learning rate need not be extremely small, which accelerates the convergence of the model.
Centering of the image refers to subtracting, from each pixel value of a channel, the mean of that channel over the entire data set. The purpose is to make the average pixel value of the image on each channel zero, i.e. to zero-center the input data. In most cases the illumination of an image has no bearing on recognizing the objects in it: whether for human eyes or for a computer, image information is usually obtained from the relative color differences between pixels. Centering removes the pixel mean, i.e. the average brightness of the image, so that the neural network model can better capture the information in the image.
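The normalization and centering described above can be sketched as a NumPy reference on the CPU side (on the GPU, each CUDA thread would apply the same arithmetic to a single pixel; the function name and the example channel means below are ours, for illustration only):

```python
import numpy as np

def normalize_and_center(image_u8, channel_means):
    """Map uint8 pixel values (0-255) to [0.0, 1.0], then subtract the
    per-channel data-set mean so every channel is zero-centered."""
    normalized = image_u8.astype(np.float64) / 255.0  # uint8 -> double in 0.0-1.0
    return normalized - np.asarray(channel_means, dtype=np.float64)

# Hypothetical 2 x 2 RGB image and data-set means, for illustration.
img = np.array([[[255, 0, 128], [0, 255, 64]],
                [[32, 16, 8], [255, 255, 255]]], dtype=np.uint8)
out = normalize_and_center(img, channel_means=[0.5, 0.5, 0.5])
```

The output has zero mean per channel only if the supplied means really are statistics of the whole data set; the 0.5 per channel used here is an arbitrary placeholder.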
In this embodiment, the image preprocessing method further includes the steps of: allocating memory on a GPU according to the single-frame image;
after the image data after the preprocessing is transmitted to the CPU, the method also comprises the following steps:
and releasing the memory allocated on the GPU.
This embodiment also requires the release of allocated memory on the CPU when the memory allocated on the GPU is released. The embodiment releases the allocated memory in time after the image preprocessing is finished, thereby avoiding memory leakage.
In this embodiment, a bilinear interpolation algorithm is used to calculate the pixel value of each pixel in the scaled image.
The functions to be implemented by the CUDA kernel function in this embodiment are the parallel computation of pixel values after image size scaling, pixel value normalization, and pixel value centering. Pixel positions change after the image size changes, and the value of each pixel of the new, resized image is computed by interpolating the values of the original pixels. Common interpolation algorithms for computing pixel values after image scaling are nearest neighbor interpolation, bilinear interpolation, and bicubic interpolation. Nearest neighbor interpolation is the simplest to implement, has the smallest computation load, and runs fastest, but it loses the most image information, so the quality of the resized image drops sharply, with mosaic and aliasing artifacts. Bilinear interpolation is more complex than nearest neighbor interpolation, with a slightly larger computation load and slightly longer run time, but the quality of the resized image is higher and the result is better. Bicubic interpolation gives the best result of the three scaling algorithms and its advantage is most obvious when scaling images, but it is the most complex, so its computation load is large and its run time is long. Moreover, in lane line detection the preprocessing usually shrinks the lane line image, in which case bicubic interpolation has no clear advantage over bilinear interpolation while its run time is clearly longer.
The algorithm used in this embodiment to calculate the scaled image pixel values is therefore bilinear interpolation. In other embodiments, other interpolation algorithms may be substituted depending on processing speed and accuracy requirements.
Fig. 3 is a schematic diagram of the bilinear interpolation algorithm in a CUDA-based image preprocessing method according to an embodiment of the present invention. As shown in fig. 3, (x0, y0), (x0, y1), (x1, y0) and (x1, y1) are pixel coordinate points on the original image, with corresponding pixel values f(x0, y0), f(x0, y1), f(x1, y0) and f(x1, y1). (x', y') is the pixel coordinate point after the size transformation; its pixel value f(x', y') is unknown and is the new pixel value to be found.
First, one-dimensional linear interpolation along the x axis between the pixel values f(x0, y0) and f(x1, y0) at (x0, y0) and (x1, y0) gives the pixel value f(x', y0) at (x', y0); likewise, one-dimensional linear interpolation along the x axis between f(x0, y1) and f(x1, y1) at (x0, y1) and (x1, y1) gives the pixel value f(x', y1) at (x', y1). Then one-dimensional linear interpolation along the y axis between f(x', y0) and f(x', y1) finally yields the pixel value f(x', y') at the sought point (x', y'), i.e. the value of the new pixel coordinate point after the size transformation.
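The two x-axis interpolations followed by one y-axis interpolation can be written out as a small Python reference (this mirrors the per-pixel arithmetic a CUDA thread would perform; the function signature is ours):

```python
def bilinear_interpolate(f00, f10, f01, f11, x0, x1, y0, y1, x, y):
    """Interpolated pixel value at (x, y), given the four surrounding
    original pixel values f00 = f(x0, y0), f10 = f(x1, y0),
    f01 = f(x0, y1), f11 = f(x1, y1)."""
    tx = (x - x0) / (x1 - x0)  # fractional position along the x axis
    ty = (y - y0) / (y1 - y0)  # fractional position along the y axis
    f_x_y0 = f00 + tx * (f10 - f00)  # 1-D interpolation in row y0 -> f(x', y0)
    f_x_y1 = f01 + tx * (f11 - f01)  # 1-D interpolation in row y1 -> f(x', y1)
    return f_x_y0 + ty * (f_x_y1 - f_x_y0)  # 1-D interpolation along y -> f(x', y')
```

In the full preprocessing this is evaluated once per RGB channel of every output pixel.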
Fig. 4 is a schematic diagram of thread blocks in a CUDA-based image processing method according to an embodiment of the present invention. As shown in fig. 4, in this embodiment the kernel function starts all thread blocks simultaneously when executing on the GPU; each thread block is composed of a plurality of threads, and each thread performs the pixel value calculation of one pixel in the scaled image.
Specifically, the number of all thread blocks is m × n, where m represents the quotient of the width of the single frame image and the number of columns of threads contained in the thread block, and n represents the quotient of the height of the single frame image and the number of rows of threads contained in the thread block.
The CUDA kernel function must be activated by starting many threads when it executes on the device. All threads started by one kernel function launch are collectively called a grid; all threads in the same grid share the same global memory space. A grid is composed of multiple thread blocks, and each thread block contains multiple threads.
In this embodiment, each thread block contains 16 × 16 threads, indexed from (0, 0). Each thread executes the operations in the kernel function and is responsible for the preprocessing of a single pixel value, including calculation of the pixel value after image size scaling, pixel value normalization, and centering. The number of thread blocks in this embodiment is m × n, where m equals the width of the single input frame divided by the number of thread columns in a single thread block, i.e. width/16, and n equals the height of the single input frame divided by the number of thread rows in a single thread block, i.e. height/16. Note that width and height refer to numbers of pixels; for example, an image with a resolution of 640 × 360 has a width of 640 and a height of 360.
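The launch configuration described above can be sketched in Python (the 16 × 16 block size and the m × n quotient come from the text; the helper names are ours, and real CUDA code would use blockIdx/blockDim/threadIdx):

```python
BLOCK_W, BLOCK_H = 16, 16  # threads per block: 16 columns x 16 rows

def grid_size(width, height):
    """Number of thread blocks m x n for a width x height image,
    as the quotients width/16 and height/16 described in the text."""
    return width // BLOCK_W, height // BLOCK_H

def thread_to_pixel(block_x, block_y, thread_x, thread_y):
    """Pixel handled by one thread, mirroring the CUDA indexing
    x = blockIdx.x * blockDim.x + threadIdx.x (likewise for y)."""
    return block_x * BLOCK_W + thread_x, block_y * BLOCK_H + thread_y
```

For a 640 × 480 image this gives a 40 × 30 grid. Note that in the 640 × 360 example, 360 is not a multiple of 16; production code would round the block count up (ceiling division) and have each thread check that its pixel lies inside the image.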
The embodiment also provides an image preprocessing method based on CUDA, which is applied to a CPU end, and the image preprocessing method comprises the following steps:
step one, acquiring a single-frame image;
reading pixel value data of all pixels in the single-frame image;
and step three, receiving the preprocessed image data obtained by the image preprocessing method.
In this embodiment, the image preprocessing method further includes the steps of: allocating memory on a CPU according to the single frame image;
after receiving the preprocessed image data, the method also comprises the following steps: and releasing the memory allocated on the CPU.
The embodiment releases the allocated memory in time after receiving the preprocessed image data, thereby avoiding memory leakage.
This embodiment proposes a CUDA-based image preprocessing method that improves the speed of image preprocessing through cooperative computation by the GPU and the CPU. For an input image, the CPU reads the pixel value data, the image data is then copied to the GPU, and the pre-designed and pre-written CUDA kernel function is called to perform, in parallel, the calculation of pixel values after image size scaling, pixel value normalization, and centering. The principle of the parallel computation is that the pixels of the image are divided among the threads, each thread processes one pixel, and all threads compute in parallel, which greatly increases the computation speed of the image preprocessing stage.
Fig. 5 is a connection block diagram of the CUDA-based image preprocessing system 100 according to an embodiment of the present invention. As shown in fig. 5, in a specific embodiment the CUDA-based image preprocessing system 100 includes a CPU 10 and a GPU 20 connected to the CPU 10. The CPU 10 is configured to acquire a single-frame image, read the pixel value data of all pixels in the single-frame image, and receive the preprocessed image data obtained by the above image preprocessing method. The GPU 20 has a pre-written CUDA kernel function and is configured to acquire the single-frame image and the pixel value data of all pixels in the single-frame image read by the CPU 10; then perform a scaling operation on the single-frame image to obtain a scaled image; and then call the CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU 10.
In this embodiment, the CUDA kernel function is first designed and defined; the kernel function defines the work of a single thread and is the core of CUDA programming. Here the kernel function must implement the calculation of the pixel value of a single pixel after image size scaling, followed by normalization and centering. After the kernel function is defined, memory is allocated on the host according to the size of the data to be processed and the data is initialized; in this embodiment the host is the CPU 10 and system memory, and the data is the pixel data of the image to be processed. Likewise, memory is allocated on the device according to the data size and the data is copied from the host to the device; the device is the GPU 20 and its memory, and this step implements the communication between the CPU 10 and the GPU 20, transmitting the image data from the CPU 10 to the GPU 20. The CUDA kernel function is then called on the device: the number of thread blocks and the number of threads per thread block are configured, the kernel function is activated, and many threads are started to execute the kernel program. After the kernel function finishes, the result is transmitted from the device back to the host; here the result is the pixel value data of the image after the size scaling, normalization, and centering preprocessing operations. To avoid memory leaks, the memory previously allocated on the host and the device is freed after the operation completes.
In the CUDA-based image preprocessing method provided in this embodiment, for an input single-frame image, memory is first allocated on the CPU 10 according to the image size and the CPU 10 reads the pixel value data of all pixels of the image; corresponding memory is then allocated on the GPU 20 according to the image size, and the image is copied from the CPU 10 to the GPU 20. The image size is scaled by calling the pre-designed and pre-written CUDA kernel function, which computes on the GPU 20, with many threads in parallel, the pixel value of each resized pixel on the three RGB channels; the values of the three channels of each pixel are then normalized and centered, after which the preprocessed image data is copied from the GPU 20 back to the CPU 10. Finally, the memory previously allocated on the CPU 10 and the GPU 20 is released.
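The whole per-pixel pipeline of this embodiment (bilinear size scaling, then normalization, then centering) can be sketched as a serial NumPy reference; on the GPU 20, the body of the double loop below is the work one CUDA thread would do for its pixel. The coordinate mapping and the channel means are our assumptions for illustration:

```python
import numpy as np

def preprocess(image_u8, out_w, out_h, channel_means):
    """Bilinear resize to out_h x out_w, normalize to [0, 1], then center."""
    in_h, in_w, _ = image_u8.shape
    src = image_u8.astype(np.float64)
    out = np.empty((out_h, out_w, 3), dtype=np.float64)
    for yy in range(out_h):          # on the GPU, each (xx, yy) is one thread
        for xx in range(out_w):
            # Map the output pixel back into the source image (one common
            # convention; the patent does not fix the mapping).
            sx = xx * (in_w - 1) / max(out_w - 1, 1)
            sy = yy * (in_h - 1) / max(out_h - 1, 1)
            x0, y0 = int(sx), int(sy)
            x1, y1 = min(x0 + 1, in_w - 1), min(y0 + 1, in_h - 1)
            tx, ty = sx - x0, sy - y0
            top = src[y0, x0] + tx * (src[y0, x1] - src[y0, x0])  # f(x', y0)
            bot = src[y1, x0] + tx * (src[y1, x1] - src[y1, x0])  # f(x', y1)
            out[yy, xx] = top + ty * (bot - top)                  # f(x', y')
    return out / 255.0 - np.asarray(channel_means)  # normalize, then center

# Hypothetical all-white input and placeholder channel means.
img = np.full((4, 6, 3), 255, dtype=np.uint8)
res = preprocess(img, out_w=3, out_h=2, channel_means=[0.5, 0.5, 0.5])
```

The serial double loop is exactly what the patent replaces with one CUDA thread per output pixel.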
In the CUDA-based image preprocessing method provided in this embodiment, each step of the preprocessing stage calls the CUDA kernel function to compute all pixel values in parallel on the GPU 20, so that, compared with the traditional approach of computing pixel values serially, one pixel at a time, the computation speed is significantly increased and the overall computation time is greatly reduced.
Thus, it should be appreciated by those skilled in the art that while a number of exemplary embodiments of the invention have been illustrated and described in detail herein, many other variations or modifications consistent with the principles of the invention may be directly determined or derived from the disclosure of the present invention without departing from the spirit and scope of the invention. Accordingly, the scope of the invention should be understood and interpreted to cover all such other variations or modifications.

Claims (10)

1. A CUDA-based image preprocessing method is applied to a GPU side, and is characterized by comprising the following steps:
acquiring a single-frame image and pixel value data of all pixels in the single-frame image read by a CPU (central processing unit);
carrying out a scaling operation on the single-frame image to obtain a scaled image;
calling a pre-written CUDA kernel function, and using the CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU.
2. The image preprocessing method according to claim 1, wherein after calling the CUDA kernel function to calculate in parallel the pixel values of the respective pixels in the scaled image from the pixel value data of the respective pixels read by the CPU, the method further comprises the steps of:
performing normalization and centering calculations on the pixel value of each pixel in the scaled image, respectively, to obtain preprocessed image data.
3. The method of claim 2, wherein after the normalizing and centering of the pixel values of the pixels in the scaled image to obtain the pre-processed image data, the method further comprises:
and transmitting the preprocessed image data to a CPU.
4. The image preprocessing method according to claim 3, further comprising the steps of: allocating a memory on the GPU according to the single-frame image;
after the preprocessed image data is transmitted to the CPU, the method further comprises the following steps:
and releasing the memory allocated on the GPU.
5. The image preprocessing method according to any of claims 1 to 4, wherein the kernel function, when executing on the GPU, simultaneously starts all thread blocks, each of the thread blocks being composed of a plurality of threads, and each of the threads performing the pixel value calculation of one pixel in the scaled image.
6. The image preprocessing method of claim 5,
the number of all the thread blocks is m × n, wherein m represents the quotient of the width of the single frame image and the number of columns of the threads contained in the thread block, and n represents the quotient of the height of the single frame image and the number of rows of the threads contained in the thread block.
7. An image preprocessing method based on CUDA is characterized in that the image preprocessing method is applied to a CPU side, and the image preprocessing method comprises the following steps:
acquiring a single-frame image;
reading pixel value data of all pixels in the single-frame image;
receiving pre-processed image data obtained by the image pre-processing method of any one of claims 2-6.
8. The image preprocessing method according to claim 7, further comprising the steps of: allocating memory on the CPU according to the single-frame image;
after receiving the preprocessed image data, the method further comprises the following steps: and releasing the memory allocated on the CPU.
9. A CUDA-based image preprocessing system, comprising:
a CPU configured to acquire a single frame image and read pixel value data of all pixels in the single frame image, and receive preprocessed image data obtained by the image preprocessing method according to any one of claims 2 to 6;
the GPU is connected with the CPU and has a pre-written CUDA kernel function, and the GPU is configured to acquire a single-frame image and the pixel value data of all pixels in the single-frame image read by the CPU; then perform a scaling operation on the single-frame image to obtain a scaled image; and then call the CUDA kernel function to calculate in parallel the pixel value of each pixel in the scaled image according to the pixel value data of each pixel read by the CPU.
10. The image preprocessing system of claim 9, wherein
the GPU is further configured to perform normalization and centering calculations on the pixel value of each pixel in the scaled image, so as to obtain the preprocessed image data.
CN202211386248.7A 2022-11-07 2022-11-07 Image preprocessing method and image preprocessing system based on CUDA Pending CN115641264A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211386248.7A CN115641264A (en) 2022-11-07 2022-11-07 Image preprocessing method and image preprocessing system based on CUDA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211386248.7A CN115641264A (en) 2022-11-07 2022-11-07 Image preprocessing method and image preprocessing system based on CUDA

Publications (1)

Publication Number Publication Date
CN115641264A true CN115641264A (en) 2023-01-24

Family

ID=84948004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211386248.7A Pending CN115641264A (en) 2022-11-07 2022-11-07 Image preprocessing method and image preprocessing system based on CUDA

Country Status (1)

Country Link
CN (1) CN115641264A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118037870B (en) * 2024-04-08 2024-06-07 中国矿业大学 Zdepth-compatible parallelization depth image compression algorithm, zdepth-compatible parallelization depth image compression device and terminal equipment


Similar Documents

Publication Publication Date Title
JP2018147182A (en) Arithmetic processing unit and method for controlling the same
JPH01265370A (en) Intensifying of image data
JP6800656B2 (en) Arithmetic circuit, its control method and program
CN110930296A (en) Image processing method, device, equipment and storage medium
CN107680028B (en) Processor and method for scaling an image
JP2020042774A (en) Artificial intelligence inference computing device
US11816871B2 (en) Real-time low latency computer vision/machine learning compute accelerator with smart convolutional neural network scheduler
WO2022242122A1 (en) Video optimization method and apparatus, terminal device, and storage medium
US11803933B2 (en) Data processing method and sensor device for performing the same
CN111767752B (en) Two-dimensional code identification method and device
CN104065937A (en) Real-time high-speed image pre-processing method for CMOS image sensor
CN111724312A (en) Method and terminal for processing image
CN113506305A (en) Image enhancement method, semantic segmentation method and device for three-dimensional point cloud data
CN115641264A (en) Image preprocessing method and image preprocessing system based on CUDA
WO2020187029A1 (en) Image processing method and device, neural network training method, and storage medium
KR102064581B1 (en) Apparatus and Method for Interpolating Image Autoregressive
CN101640795A (en) Video decoding optimization method and device
CN113723418B (en) Method and device for optimizing contrast image
CN115809959A (en) Image processing method and device
CN112150608A (en) Three-dimensional face reconstruction method based on graph convolution neural network
KR20220114435A (en) Method and apparatus for accelerating convolutional neural networks
JP2006133839A (en) Image processing device, print device and image processing method
CN117217274B (en) Vector processor, neural network accelerator, chip and electronic equipment
CN110189272B (en) Method, apparatus, device and storage medium for processing image
WO2021147316A1 (en) Object recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination