CN110969565B - Image processing method and device

Image processing method and device

Info

Publication number
CN110969565B
Authority
CN
China
Prior art keywords
thread
distribution information
preset
dimension
adjusted
Legal status
Active
Application number
CN201811142940.9A
Other languages
Chinese (zh)
Other versions
CN110969565A
Inventor
屠震元 (Tu Zhenyuan)
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
2018-09-28
Filing date
2018-09-28
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811142940.9A
Publication of CN110969565A (2020-04-07)
Application granted
Publication of CN110969565B (2023-05-16)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00: General purpose image data processing
    • G06T1/20: Processor architectures; Processor configuration, e.g. pipelining
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an image processing method and device, belonging to the technical field of computers. The method comprises the following steps: determining the size of a first image to be processed, and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks; determining, according to the preset thread distribution information, the number of thread blocks that a single SM of the GPU can process in one task; determining the SM total use efficiency according to the number of thread blocks to be processed and the number of thread blocks that a single SM can process in one task; and determining the thread distribution information to be used for the thread blocks according to the SM total use efficiency, and performing image processing on the first image based on the thread distribution information to be used. With this method and device, the hardware resources of the SMs can be fully utilized, thereby improving the efficiency of image processing.

Description

Image processing method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for image processing.
Background
With the development of science and technology, machine learning has come to be used in many fields. Deep learning is a leading-edge branch of machine learning that has developed rapidly in recent years. The convolutional neural network model is an algorithm model widely used in deep learning and usually serves as the backbone of an image processing model.
The complexity of convolutional neural networks keeps increasing, and the amount of computation grows with it, while computing resources are limited. How to improve the image processing efficiency of a convolutional neural network model with a fixed amount of computing resources has therefore become a problem that the relevant technicians need to solve.
In the related art, the threads to be processed by an SM (Streaming Multiprocessor) in a GPU (Graphics Processing Unit) are pre-partitioned into thread blocks, each containing a certain number of threads, and the thread blocks are then allocated to the SMs for processing.
In the process of implementing the present application, the inventors found that the related art has at least the following problems:
in the related art, the processing of one image may be divided into a plurality of tasks, and in each task thread blocks are distributed to the SMs for processing according to their processing capacity. Since every SM has the same processing capacity, the time each SM takes to complete a task is very similar and can be regarded as approximately the same. It often happens that, by the time the last task is allocated, only a small number of thread blocks remain to be processed; in that last task the few remaining thread blocks are allocated to only a small portion of the SMs while the remaining SMs stay idle, so the efficiency of image processing is low.
Disclosure of Invention
In order to solve the problems in the prior art, an embodiment of the application provides an image processing method and device. The technical scheme is as follows:
in a first aspect, there is provided a method of image processing, the method comprising:
determining the size of a first image to be processed, and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
according to the preset thread distribution information, determining the number of thread blocks which can be processed by a single SM one-time task of the GPU;
determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and determining thread distribution information to be used of a thread block according to the SM total use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
Optionally, the determining the thread distribution information to be used of the thread block according to the SM total use efficiency includes:
if the SM total use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the SM total use efficiency is smaller than a preset threshold value, carrying out numerical adjustment on the preset thread distribution information according to a preset adjustment rule, and turning to executing the determination of the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multidimensional thread block, the thread distribution information includes the number of threads corresponding to each dimension, and the performing numerical adjustment on the preset thread distribution information according to a preset adjustment rule includes:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the task execution times required for executing image processing on the first image;
and determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number corresponding to the dimension to be adjusted in the preset thread distribution information and the task execution times, and carrying out numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
Optionally, the performing numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted includes:
if the change proportion of the thread number corresponding to the dimension to be adjusted relative to the thread number corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the thread number corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted thread number corresponding to the dimension to be adjusted, and adjusting the thread number corresponding to the dimension to be adjusted in the preset thread distribution information to be the readjusted thread number;
if the change proportion of the number of threads corresponding to the dimension to be adjusted relative to the number of threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information is adjusted to be the adjusted number of threads.
Optionally, the determining, according to the preset thread distribution information, the number of thread blocks that can be processed by a single SM one task of the GPU includes:
determining the number of threads in a thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM task of the GPU according to the number of threads in the thread blocks and the preset number of threads which can be processed by a single SM of the GPU in one cycle.
Optionally, after performing image processing on the first image based on the thread distribution information to be used, the method further includes:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, image processing is performed on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
In a second aspect, there is provided an apparatus for image processing, the apparatus comprising:
the determining module is used for determining the size of a first image to be processed and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
according to the preset thread distribution information, determining the number of thread blocks which can be processed by a single SM one-time task of the GPU;
determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and the execution module is used for determining thread distribution information to be used of a thread block according to the SM total use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
Optionally, the determining module is configured to:
if the SM total use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the SM total use efficiency is smaller than a preset threshold value, carrying out numerical adjustment on the preset thread distribution information according to a preset adjustment rule, and turning to executing the determination of the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multidimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the determining module is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the task execution times required for executing image processing on the first image;
and determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number corresponding to the dimension to be adjusted in the preset thread distribution information and the task execution times, and carrying out numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
Optionally, the determining module is configured to:
if the change proportion of the thread number corresponding to the dimension to be adjusted relative to the thread number corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the thread number corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted thread number corresponding to the dimension to be adjusted, and adjusting the thread number corresponding to the dimension to be adjusted in the preset thread distribution information to be the readjusted thread number;
if the change proportion of the number of threads corresponding to the dimension to be adjusted relative to the number of threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information is adjusted to be the adjusted number of threads.
Optionally, the determining module is configured to:
determining the number of threads in a thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM task of the GPU according to the number of threads in the thread blocks and the preset number of threads which can be processed by a single SM of the GPU in one cycle.
Optionally, the execution module is further configured to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, image processing is performed on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
In a third aspect, there is provided a computer device comprising a processor and a memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of image processing as described in the first aspect above.
In a fourth aspect, there is provided a computer readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of image processing as described in the first aspect above.
The technical solutions provided in the embodiments of the present application have at least the following beneficial effects:
in the embodiment of the application, based on the number of threads corresponding to the image processing to be executed, the preset thread distribution information of the thread blocks and the SM related information in the GPU, the SM total use efficiency is obtained, and then the thread distribution information of the thread blocks is adjusted according to the SM total use efficiency, so that the hardware resources of the SM can be fully utilized, and the image processing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of image processing provided in an embodiment of the present application;
FIG. 2 is a flow chart of a method of image processing provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an apparatus for image processing according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer device according to an embodiment of the present application;
fig. 5 is a schematic diagram of thread distribution information of a thread block according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides an image processing method which can be implemented by a computer device, where the computer device may be a computer with image processing functions. The image processing may be image convolution processing, image scaling processing, matting processing, etc. In this embodiment of the present application, the image convolution processing in a convolutional neural network is described as an example; other kinds of image processing are similar and will not be described herein.
As shown in fig. 1, the process flow of the method may include the following steps:
in step 101, the size of a first image to be processed is determined, and the number of thread blocks to be processed is determined according to the size of the first image and the preset thread distribution information of the thread blocks.
Wherein, as shown in fig. 5, the thread block is composed of a certain number of threads, and the thread distribution information may include the thread number dim.x in the x dimension of the thread block, the thread number dim.y in the y dimension, and the thread number dim.z in the z dimension. The preset thread distribution information may be set to dim.x=32, dim.y=8, and dim.z=8.
In implementation, during convolutional neural network processing, the GPU in the computer device acquires a first image that needs image convolution processing. The processing work of the first image is carried by threads: the GPU allocates the threads to the SMs, the modules that actually execute the work, and it allocates threads in units of thread blocks, so a technician can set the thread distribution information of the thread blocks in advance. The thread distribution information of every thread block is the same. The number of threads in a single thread block, num_thread, is obtained by multiplying the number of threads in the x dimension, dim.x, the number of threads in the y dimension, dim.y, and the number of threads in the z dimension, dim.z, i.e., num_thread = dim.x * dim.y * dim.z. Then, when processing the first image, the number of thread blocks to be processed in the GPU can be calculated as follows.
The computer device divides the width M, height N, and channel number K of the first image by the thread counts dim.x, dim.y, and dim.z of the thread block, respectively, to obtain the number of thread blocks in each dimension of the grid, where the grid is the organizational form of all the threads and can be regarded as the set of all threads. It should be noted that the pairing between the image dimensions and the block dimensions is flexible: the specific operation may be M divided by dim.x, N divided by dim.y, and K divided by dim.z; or M divided by dim.y, N divided by dim.x, and K divided by dim.z; or M divided by dim.z, N divided by dim.y, and K divided by dim.x; or any other similar pairing besides these three, which may be determined according to the actual situation and is not enumerated here.
The number of thread blocks grid_dim.x in the x dimension, grid_dim.y in the y dimension, and grid_dim.z in the z dimension are then multiplied to obtain the total number of thread blocks in the grid, num_block = grid_dim.x * grid_dim.y * grid_dim.z, i.e., the number of thread blocks to be processed in the GPU when processing the first image.
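As a concrete illustration of step 101, the host-side sketch below (CUDA C++) computes the grid layout and num_block for the pairing in which the width M maps to the x dimension, the height N to the y dimension, and the channel number K to the z dimension; all function and variable names are illustrative and not taken from the patent. It assumes a ceiling division so that a partial block at the image border is still counted, mirroring the (a + b - 1) / b rounding used elsewhere in this description.

#include <cuda_runtime.h>  // for dim3

// Ceiling division: number of blocks of size "step" needed to cover "total".
static unsigned int ceilDiv(unsigned int total, unsigned int step) {
    return (total + step - 1) / step;
}

// Step 101 (sketch): number of thread blocks to be processed for the first image.
unsigned int numPendingBlocks(unsigned int M, unsigned int N, unsigned int K,
                              dim3 block) {
    dim3 grid(ceilDiv(M, block.x),   // grid_dim.x
              ceilDiv(N, block.y),   // grid_dim.y
              ceilDiv(K, block.z));  // grid_dim.z
    return grid.x * grid.y * grid.z; // num_block
}

// Example with the preset distribution dim.x = 32, dim.y = 8, dim.z = 8:
//   unsigned int num_block = numPendingBlocks(M, N, K, dim3(32, 8, 8));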
In step 102, the number of thread blocks that can be processed by a single SM task of the GPU is determined according to the preset thread distribution information.
In implementation, because an SM uses hardware resources such as its shared memory and registers when processing threads, the number of threads that one SM task can process is limited by those hardware resources. Since the GPU assigns threads to the SMs in units of thread blocks, the number of thread blocks that one SM task can process can be determined from the number of threads in a thread block.
Optionally, the number of threads in the thread block may be determined according to preset thread distribution information; and determining the number of thread blocks which can be processed by a single SM of the GPU by one task according to the number of threads in the thread blocks and the preset number of threads which can be processed by the single SM of the GPU in one cycle.
Here, the SM processes threads in units of Warp (thread bundle), and the number of threads in a single Warp is 32.
In implementation, a technician may preset the number of threads that a single SM can process in one cycle according to the actual hardware resources of the SM, which gives the number of Warps a single SM can process in one cycle. In this embodiment, the number of threads a single SM can process in one cycle is set to 512, i.e., the number of Warps a single SM can process in one cycle is 16. As described in step 101, the thread counts dim.x, dim.y, and dim.z from the preset thread distribution information are multiplied to obtain the number of threads num_thread in a single thread block. Then, from num_thread and the number of threads in a single Warp, the number of Warps in a single thread block is calculated as: num_warp = num_thread / 32. Finally, the number of thread blocks that a single SM can process in one task is calculated from the number of Warps in a single thread block and the number of Warps a single SM can process in one cycle, using the following formula:
saturating_block_per_sm=(16+num_warp-1)/num_warp。
In this embodiment, the number of threads in a thread block may be an integer multiple of 32, so that the SM avoids the resource waste that occurs when it processes fewer than 32 threads at a time.
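A matching sketch of step 102 follows, again with illustrative names; the default of 16 reflects the preset assumption above that a single SM processes 512 threads, i.e. 16 Warps, per cycle, and the block is assumed to contain at least one full Warp.

#include <cuda_runtime.h>  // for dim3

// Step 102 (sketch): thread blocks a single SM can process in one task.
unsigned int blocksPerSmPerTask(dim3 block, unsigned int warpsPerSmPerCycle = 16) {
    const unsigned int kWarpSize = 32;
    unsigned int numThread = block.x * block.y * block.z;  // num_thread
    unsigned int numWarp   = numThread / kWarpSize;        // num_warp
    // saturating_block_per_sm = (16 + num_warp - 1) / num_warp (ceiling division)
    return (warpsPerSmPerCycle + numWarp - 1) / numWarp;
}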
In step 103, the SM total usage efficiency is determined according to the number of thread blocks to be processed and the number of thread blocks that can be processed by a single SM for one task.
In implementation, a technician may obtain the number of SMs in the GPU, num_sm, from the actual hardware information of the GPU, and then combine num_sm with the number of thread blocks a single SM can process in one task, saturating_block_per_sm, to obtain the number of thread blocks that all SMs in the GPU can process in one task, saturating_block, using the following formula:
saturating_block=num_sm*saturating_block_per_sm。
If the number of thread blocks actually processed in one SM task reaches the number of thread blocks that one SM task can process, the SM is said to run in saturation. Then, from the number num_block of thread blocks to be processed in the GPU and the number of thread blocks that all SMs can process in one task, the number of saturated runs of all SMs in the GPU, full_times, is calculated as: full_times = num_block / saturating_block; if the result is not an integer, full_times is rounded down to an integer. Similarly, from num_block and the number of thread blocks that all SMs can process in one task, the number remain_block of thread blocks still to be processed after the full_times saturated runs is calculated as follows:
remain_block = num_block % saturating_block, where % is the remainder (modulo) operator.
Then, from the number of remaining thread blocks to be processed and the number of thread blocks that all SMs in the GPU can process in one task, the saturation rate last_efficiency_sm of the last SM run, in which the remaining thread blocks are processed, is calculated as: last_efficiency_sm = remain_block / saturating_block. Next, it is necessary to determine whether any thread blocks remain to be processed after the full_times saturated runs, so as to determine the total number of runs total_times of the SMs: if there are remaining blocks, total_times = full_times + 1; otherwise, total_times = full_times. Finally, the SM total use efficiency is calculated from the total number of runs total_times, the saturation rate last_efficiency_sm of the last run, and the number of saturated runs full_times, using the following formula:
efficiency_sm=(full_times+last_efficiency_sm)/total_times。
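The formulas of step 103 combine into the following sketch (illustrative names; on a real device num_sm would typically come from cudaDeviceProp::multiProcessorCount via cudaGetDeviceProperties).

// Step 103 (sketch): SM total use efficiency for num_block pending thread blocks.
double smTotalUseEfficiency(unsigned int numBlock, unsigned int numSm,
                            unsigned int saturatingBlockPerSm) {
    unsigned int saturatingBlock = numSm * saturatingBlockPerSm;  // all SMs, one task
    unsigned int fullTimes   = numBlock / saturatingBlock;        // saturated runs
    unsigned int remainBlock = numBlock % saturatingBlock;        // leftover blocks
    double lastEfficiencySm  =
        static_cast<double>(remainBlock) / saturatingBlock;       // last_efficiency_sm
    unsigned int totalTimes  = fullTimes + (remainBlock > 0 ? 1 : 0);
    return (fullTimes + lastEfficiencySm) / totalTimes;           // efficiency_sm
}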
In step 104, thread distribution information to be used of the thread block is determined according to the SM total use efficiency, and image processing is performed on the first image based on the thread distribution information to be used.
In implementation, the SM total use efficiency obtained from the above calculation is used to judge whether the current thread distribution information of the thread block is reasonable, i.e., whether the SM total use efficiency reaches the expectation, so that reasonable thread distribution information to be used is determined and the first image is processed according to that thread distribution information.
As shown in fig. 2, the processing flow of the image processing method may be as follows:
step 201, determining the size of a first image to be processed.
Step 202, determining the number of thread blocks to be processed according to the size of the first image and the preset thread distribution information of the thread blocks.
And 203, determining the number of thread blocks which can be processed by a single SM task of the GPU according to preset thread distribution information.
Step 204, determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by a single SM one-time task.
Step 205, determining whether the SM total usage efficiency is greater than a preset threshold. If yes, step 206 is performed, and if not, step 207 and subsequent steps are performed.
In practice, the technician may set a threshold for the SM total use efficiency in advance according to the actual needs of processing the first image; the threshold may be set to 90%. If the threshold is set too low, the resulting efficiency gain is not worthwhile; if it is set too high, it may never be reached. The SM total use efficiency is compared with the preset threshold: if it is greater than the preset threshold, the preset thread distribution information is taken as the thread distribution information to be used; if it is smaller than the preset threshold, the preset thread distribution information is adjusted.
Step 206, determining the preset thread distribution information as thread distribution information to be used, and executing image processing on the first image according to the thread distribution information to be used. After step 206 is performed, the flow ends.
Step 207, determining the least common multiple new_block of the number of thread blocks to be processed and the number of SMs in the GPU.
Step 208, determining the quotient of the least common multiple divided by the number of thread blocks to be processed as the number of task executions required to execute image processing on the first image.
In implementation, from the least common multiple new_block obtained in step 207 and the number num_block of thread blocks to be processed, the number of task executions num_times required for processing the first image is calculated as: num_times = new_block / num_block.
Step 209, determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number and the task execution times corresponding to the dimension to be adjusted in the preset thread distribution information, and performing numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
In implementation, from the thread count dim.y of the dimension to be adjusted in the preset thread distribution information (the dimension to be adjusted may be the x, y, or z dimension; the y dimension is taken as an example in this embodiment, and the other dimensions are handled similarly and are not described here) and the number of task executions num_times required for processing the first image, the adjusted thread count new_y for the y dimension is calculated as: new_y = (dim.y + num_times - 1) / num_times. The value dim.y in the preset thread distribution information is then numerically adjusted according to the adjusted thread count new_y.
Step 210, determining whether the change ratio of the adjusted thread number corresponding to the dimension to be adjusted to the thread number corresponding to the dimension to be adjusted is greater than a preset change ratio. If yes, go to step 211, if no, go to step 202.
Step 211, readjusting the number of threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted number of threads, adjusting the number of threads corresponding to that dimension in the preset thread distribution information to the readjusted number, and going to step 202.
In implementation, the thread count of the dimension to be adjusted is set to the value that just meets the preset change proportion, and this value is used as the thread count of that dimension in the preset thread distribution information. The flow then goes back to step 202.
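Steps 207-211 together form the adjustment pass sketched below (illustrative names; the y dimension is adjusted, as in the example above, and the re-adjustment that "just meets the preset change proportion" is interpreted here, as an assumption, as shrinking dim.y by exactly that proportion).

#include <cuda_runtime.h>  // for dim3
#include <numeric>         // for std::lcm (C++17)

// Steps 207-211 (sketch): adjust dim.y when the SM total use efficiency is below
// the preset threshold. maxChangeRatio is the preset change proportion.
dim3 adjustBlockDim(dim3 block, unsigned long long numBlock,
                    unsigned long long numSm, double maxChangeRatio) {
    // Step 207: least common multiple of pending blocks and the SM count.
    unsigned long long newBlock = std::lcm(numBlock, numSm);
    // Step 208: task executions required for the first image.
    unsigned int numTimes = static_cast<unsigned int>(newBlock / numBlock);
    // Step 209: adjusted thread count for the y dimension (ceiling division).
    unsigned int newY = (block.y + numTimes - 1) / numTimes;
    // Steps 210-211: cap the change at the preset change proportion.
    double changeRatio = static_cast<double>(block.y - newY) / block.y;
    if (changeRatio > maxChangeRatio) {
        newY = static_cast<unsigned int>(block.y * (1.0 - maxChangeRatio));
        if (newY == 0) newY = 1;  // keep at least one thread in the dimension
    }
    return dim3(block.x, newY, block.z);  // the caller then re-runs from step 202
}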
It should be noted that steps 207-211 describe one method of numerically adjusting the preset thread distribution information according to a preset adjustment rule; other methods may also be used in the embodiments of the present application, for example adjusting by a fixed value. In addition, steps 210-211 describe one method of controlling the degree of adjustment of the preset thread distribution information; in this embodiment the degree of adjustment need not be controlled in this way: after adjusting the preset thread distribution information, the process may go directly to step 202, or other methods may be used to control the degree of adjustment.
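Putting the pieces together, the sketch below walks through the overall flow of Fig. 2, reusing the illustrative helpers shown for steps 101-103 and 207-211 above; the iteration cap is an added assumption to keep the example from looping indefinitely.

// Overall flow of Fig. 2 (sketch): adjust the preset block shape until the
// SM total use efficiency exceeds the preset threshold.
dim3 chooseBlockDim(unsigned int M, unsigned int N, unsigned int K, dim3 preset,
                    unsigned int numSm, double threshold, double maxChangeRatio,
                    int maxIterations = 16) {
    dim3 block = preset;
    for (int i = 0; i < maxIterations; ++i) {
        unsigned int numBlock = numPendingBlocks(M, N, K, block);             // steps 201-202
        unsigned int perSm    = blocksPerSmPerTask(block);                    // step 203
        double efficiency     = smTotalUseEfficiency(numBlock, numSm, perSm); // step 204
        if (efficiency > threshold) break;                                    // steps 205-206
        block = adjustBlockDim(block, numBlock, numSm, maxChangeRatio);       // steps 207-211
    }
    return block;  // thread distribution information to be used
}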
After either of the two processing flows has completed, when the GPU of the computer device receives an instruction to process a second image, if the second image is detected to have the same size as the first image, image processing can be performed on the second image directly using the thread distribution information to be used that was obtained in the processing flow.
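A minimal sketch of this reuse, assuming the tuned block shape is kept in a host-side cache keyed by image size; the cache structure and names are illustrative and not part of the patent.

#include <cuda_runtime.h>  // for dim3
#include <map>
#include <tuple>

using ImageSize = std::tuple<unsigned int, unsigned int, unsigned int>;  // M, N, K

// Host-side cache: image size -> thread distribution information to be used.
static std::map<ImageSize, dim3> g_tunedBlockDims;

dim3 blockDimForImage(unsigned int M, unsigned int N, unsigned int K, dim3 preset) {
    ImageSize key{M, N, K};
    auto it = g_tunedBlockDims.find(key);
    if (it != g_tunedBlockDims.end()) {
        return it->second;   // same size as an earlier image: reuse directly
    }
    dim3 tuned = preset;     // placeholder: run the tuning flow (steps 201-211) here
    g_tunedBlockDims[key] = tuned;
    return tuned;
}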
Based on the same technical concept, the embodiment of the present application further provides an apparatus for image processing, where the apparatus may be the computer device in the foregoing embodiment, as shown in fig. 3, and the apparatus includes: a determination module 310 and an execution module 320.
A determining module 310, configured to determine a size of a first image to be processed, and determine a number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
according to the preset thread distribution information, determining the number of thread blocks which can be processed by a single SM one-time task of the GPU;
determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and the execution module 320 is configured to determine thread distribution information to be used of a thread block according to the SM total use efficiency, and execute image processing on the first image based on the thread distribution information to be used.
Optionally, the determining module 310 is configured to:
if the SM total use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the SM total use efficiency is smaller than a preset threshold value, carrying out numerical adjustment on the preset thread distribution information according to a preset adjustment rule, and turning to executing the determination of the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multidimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the determining module 310 is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the task execution times required for executing image processing on the first image;
and determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number corresponding to the dimension to be adjusted in the preset thread distribution information and the task execution times, and carrying out numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
Optionally, the determining module 310 is configured to:
if the change proportion of the thread number corresponding to the dimension to be adjusted relative to the thread number corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the thread number corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted thread number corresponding to the dimension to be adjusted, and adjusting the thread number corresponding to the dimension to be adjusted in the preset thread distribution information to be the readjusted thread number;
if the change proportion of the number of threads corresponding to the dimension to be adjusted relative to the number of threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information is adjusted to be the adjusted number of threads.
Optionally, the determining module 310 is configured to:
determining the number of threads in a thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM task of the GPU according to the number of threads in the thread blocks and the preset number of threads which can be processed by a single SM of the GPU in one cycle.
Optionally, the execution module is further configured to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, image processing is performed on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and will not be described in detail here.
It should be noted that: in the image processing apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the image processing apparatus provided in the above embodiment and the image processing method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 400 may differ considerably in configuration and performance, and may include one or more processors (central processing units, CPU) 401 and one or more memories 402, where the memories 402 store at least one instruction that is loaded and executed by the processors 401 to implement the above image processing method.
In an exemplary embodiment, there is also provided a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of image processing in the above-described embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, etc.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the invention to these particular embodiments, nor to limit the scope of the invention to them.

Claims (12)

1. A method of image processing, the method comprising:
determining the size of a first image to be processed, determining the number of thread blocks corresponding to a plurality of dimensions respectively according to the size of the first image and preset thread distribution information of the thread blocks, and determining the number of thread blocks to be processed according to the number of thread blocks corresponding to the plurality of dimensions respectively, wherein the size of the first image comprises the length, the width and the channel number of the first image, the preset thread distribution information comprises the number of threads of the thread blocks in the plurality of dimensions, and the plurality of dimensions comprise an x dimension, a y dimension and a z dimension;
determining the number of threads in a thread block according to the preset thread distribution information;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one-time period of the preset GPU;
determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and determining thread distribution information to be used of a thread block according to the SM total use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
2. The method according to claim 1, wherein determining thread distribution information to be used for a thread block according to the SM total use efficiency comprises:
if the SM total use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the SM total use efficiency is smaller than a preset threshold value, carrying out numerical adjustment on the preset thread distribution information according to a preset adjustment rule, and turning to executing the determination of the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
3. The method according to claim 2, wherein the thread block is a multidimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the performing numerical adjustment on the preset thread distribution information according to a preset adjustment rule includes:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the task execution times required for executing image processing on the first image;
and determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number corresponding to the dimension to be adjusted in the preset thread distribution information and the task execution times, and carrying out numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
4. The method of claim 3, wherein the performing numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted includes:
if the change proportion of the thread number corresponding to the dimension to be adjusted relative to the thread number corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the thread number corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted thread number corresponding to the dimension to be adjusted, and adjusting the thread number corresponding to the dimension to be adjusted in the preset thread distribution information to be the readjusted thread number;
if the change proportion of the number of threads corresponding to the dimension to be adjusted relative to the number of threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information is adjusted to be the adjusted number of threads.
5. The method according to claim 1, wherein after performing image processing on the first image based on the thread distribution information to be used, further comprising:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, image processing is performed on the second image based on the thread distribution information to be used.
6. The method according to any one of claims 1-5, wherein the image processing is image convolution processing.
7. An apparatus for image processing, the apparatus comprising:
the device comprises a determining module, a processing module and a processing module, wherein the determining module is used for determining the size of a first image to be processed, determining the number of thread blocks corresponding to a plurality of dimensions respectively according to the size of the first image and preset thread distribution information of the thread blocks, and determining the number of the thread blocks to be processed according to the number of the thread blocks corresponding to the plurality of dimensions respectively, wherein the size of the first image comprises the length, the width and the channel number of the first image, the preset thread distribution information comprises the thread number of the thread blocks in the plurality of dimensions, and the plurality of dimensions comprise an x dimension, a y dimension and a z dimension;
determining the number of threads in a thread block according to the preset thread distribution information;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one-time period of the preset GPU;
determining the total use efficiency of the SMs according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and the execution module is used for determining thread distribution information to be used of a thread block according to the SM total use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
8. The apparatus of claim 7, wherein the means for determining is configured to:
if the SM total use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the SM total use efficiency is smaller than a preset threshold value, carrying out numerical adjustment on the preset thread distribution information according to a preset adjustment rule, and turning to executing the determination of the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
9. The apparatus of claim 8, wherein the thread blocks are multidimensional thread blocks, the thread distribution information includes a number of threads corresponding to each dimension, and the determining module is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the task execution times required for executing image processing on the first image;
and determining the adjusted thread number corresponding to the dimension to be adjusted based on the thread number corresponding to the dimension to be adjusted in the preset thread distribution information and the task execution times, and carrying out numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted.
10. The apparatus of claim 9, wherein the determining module is configured to:
if the change proportion of the thread number corresponding to the dimension to be adjusted relative to the thread number corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the thread number corresponding to the dimension to be adjusted according to the preset change proportion to obtain the readjusted thread number corresponding to the dimension to be adjusted, and adjusting the thread number corresponding to the dimension to be adjusted in the preset thread distribution information to be the readjusted thread number;
if the change proportion of the number of threads corresponding to the dimension to be adjusted relative to the number of threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information is adjusted to be the adjusted number of threads.
11. The apparatus of claim 7, wherein the execution module is further to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, image processing is performed on the second image based on the thread distribution information to be used.
12. The apparatus according to any one of claims 7-11, wherein the image processing is an image convolution processing.
Application CN201811142940.9A (priority date 2018-09-28, filing date 2018-09-28): Image processing method and device. Status: Active. Granted publication: CN110969565B (en)


Publications (2)

CN110969565A (en), published 2020-04-07
CN110969565B (en), published 2023-05-16


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant