CN110969565A - Image processing method and device

Image processing method and device

Info

Publication number
CN110969565A
Authority
CN
China
Prior art keywords: thread, distribution information, threads, preset, determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811142940.9A
Other languages
Chinese (zh)
Other versions
CN110969565B (en)
Inventor
屠震元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201811142940.9A
Publication of CN110969565A
Application granted
Publication of CN110969565B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses an image processing method and device, belonging to the field of computer technologies. The method comprises the following steps: determining the size of a first image to be processed, and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks; determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information; determining the total SM use efficiency according to the number of thread blocks to be processed and the number of thread blocks which can be processed by the single SM one-time task; and determining the thread distribution information to be used of the thread block according to the total SM use efficiency, and executing image processing on the first image based on the thread distribution information to be used. With the method and device, the hardware resources of the SMs can be fully utilized, thereby improving image processing efficiency.

Description

Image processing method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for image processing.
Background
With the development of science and technology, machine learning has been applied in many fields, and deep learning, as the leading branch of machine learning, has developed rapidly in recent years. The convolutional neural network model is an algorithm model widely used in deep learning and usually serves as the backbone of an image processing model.
As the complexity of convolutional neural networks keeps increasing, so does the amount of computation, while computer resources are limited. How to improve the image processing efficiency of a convolutional neural network model with a fixed amount of computer resources has therefore become an urgent problem for those skilled in the art.
In the related art, the threads to be processed by an SM (streaming multiprocessor) in a GPU (graphics processing unit) are grouped in advance into thread blocks, each containing a certain number of threads, and the thread blocks are allocated to the SMs for processing.
In the course of implementing the present application, the inventors found that the related art has at least the following problems:
in the related art, one round of image processing can be divided into a plurality of tasks, and in each task thread blocks are allocated to the SMs according to the processing capability of the SMs. Since the SMs have the same processing capability, they complete their tasks at nearly the same time, which can be approximated as the same time. It often happens that only a small number of thread blocks remain to be processed when the last task is assigned, so in that last task these thread blocks can only be assigned to a small fraction of the SMs while the remaining SMs are idle, which results in low image processing efficiency.
Disclosure of Invention
In order to solve the problems in the prior art, embodiments of the present application provide a method and an apparatus for image processing. The technical scheme is as follows:
in a first aspect, a method of image processing is provided, the method comprising:
determining the size of a first image to be processed, and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information;
determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and determining the thread distribution information to be used of the thread block according to the total SM use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
Optionally, the determining, according to the total SM usage efficiency, thread distribution information to be used of a thread block includes:
if the total SM use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the total SM use efficiency is smaller than a preset threshold value, carrying out numerical value adjustment on the preset thread distribution information according to a preset adjustment rule, and switching to executing the step of determining the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multidimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the performing a numerical adjustment on the preset thread distribution information according to a preset adjustment rule includes:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the number of task executions required for executing the image processing on the first image;
determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
Optionally, the performing numerical adjustment on the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted includes:
if the change proportion of the number of the threads after adjustment corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the number of the threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the number of the threads after readjustment corresponding to the dimension to be adjusted, and adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after readjustment;
if the change proportion of the number of the threads corresponding to the dimension to be adjusted after adjustment relative to the number of the threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after adjustment.
Optionally, the determining, according to the preset thread distribution information, the number of thread blocks that can be processed by a single SM task of the GPU includes:
determining the number of threads in the thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one cycle of the preset GPU.
Optionally, after performing image processing on the first image based on the thread distribution information to be used, the method further includes:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, performing image processing on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
In a second aspect, there is provided an apparatus for image processing, the apparatus comprising:
the determining module is used for determining the size of a first image to be processed and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information;
determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and the execution module is used for determining the thread distribution information to be used of the thread block according to the total SM use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
Optionally, the determining module is configured to:
if the total SM use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the total SM use efficiency is smaller than a preset threshold value, carrying out numerical value adjustment on the preset thread distribution information according to a preset adjustment rule, and switching to executing the step of determining the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multidimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the determining module is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the number of task executions required for executing the image processing on the first image;
determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
Optionally, the determining module is configured to:
if the change proportion of the number of the threads after adjustment corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the number of the threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the number of the threads after readjustment corresponding to the dimension to be adjusted, and adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after readjustment;
if the change proportion of the number of the threads corresponding to the dimension to be adjusted after adjustment relative to the number of the threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after adjustment.
Optionally, the determining module is configured to:
determining the number of threads in the thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one cycle of the preset GPU.
Optionally, the execution module is further configured to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, performing image processing on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
In a third aspect, there is provided a computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to implement the method of image processing as described in the first aspect above.
In a fourth aspect, there is provided a computer-readable storage medium having stored therein at least one instruction, the at least one instruction being loaded and executed by a processor to implement the method of image processing according to the first aspect.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, the total SM use efficiency is obtained based on the number of threads corresponding to the image processing to be executed, the preset thread distribution information of the thread blocks, and the SM-related information of the GPU, and the thread distribution information of the thread blocks is adjusted according to the total SM use efficiency, so that the hardware resources of the SMs can be fully utilized and the image processing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a flowchart of a method for image processing according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of image processing provided by an embodiment of the present application;
fig. 3 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application;
fig. 5 is a schematic view of thread distribution information of a thread block according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The embodiment of the application provides an image processing method, which can be implemented by a computer device. The computer device may be a computer having an image processing function. The image processing may be image convolution processing, image scaling processing, matting processing, and the like; the embodiment of the present application takes the image convolution processing in convolutional neural network processing as an example for description, and other kinds of image processing are similar and are not described again here.
As shown in fig. 1, the processing flow of the method may include the following steps:
in step 101, the size of a first image to be processed is determined, and the number of thread blocks to be processed is determined according to the size of the first image and preset thread distribution information of the thread blocks.
As shown in fig. 5, the thread block is composed of a certain number of threads, and the thread distribution information may include a thread number dim.x in the x dimension, a thread number dim.y in the y dimension, and a thread number dim.z in the z dimension of the thread block. The preset thread distribution information may be set to dim.x = 32, dim.y = 8, and dim.z = 8.
In implementation, during convolutional neural network processing, the GPU in the computer device acquires a first image that needs image convolution processing. The processing task for the first image is carried by threads, and the GPU allocates these threads to the SMs, which are the modules that actually execute the tasks. Because threads are allocated in units of thread blocks, a technician can set the thread distribution information within a thread block in advance, and the thread distribution information in every thread block is the same. Multiplying the number of threads dim.x in the x dimension, the number of threads dim.y in the y dimension, and the number of threads dim.z in the z dimension gives the number of threads num_thread in a single thread block, that is, num_thread = dim.x * dim.y * dim.z. The number of thread blocks to be processed by the GPU when processing the first image can then be calculated as follows.
The computer device divides the width M of the first image, the height N of the first image, and the number of channels K of the first image by the number of threads dim.x of the thread block in the x dimension, the number of threads dim.y in the y dimension, and the number of threads dim.z in the z dimension, respectively, to obtain the number of thread blocks in each dimension of the grid, where the grid is the organization form of all threads and can be regarded as the set of all threads. It should be noted that the three image dimensions can be mapped to the three block dimensions in different ways: M may be divided by dim.x, N by dim.y, and K by dim.z; or M by dim.y, N by dim.x, and K by dim.z; or M by dim.z, N by dim.y, and K by dim.x; other similar mappings are also possible, and the mapping can be chosen according to the actual situation, which is not described in detail here. This embodiment uses the most common mapping: M is divided by dim.x to obtain the number of thread blocks grid_dim.x of the grid in the x dimension, N is divided by dim.y to obtain the number of thread blocks grid_dim.y in the y dimension, and K is divided by dim.z to obtain the number of thread blocks grid_dim.z in the z dimension.
Multiplying the number of thread blocks grid_dim.x in the x dimension, grid_dim.y in the y dimension, and grid_dim.z in the z dimension gives the total number of thread blocks num_block in the grid, that is, the number of thread blocks to be processed by the GPU for the first image: num_block = grid_dim.x * grid_dim.y * grid_dim.z.
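As a concrete illustration of step 101, the following C++ host-side sketch computes num_block from the image size and the preset block dimensions. The BlockDims struct, the ceil_div helper, and the use of ceiling division for image sizes that are not exact multiples of the block dimensions are assumptions of this sketch, not part of the patent text; the 1920x1080x3 image in main is only an example.

#include <cstdio>

struct BlockDims { int x, y, z; };          // preset thread distribution of one thread block

static int ceil_div(int a, int b) { return (a + b - 1) / b; }   // round up so edge pixels are covered

int num_blocks_to_process(int M, int N, int K, BlockDims d) {
    int grid_dim_x = ceil_div(M, d.x);      // thread blocks of the grid in the x dimension
    int grid_dim_y = ceil_div(N, d.y);      // thread blocks of the grid in the y dimension
    int grid_dim_z = ceil_div(K, d.z);      // thread blocks of the grid in the z dimension
    return grid_dim_x * grid_dim_y * grid_dim_z;   // num_block
}

int main() {
    BlockDims preset{32, 8, 8};             // example preset from the description
    printf("num_block = %d\n", num_blocks_to_process(1920, 1080, 3, preset));
    return 0;
}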
In step 102, the number of thread blocks that can be processed by a single SM one-time task of the GPU is determined according to preset thread distribution information.
In implementation, when an SM processes threads, hardware resources such as the shared memory and registers in the SM are used, so the number of threads that the SM can process in one task is limited by these hardware resources. Therefore, when the GPU allocates threads to an SM, the number of thread blocks that a single SM can process in one task can be determined according to the number of threads in a thread block.
Optionally, the number of threads in the thread block may be determined according to the preset thread distribution information, and the number of thread blocks which can be processed by a single SM one-time task of the GPU may be determined according to the number of threads in the thread block and the preset number of threads which can be processed by a single SM of the GPU in one cycle.
The SM processes threads in units of Warps (thread bundles), and the number of threads in a single Warp is 32.
In implementation, a technician may preset the number of threads that a single SM can process in one cycle according to the actual hardware resources of the SM, which also gives the number of Warps that a single SM can process in one cycle. In this embodiment, the number of threads that a single SM can process in one cycle is set to 512, that is, a single SM can process 16 Warps in one cycle. As shown in step 101, the number of threads num_thread in a single thread block is obtained by multiplying the number of threads dim.x in the x dimension, the number of threads dim.y in the y dimension, and the number of threads dim.z in the z dimension of the preset thread distribution information. The number of Warps in a single thread block is then calculated from num_thread and the number of threads in a single Warp: num_warp = num_thread / 32. Finally, the number of thread blocks saturating_block_per_sm that a single SM can process in one task is calculated from the number of Warps a single SM can process in one cycle and num_warp, with the specific calculation formula:
saturating_block_per_sm = (16 + num_warp - 1) / num_warp.
It should be noted that because the SM processes threads in units of Warps and a single Warp contains 32 threads, the number of threads in a thread block in this embodiment should be an integer multiple of 32. This avoids the situation where the SM processes a Warp containing fewer than 32 threads, which would waste SM resources.
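The following sketch, in the same C++ style as above, restates the formulas of step 102. The function and parameter names are assumptions; the Warp size of 32 and the example capacity of 16 Warps (512 threads) per SM cycle come from the description and should be replaced with the values of the actual GPU.

constexpr int WARP_SIZE = 32;                 // threads per Warp

// Thread blocks a single SM can process in one task (saturating_block_per_sm).
int saturating_blocks_per_sm(int num_thread_per_block, int warps_per_sm_cycle /* e.g. 16 */) {
    int num_warp = num_thread_per_block / WARP_SIZE;        // Warps in one thread block
    return (warps_per_sm_cycle + num_warp - 1) / num_warp;  // (16 + num_warp - 1) / num_warp
}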
In step 103, the total SM utilization efficiency is determined based on the number of thread blocks to be processed and the number of thread blocks that can be processed by a single SM task at a time.
In implementation, a technician may obtain the number of SMs num_sm in the GPU from the actual hardware information of the GPU, and then calculate the number of thread blocks saturating_block that all SMs in the GPU can process in one task from num_sm and the number of thread blocks that a single SM can process in one task, with the specific calculation formula:
saturating_block = num_sm * saturating_block_per_sm.
An SM may be said to run saturated if the number of thread blocks it actually processes in one task reaches the number of thread blocks it can process in one task. The number of rounds full_times in which all SMs in the GPU run saturated is calculated from the number of thread blocks to be processed num_block and the number of thread blocks saturating_block that all SMs in the GPU can process in one task, with the specific calculation formula: full_times = num_block / saturating_block, rounded down to an integer if the result is not an integer. Similarly, from num_block and saturating_block, the number of thread blocks remaining_block left to be processed after the full_times rounds is calculated as:
remaining_block = num_block % saturating_block, where % is the remainder operator.
Then, from remaining_block and saturating_block, the saturation rate last_efficiency_sm of the final round in which the remaining thread blocks are processed is calculated with the specific calculation formula: last_efficiency_sm = remaining_block / saturating_block. Next, it is determined whether any thread blocks remain after the full_times rounds, which gives the total number of rounds total_times needed to process all thread blocks: if thread blocks remain, total_times = full_times + 1; otherwise, total_times = full_times. Finally, the total SM use efficiency efficiency_sm is calculated from total_times, last_efficiency_sm, and full_times, with the specific calculation formula:
efficiency_sm = (full_times + last_efficiency_sm) / total_times.
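Putting the formulas of step 103 together, a minimal C++ sketch of the efficiency computation might look as follows; the function name and the use of double for the ratio are assumptions.

// Total SM use efficiency (efficiency_sm) for num_block thread blocks on a GPU
// with num_sm SMs, each able to process saturating_block_per_sm blocks per task.
double total_sm_efficiency(int num_block, int num_sm, int saturating_block_per_sm) {
    int saturating_block = num_sm * saturating_block_per_sm;  // blocks all SMs handle in one task
    int full_times       = num_block / saturating_block;      // fully saturated rounds (rounded down)
    int remaining_block  = num_block % saturating_block;      // blocks left for the final round
    double last_efficiency_sm = (double)remaining_block / saturating_block;
    int total_times = full_times + (remaining_block > 0 ? 1 : 0);
    return (full_times + last_efficiency_sm) / total_times;   // efficiency_sm
}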
in step 104, according to the SM total use efficiency, thread distribution information to be used of the thread block is determined, and image processing is performed on the first image based on the thread distribution information to be used.
In implementation, the total SM use efficiency obtained by the above calculation is used to judge whether the current thread distribution information of the thread block is reasonable, that is, whether the total SM use efficiency reaches the expected level. In this way the reasonable thread distribution information to be used is determined, and image processing is performed on the first image according to that thread distribution information.
As shown in fig. 2, the processing flow of the image processing method may be as follows:
step 201, determining the size of the first image to be processed.
Step 202, determining the number of thread blocks to be processed according to the size of the first image and the preset thread distribution information of the thread blocks.
And step 203, determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information.
And step 204, determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by a single SM task at one time.
Step 205, determine whether the total SM usage efficiency is greater than a preset threshold. If so, step 206 is performed, and if not, step 207 and its subsequent steps are performed.
In practice, a technician can set a threshold for the total SM use efficiency in advance according to the actual needs of processing the first image; the threshold may be set to 90%, for example. If the threshold is set too low, the adjustment brings little benefit; if it is set too high, it may be unreachable. The total SM use efficiency is compared with the preset threshold: if it is greater than the preset threshold, the preset thread distribution information is taken as the thread distribution information to be used; if it is less than the preset threshold, the preset thread distribution information is adjusted.
Step 206, determining the preset thread distribution information as the thread distribution information to be used, and performing image processing on the first image according to the thread distribution information to be used. After execution of step 206, the process ends.
Step 207, determining the least common multiple new_block of the number of thread blocks to be processed and the number of SMs in the GPU.
In step 208, the quotient of the least common multiple divided by the number of thread blocks to be processed is determined as the number of task executions required to execute the image processing on the first image.
In implementation, from the least common multiple new_block obtained in step 207 and the number of thread blocks to be processed num_block, the number of task executions num_times required for processing the first image is calculated with the specific calculation formula: num_times = new_block / num_block.
Step 209, determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
In implementation, the adjusted number of threads new_y corresponding to the dimension to be adjusted is calculated from the number of threads corresponding to that dimension in the preset thread distribution information and the number of task executions num_times required for processing the first image. The dimension to be adjusted may be the x dimension, the y dimension, or the z dimension; this embodiment takes the y dimension as an example, and the other dimensions are handled similarly and are not described again here. The adjusted number of threads new_y corresponding to the y dimension is calculated from dim.y and num_times, and the value of dim.y in the preset thread distribution information is then adjusted according to new_y.
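A C++ sketch of steps 207 to 209 is given below. The translation above leaves the exact formula for new_y implicit, so dividing dim.y by num_times is an assumption of this sketch, chosen so that the smaller blocks yield exactly new_block thread blocks in total; the function name and the lower bound of 1 are also assumptions.

#include <numeric>   // std::lcm (C++17)

// Adjusted thread count for the y dimension, per steps 207-209.
int adjusted_dim_y(int num_block, int num_sm, int dim_y) {
    long long new_block = std::lcm((long long)num_block, (long long)num_sm);  // step 207
    long long num_times = new_block / num_block;   // step 208: required task executions
    int new_y = (int)(dim_y / num_times);          // assumed adjustment of dim.y (step 209)
    return new_y > 0 ? new_y : 1;                  // guard against degenerate values
}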
Step 210, determining whether the change ratio of the number of the adjusted threads corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is greater than a preset change ratio. If yes, go to step 211, if no, go to step 202.
Step 211, readjusting the number of threads corresponding to the dimension to be adjusted according to the preset change ratio to obtain the number of threads after readjustment corresponding to the dimension to be adjusted, adjusting the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of threads after readjustment, and going to the step 202.
In implementation, the number of threads corresponding to the dimension to be adjusted is adjusted to the number of threads just meeting the preset change proportion, and the number of threads is used as the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information. Proceed to step 202.
It should be noted that steps 207 to 211 provide one way of numerically adjusting the preset thread distribution information according to a preset adjustment rule; besides this method, other methods may also be used in the embodiment of the present application, such as adjusting by a fixed value. In addition, steps 210 to 211 provide one way of controlling the degree to which the preset thread distribution information is adjusted; in this embodiment the degree of adjustment may also be left uncontrolled, in which case step 202 is executed directly after the adjustment, or another method may be used to control the degree of adjustment. An end-to-end sketch of this flow is given below.
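The following C++ sketch ties steps 201 to 211 together, reusing the helper functions sketched above (num_blocks_to_process, saturating_blocks_per_sm, total_sm_efficiency, adjusted_dim_y, and BlockDims). The TuneParams struct, the 90% threshold, the 0.5 change ratio, the bounded retry count, and the choice to adjust only the y dimension are assumptions used for illustration, not a definitive implementation of the patent.

struct TuneParams {
    double efficiency_threshold = 0.90;   // preset threshold of step 205 (example value)
    double max_change_ratio     = 0.50;   // assumed preset change ratio of step 210
    int    warps_per_sm_cycle   = 16;     // 512 threads per SM per cycle in the example
    int    num_sm               = 0;      // SM count of the target GPU
};

BlockDims tune_block_dims(int M, int N, int K, BlockDims d, TuneParams p) {
    for (int iter = 0; iter < 16; ++iter) {                        // bounded retries
        int num_block = num_blocks_to_process(M, N, K, d);         // steps 201-202
        int per_sm    = saturating_blocks_per_sm(d.x * d.y * d.z, p.warps_per_sm_cycle);  // step 203
        double eff    = total_sm_efficiency(num_block, p.num_sm, per_sm);                 // step 204
        if (eff > p.efficiency_threshold) break;                   // steps 205-206: keep current dims

        int new_y = adjusted_dim_y(num_block, p.num_sm, d.y);      // steps 207-209
        double change = (double)(d.y - new_y) / d.y;               // relative change of dim.y
        if (change > p.max_change_ratio)                           // steps 210-211: clamp the change
            new_y = (int)(d.y * (1.0 - p.max_change_ratio));
        if (new_y < 1 || new_y == d.y) break;                      // nothing further to adjust
        d.y = new_y;                                               // then re-evaluate (back to step 202)
    }
    return d;   // thread distribution information to be used
}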
After either of the two processing flows above has been executed, when the GPU of the computer device receives an instruction to process a second image, if the second image is detected to have the same size as the first image, the GPU may directly perform image processing on the second image using the thread distribution information to be used obtained in those flows.
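A minimal sketch of this reuse is shown below, again reusing BlockDims, TuneParams, and tune_block_dims from the sketches above: once a suitable thread distribution has been found for one image size, it is looked up instead of being re-derived for later images of the same size. The cache structure and key are assumptions of the sketch.

#include <map>
#include <tuple>

std::map<std::tuple<int, int, int>, BlockDims> g_dims_cache;   // keyed by (width, height, channels)

BlockDims dims_for_image(int M, int N, int K, BlockDims preset, TuneParams p) {
    auto key = std::make_tuple(M, N, K);
    auto it = g_dims_cache.find(key);
    if (it != g_dims_cache.end()) return it->second;   // same size as an earlier image: reuse
    BlockDims tuned = tune_block_dims(M, N, K, preset, p);
    g_dims_cache[key] = tuned;
    return tuned;
}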
Based on the same technical concept, an embodiment of the present application further provides an apparatus for image processing, which may be a computer device in the foregoing embodiment, as shown in fig. 3, and the apparatus includes: a determination module 310 and an execution module 320.
A determining module 310, configured to determine a size of a first image to be processed, and determine the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information;
determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
an executing module 320, configured to determine, according to the SM total usage efficiency, thread distribution information to be used of a thread block, and execute image processing on the first image based on the thread distribution information to be used.
Optionally, the determining module 310 is configured to:
if the total SM use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the total SM use efficiency is smaller than a preset threshold value, carrying out numerical value adjustment on the preset thread distribution information according to a preset adjustment rule, and switching to executing the step of determining the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
Optionally, the thread block is a multi-dimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the determining module 310 is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the number of task executions required for executing the image processing on the first image;
determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
Optionally, the determining module 310 is configured to:
if the change proportion of the number of the threads after adjustment corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the number of the threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the number of the threads after readjustment corresponding to the dimension to be adjusted, and adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after readjustment;
if the change proportion of the number of the threads corresponding to the dimension to be adjusted after adjustment relative to the number of the threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after adjustment.
Optionally, the determining module 310 is configured to:
determining the number of threads in the thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one cycle of the preset GPU.
Optionally, the execution module is further configured to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, performing image processing on the second image based on the thread distribution information to be used.
Optionally, the image processing is image convolution processing.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
It should be noted that: in the image processing apparatus provided in the above embodiment, when processing an image, only the division of the above functional modules is illustrated, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments, and are not described herein again.
Fig. 4 is a schematic structural diagram of a computer device 400 according to an embodiment of the present application. The computer device 400 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where the memory 402 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 401 to implement the method of image processing described above.
In an exemplary embodiment, a computer-readable storage medium is further provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method of image processing in the above embodiments. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method of image processing, the method comprising:
determining the size of a first image to be processed, and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information;
determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and determining the thread distribution information to be used of the thread block according to the total SM use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
2. The method of claim 1, wherein the determining the thread distribution information to be used for the thread block according to the total SM usage efficiency comprises:
if the total SM use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the total SM use efficiency is smaller than a preset threshold value, carrying out numerical value adjustment on the preset thread distribution information according to a preset adjustment rule, and switching to executing the step of determining the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
3. The method of claim 2, wherein the thread block is a multi-dimensional thread block, the thread distribution information includes a number of threads corresponding to each dimension, and the performing the numerical adjustment on the preset thread distribution information according to a preset adjustment rule includes:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the number of task executions required for executing the image processing on the first image;
determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
4. The method according to claim 3, wherein the numerically adjusting the preset thread distribution information based on the adjusted thread number corresponding to the dimension to be adjusted comprises:
if the change proportion of the number of the threads after adjustment corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the number of the threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the number of the threads after readjustment corresponding to the dimension to be adjusted, and adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after readjustment;
if the change proportion of the number of the threads corresponding to the dimension to be adjusted after adjustment relative to the number of the threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after adjustment.
5. The method according to claim 1, wherein the determining, according to the preset thread distribution information, the number of thread blocks that can be processed by a single SM task of the GPU includes:
determining the number of threads in the thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one cycle of the preset GPU.
6. The method according to claim 1, wherein after performing image processing on the first image based on the thread distribution information to be used, further comprising:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, performing image processing on the second image based on the thread distribution information to be used.
7. The method according to any one of claims 1-6, wherein the image processing is image convolution processing.
8. An apparatus for image processing, the apparatus comprising:
the determining module is used for determining the size of a first image to be processed and determining the number of thread blocks to be processed according to the size of the first image and preset thread distribution information of the thread blocks;
determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the preset thread distribution information;
determining the total SM use efficiency according to the number of the thread blocks to be processed and the number of the thread blocks which can be processed by the single SM one-time task;
and the execution module is used for determining the thread distribution information to be used of the thread block according to the total SM use efficiency, and executing image processing on the first image based on the thread distribution information to be used.
9. The apparatus of claim 8, wherein the determining module is configured to:
if the total SM use efficiency is greater than a preset threshold value, determining the preset thread distribution information as thread distribution information to be used;
and if the total SM use efficiency is smaller than a preset threshold value, carrying out numerical value adjustment on the preset thread distribution information according to a preset adjustment rule, and switching to executing the step of determining the number of the thread blocks to be processed according to the number of the threads and the preset thread distribution information of the thread blocks.
10. The apparatus of claim 9, wherein the thread blocks are multidimensional thread blocks, wherein the thread distribution information comprises a number of threads corresponding to each dimension, and wherein the determining module is configured to:
determining the least common multiple of the number of the thread blocks to be processed and the number of SMs in the GPU;
determining the quotient of the least common multiple divided by the number of the thread blocks to be processed as the number of task executions required for executing the image processing on the first image;
determining the number of threads after adjustment corresponding to the dimension to be adjusted based on the number of threads corresponding to the dimension to be adjusted in the preset thread distribution information and the number of task execution times, and performing numerical adjustment on the preset thread distribution information based on the number of threads after adjustment corresponding to the dimension to be adjusted.
11. The apparatus of claim 10, wherein the determining module is configured to:
if the change proportion of the number of the threads after adjustment corresponding to the dimension to be adjusted relative to the number of the threads corresponding to the dimension to be adjusted is larger than a preset change proportion, readjusting the number of the threads corresponding to the dimension to be adjusted according to the preset change proportion to obtain the number of the threads after readjustment corresponding to the dimension to be adjusted, and adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after readjustment;
if the change proportion of the number of the threads corresponding to the dimension to be adjusted after adjustment relative to the number of the threads corresponding to the dimension to be adjusted is smaller than a preset change proportion, adjusting the number of the threads corresponding to the dimension to be adjusted in the preset thread distribution information to the number of the threads after adjustment.
12. The apparatus of claim 8, wherein the determining module is configured to:
determining the number of threads in the thread block according to the preset thread distribution information;
and determining the number of thread blocks which can be processed by a single SM one-time task of the GPU according to the number of threads in the thread blocks and the number of threads which can be processed by a single SM one cycle of the preset GPU.
13. The apparatus of claim 8, wherein the execution module is further configured to:
when an image processing instruction for a second image is received, if the second image is the same as the first image in size, performing image processing on the second image based on the thread distribution information to be used.
14. The apparatus according to any of claims 8-13, wherein the image processing is image convolution processing.
CN201811142940.9A 2018-09-28 2018-09-28 Image processing method and device Active CN110969565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142940.9A CN110969565B (en) 2018-09-28 2018-09-28 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811142940.9A CN110969565B (en) 2018-09-28 2018-09-28 Image processing method and device

Publications (2)

Publication Number Publication Date
CN110969565A true CN110969565A (en) 2020-04-07
CN110969565B CN110969565B (en) 2023-05-16

Family

ID=70027709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142940.9A Active CN110969565B (en) 2018-09-28 2018-09-28 Image processing method and device

Country Status (1)

Country Link
CN (1) CN110969565B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100123717A1 (en) * 2008-11-20 2010-05-20 Via Technologies, Inc. Dynamic Scheduling in a Graphics Processor
CN103282888A (en) * 2011-12-27 2013-09-04 华为技术有限公司 Data processing method, graphics processing unit (gpu) and first node device
CN103871032A (en) * 2014-03-07 2014-06-18 福建工程学院 Image enhancement method for Wallis filter based on GPU (Graphics Processing Unit)
CN105653243A (en) * 2015-12-23 2016-06-08 北京大学 Method for distributing tasks by general purpose graphic processing unit in multi-task concurrent execution manner
CN107544845A (en) * 2017-06-26 2018-01-05 新华三大数据技术有限公司 GPU resource dispatching method and device
CN107563955A (en) * 2017-09-12 2018-01-09 武汉锐思图科技有限公司 A kind of parallel map dicing method and system based on GPU

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张军; 何炎祥; 沈凡凡; 江南; 李清安: "GPGPU thread block compaction scheduling method based on two-stage synchronization" (in Chinese) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590275A (en) * 2020-04-30 2021-11-02 EMC IP Holding Company LLC Method, electronic device and computer program product for processing data
US11657324B2 (en) * 2020-04-30 2023-05-23 EMC IP Holding Company LLC Method, electronic device, and computer program product for processing data
CN114860341A (en) * 2022-05-19 2022-08-05 Beijing Baidu Netcom Science and Technology Co., Ltd. Thread configuration method, device, apparatus, storage medium and program product
CN114860341B (en) * 2022-05-19 2023-09-22 Beijing Baidu Netcom Science and Technology Co., Ltd. Thread configuration method, device, apparatus and storage medium

Also Published As

Publication number Publication date
CN110969565B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
EP3129880B1 (en) Method and device for augmenting and releasing capacity of computing resources in real-time stream computing system
KR20190049593A (en) Method and apparatus for performing operations in convolutional neural network
CN107797853B (en) Task scheduling method and device and multi-core processor
CN110308982B (en) Shared memory multiplexing method and device
CN108270805B (en) Resource allocation method and device for data processing
CN106293947B (en) GPU-CPU (graphics processing Unit-Central processing Unit) mixed resource allocation system and method in virtualized cloud environment
CN115237580B (en) Intelligent calculation-oriented flow parallel training self-adaptive adjustment system and method
CN105488134A (en) Big data processing method and big data processing device
CN111429333A (en) GPU dynamic frequency modulation method, device and system
Huang et al. Novel heuristic speculative execution strategies in heterogeneous distributed environments
CN110969565A (en) Image processing method and device
CN108470211B (en) Method and device for realizing convolution calculation and computer storage medium
CN111860867B (en) Model training method and system for hybrid heterogeneous system and related device
CN113010286A (en) Parallel task scheduling method and device, computer equipment and storage medium
CN113010262B (en) Memory optimization method based on cloud computing
Abraham et al. Group-based parallel multi-scheduler for grid computing
CN110413393B (en) Cluster resource management method and device, computer cluster and readable storage medium
JP2011141703A (en) System, method and program for arranging resource
CN105468455A (en) Dynamic task distribution method and apparatus for multiple devices
CN116107753A (en) Task node distribution method and device, electronic equipment and storage medium
US9870599B2 (en) Analysis system and method for reducing the control flow divergence in the Graphics Processing Units (GPUs)
CN112130977B (en) Task scheduling method, device, equipment and medium
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
CN113778518A (en) Data processing method, data processing device, computer equipment and storage medium
CN111581041A (en) Method and equipment for testing performance of magnetic disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant