WO2023279740A1

WO2023279740A1 - Image processing method and apparatus, and electronic device and storage medium

Info

Publication number: WO2023279740A1
Application number: PCT/CN2022/078439
Authority: WO
Inventors: 刘宇玺
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-07-09
Filing date: 2022-02-28
Publication date: 2023-01-12
Also published as: CN113378863B; CN113378863A

Abstract

The present disclosure relates to an image processing method and apparatus, and an electronic device and a storage medium. The method comprises: extracting image features from a target image, so as to obtain a feature map for representing the image features; calculating an index position of a feature value in the feature map and/or a convolution kernel by means of a plurality of first threads; and controlling a plurality of second threads to read, according to the index position, the feature value in the feature map and/or the convolution kernel, and performing convolution processing by using the read feature value, so as to obtain a convolution feature, wherein the convolution feature is used for representing an extraction result of the image feature. By means of the embodiments of the present disclosure, the efficiency of a convolution operation during image processing can be improved.

Description

An image processing method and device, electronic equipment, and storage medium

This application claims priority to Chinese patent application No. 202110779002.5, filed on July 9, 2021 and entitled "An image processing method and device, electronic equipment, and storage medium", the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the field of computer technology, and in particular to an image processing method and device, electronic equipment, storage media and computer program products.

Background

The graphics processing unit (GPU) is widely used as a hardware accelerator in high-performance computing. In recent years GPUs have been widely adopted in artificial intelligence (AI), especially in deep learning, where the massive amounts of data processed during training and inference are accelerated by the GPU.

The features of an image can be represented as a matrix, in which each value represents the pixel at the corresponding position in the image. Convolution can be implemented by performing matrix multiply-accumulate operations on this matrix.

Summary of the invention

The disclosure proposes an image processing technical solution.

According to an aspect of the present disclosure, an image processing method is provided, including:

Extracting the image features in the target image to obtain a feature map used to characterize the image features;

Computing index positions of feature values in feature maps and/or convolution kernels by multiple first threads;

controlling a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and performing convolution processing using the read feature values to obtain a convolution feature, where the convolution feature is used to characterize the extraction result of the image features.

In a possible implementation manner, the convolution process is performed using the read feature value to obtain the convolution feature, including:

Arranging the read feature values along the channel (K) dimension of a matrix multiply-accumulate (MMA) instruction, and performing the matrix multiply-accumulate operation to obtain the convolution feature.

In a possible implementation manner, calculating the index positions of feature values in the feature map and/or the convolution kernel by multiple first threads includes:

Controlling at least two threads, according to the order of the identification IDs of the at least two threads, to respectively calculate the index positions of feature values in the feature map and/or the convolution kernel.

In a possible implementation manner, after the index positions of feature values in the feature map and/or the convolution kernel are calculated by the multiple first threads, the method further includes:

Loading the feature values at the index positions into shared memory (SMem) for use by each thread.

In a possible implementation manner, the controlling multiple second threads to read feature values in feature maps and/or convolution kernels according to the index positions includes:

controlling the plurality of second threads to read the feature values at the index positions into respective registers;

Performing convolution processing using the read feature values to obtain the convolution feature includes:

Performing convolution processing using the feature values in the registers to obtain the convolution feature.

In a possible implementation, after the image features in the target image are extracted, the method further includes:

Determining multiple first sizes of feature blocks that can be processed by a single thread warp (Warp), and generating graphics processor kernel functions for dividing feature blocks based on the first sizes; wherein the maximum value of the first size is determined according to the register capacity, the minimum value of the first size is the size of the smallest matrix unit computed by the matrix multiply-accumulate instruction, and each of the first sizes is a multiple of the minimum value.

Determining multiple second sizes of feature blocks that can be processed by a single thread block (TB), and generating graphics processor kernel functions for dividing feature blocks based on the second sizes; wherein the minimum value of the second size is the first size of a feature block that can be processed by a single thread warp, the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in a TB.

According to an aspect of the present disclosure, an image processing device is provided, including:

An extraction unit is used to extract image features in the target image to obtain a feature map for representing image features;

An index calculation unit is used to calculate the index position of the feature value in the feature map and/or the convolution kernel through a plurality of first threads;

A convolution processing unit, configured to control multiple second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and to perform convolution processing using the read feature values to obtain a convolution feature, where the convolution feature is used to characterize the extraction result of the image features.

In a possible implementation manner, the convolution processing unit is configured to arrange the read feature values along the channel (K) dimension of a matrix multiply-accumulate (MMA) instruction, and to perform the matrix multiply-accumulate operation to obtain the convolution feature.

In a possible implementation manner, the index calculation unit is configured to control at least two threads, according to the order of the identification IDs of the at least two threads, to respectively calculate the index positions of feature values in the feature map and/or the convolution kernel.

In a possible implementation manner, the device further includes:

The loading unit is configured to load the feature values at the index positions into shared memory (SMem) for use by each thread.

In a possible implementation manner, the convolution processing unit is configured to control the plurality of second threads to read the feature values at the index positions into their respective registers, and to perform convolution processing using the feature values in the registers to obtain the convolution feature.

In a possible implementation manner, the device further includes:

The first size determination unit is configured to determine multiple first sizes of feature blocks that can be processed by a single thread warp, and to generate graphics processor kernel functions for dividing feature blocks based on the first sizes; wherein the maximum value of the first size is determined according to the register capacity, the minimum value of the first size is the size of the smallest matrix unit computed by the matrix multiply-accumulate instruction, and each of the first sizes is a multiple of the minimum value.

In a possible implementation manner, the device further includes:

The second size determination unit is configured to determine multiple second sizes of feature blocks that can be processed by a single thread block (TB), and to generate graphics processor kernel functions for dividing feature blocks based on the second sizes; wherein the minimum value of the second size is the first size of a feature block that can be processed by a single thread warp, the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in a TB.

According to an aspect of the present disclosure, there is provided an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute the above-mentioned method.

According to one aspect of the present disclosure, there is provided a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.

According to an aspect of the present disclosure, there is provided a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium bearing computer-readable code, wherein when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.

In the embodiments of the present disclosure, multiple first threads calculate the index positions of feature values in the feature map and/or the convolution kernel; multiple second threads are then controlled to read those feature values according to the index positions and to perform convolution processing using the read feature values, obtaining the convolution feature. For convolution implemented as matrix multiplication, and in particular for the large-image, few-channel convolution type, the number of channels is small; to make full use of computing resources, the index positions of the feature values in the feature map and/or the convolution kernel can be pre-calculated by the multiple first threads. Once the index positions are known, the multiple second threads can read the feature values according to those positions for calculation, rather than having the indexes calculated by all threads in one calculation point to the same point of the feature map and/or the convolution kernel in each channel. This reduces the zero padding that is otherwise needed when data cannot be read because the index position is unknown, so that every thread is fully used for convolution calculation, fewer GPU resources are wasted, and the efficiency of the convolution operation is improved.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Description of drawings

The accompanying drawings here are incorporated into the description and constitute a part of the present description. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solution of the present disclosure.

FIG. 1 shows a schematic diagram of feature map dimensions according to an embodiment of the disclosure.

FIG. 2 shows a schematic diagram of a convolution process of a matrix multiplication operation according to an embodiment of the present disclosure.

Fig. 3 shows a schematic diagram of a process of reading feature values by a convolution operation according to an embodiment of the present disclosure.

FIG. 4 shows a flowchart of an image processing method according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of an application scenario of an image processing method according to a disclosed embodiment.

Fig. 6 shows a block diagram of an image processing device according to an embodiment of the present disclosure.

Fig. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Fig. 8 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures indicate functionally identical or similar elements. While various aspects of the embodiments are shown in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior or better than other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them; for example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set formed by A, B, and C.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific implementation manners. It will be understood by those skilled in the art that the present disclosure may be practiced without some of the specific details. In some instances, methods, means, components and circuits that are well known to those skilled in the art have not been described in detail, so as not to obscure the gist of the present disclosure.

As described in the background, the efficiency of the convolution operation in the related art needs to be improved. To improve it, the convolution operation can be performed on a tensor computing unit (Tensor Core). A Tensor Core is a matrix multiply-accumulate unit that completes multiple multiply-accumulate operations in one cycle and achieves very high computing performance; on the A100 GPU, the int8 performance of Tensor Core can reach 624 TOPS. The increase in chip computing power mainly serves to accelerate frequently used, computation-intensive operators such as convolution and matrix multiplication, but it also raises the problem of how to use that computing power efficiently, that is, how to implement convolution operators efficiently on Tensor Core.

A typical convolutional neural network (CNN) consists of an input layer (Input), a convolutional layer (Conv), an activation function (Relu), and a fully connected layer (FC). As the original image data passes through the network, its low-level features are gradually extracted and finally form high-level semantic features.

The convolution operation is the process of multiplying and adding the feature map (Feature Map) of the image and the filter kernel (Filter Kernel) to extract the image information. The convolution operation is the most time-consuming operation in the neural network. Most of the time overhead of the deep learning model is on the convolution operator. The performance of the convolution operator has an important impact on the program performance.

According to the size of the feature map (Feature Map), convolution types can be divided into the large-image, few-channel type and the small-image, multi-channel type, as shown in Figure 1. The three dimensions of each feature map, "length × width × channel", are marked above the feature map, for example 224×224×3 or 56×56×64. The convolution kernel size, "length × width", is marked below the convolution kernel, for example 7×7 or 1×1.

Large-image, few-channel type: this type of convolution generally occurs in the initial stage of the neural network. The length and width of the feature map are large (i.e., a large image), for example 224, but the number of channels is relatively small (i.e., few channels), for example 3. This type of convolution is a memory-intensive operation, and the computing power of Tensor Core cannot be fully utilized for it.

Small-image, multi-channel type: this type of convolution occurs in the middle and final stages of the neural network. The length and width of the feature map are small (i.e., a small image), for example 56, 28, or 14, but the number of channels is relatively large (i.e., multi-channel), for example 64, 128, or 256. This type of convolution is a compute-intensive operation and is well suited to the computing power of Tensor Core.

Currently, the convolution algorithm used on Tensor Core is mainly the implicit matrix multiplication algorithm. Matrix multiplication can be expressed as the product of an M×K matrix A and a K×N matrix B: each element of a row of A is multiplied by the corresponding element of a column of B along the K dimension and the products are accumulated, giving an M×N matrix C.
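As a minimal illustration of the M×K by K×N product described here, the following Python sketch accumulates products along the shared K dimension. The function name `matmul` and the sample matrices are illustrative and are not taken from the disclosure; an actual implementation would run on the GPU rather than in nested loops.

```python
def matmul(a, b):
    """Multiply an M x K matrix by a K x N matrix, returning an M x N matrix."""
    m_dim, k_dim, n_dim = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("inner dimensions of A and B must match")
    c = [[0] * n_dim for _ in range(m_dim)]
    for m in range(m_dim):
        for n in range(n_dim):
            # Multiply-accumulate along the shared K dimension.
            for k in range(k_dim):
                c[m][n] += a[m][k] * b[k][n]
    return c


A = [[1, 2, 3], [4, 5, 6]]      # M = 2, K = 3
B = [[1, 0], [0, 1], [1, 1]]    # K = 3, N = 2
C = matmul(A, B)                # M x N = 2 x 2
print(C)
```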

The GPU programming model typically has three levels: the thread grid (Grid), the thread block (Thread Block, TB), and the thread (Thread). The thread block is the basic unit of task allocation, and having enough thread blocks ensures that the GPU's hardware computing units are fully utilized. In the embodiments of the present disclosure, matrix multiplication uses a blocking technique to divide the whole large matrix multiplication task into multiple small matrix multiplication tasks, each of which is assigned to a different thread block for execution. For example, the task for matrix C (of dimension M×N) is divided so that each thread block computes a small Mtile×Ntile block, where Mtile is the feature-block size in the M dimension and Ntile is the feature-block size in the N dimension. When M=N=1024, blocking with Mtile=Ntile=128 in the M and N dimensions produces (1024/128)×(1024/128)=64 feature blocks in total. A T4 GPU needs at least 40 thread blocks to use all of its computing units (Streaming Multiprocessors, SMs).
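The block-count arithmetic in this example can be checked with a short Python sketch. The helper name `count_feature_blocks` is hypothetical; the figure of 40 thread blocks for the T4 is the one given in the text.

```python
def count_feature_blocks(m, n, m_tile, n_tile):
    """Number of Mtile x Ntile blocks needed to cover an M x N output matrix."""
    blocks_m = -(-m // m_tile)  # ceiling division
    blocks_n = -(-n // n_tile)
    return blocks_m * blocks_n


MIN_BLOCKS_T4 = 40  # thread blocks needed to occupy every SM, per the text

blocks = count_feature_blocks(1024, 1024, 128, 128)
fills_gpu = blocks >= MIN_BLOCKS_T4
print(blocks, fills_gpu)
```

With these sizes the 64 feature blocks exceed the 40 thread blocks needed, so the M and N dimensions alone are enough to keep every SM busy.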

However, after a large-image, few-channel convolution is converted into matrix multiplication, the matrix sizes are as shown in Figure 2. The M dimension of matrix A and the N dimension of matrix B are large, so enough feature blocks are generated to use the GPU's hardware resources, whereas the K dimension is very small. On the GPU, the convolution operation is performed by issuing matrix multiply-accumulate (Matrix Multiply and Accumulate, MMA) instructions; for example, the smallest matrix unit processed by one MMA instruction is 16×8 in the M×N dimensions and 8 in the K dimension. The K dimension of the convolution may therefore be even smaller than the K of the M×N×K shape of the MMA instruction.

Take the FP16 MMA instruction of the T4 GPU as an example, whose M×N×K size is 16×8×8. When the number of channels of the convolution kernel is less than 8, zeros must be padded at the end of the channel dimension until it reaches a multiple of 8 before the data can be sent to the Tensor Core for execution. The more zeros are padded, the more of the computation is redundant and wasted, and the lower the utilization of the Tensor Core.
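The cost of this padding can be estimated with a small Python sketch. The function names are hypothetical, and "utilization" here means only the fraction of K slots that carry real channel data; it is not a measured hardware figure.

```python
def padded_channels(channels, mma_k=8):
    """Round the channel count up to the next multiple of the MMA K size."""
    return -(-channels // mma_k) * mma_k


def k_utilization(channels, mma_k=8):
    """Fraction of the K slots that hold real data rather than padding zeros."""
    return channels / padded_channels(channels, mma_k)


for channels in (2, 3, 8, 64):
    print(channels, padded_channels(channels), k_utilization(channels))
```

For 2 channels only a quarter of the K slots carry data, whereas channel counts that are already multiples of 8 need no padding.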

Figure 3 shows the problem that arises when a convolution kernel of size 3×3 with 2 channels is executed by an MMA instruction with K=8. Because the GPU uses the Single Instruction Multiple Threads (SIMT) programming model, different threads belonging to the same thread warp execute the same instruction, so the indexes calculated by threads T0, T1, T2, and T3 all point to the first point F(0,0) of the convolution kernel in each channel. However, the convolution kernel has only 2 channels while the K size of the MMA instruction is 8, so 6 zeros must be padded in the K direction where no value can be read. Each thread reads 2 values, so only thread T0 reads real data, and the data read by T1, T2, and T3 are all invalid padding zeros. The utilization of Tensor Core is then only 25%, which is a serious waste of resources. The convolution sliding window operation (Sliding Window) traverses all the points of the convolution kernel in turn, from F(0,0) and F(0,1) through F(2,2), so 9 passes are needed to traverse the 3×3 convolution kernel.
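The situation described for Figure 3 can be modelled in a few lines of Python. This is a simplified sketch, not GPU code: the constant and function names are hypothetical, and it assumes that thread Tn would cover channels 2n and 2n+1 of the single kernel point handled in each pass, with `None` standing for a padded zero.

```python
KERNEL_H, KERNEL_W, CHANNELS = 3, 3, 2   # kernel from the Figure 3 example
MMA_K = 8                                # K size of one MMA instruction
VALUES_PER_THREAD = 2                    # each thread supplies 2 K slots
THREAD_IDS = ["T0", "T1", "T2", "T3"]    # MMA_K / VALUES_PER_THREAD threads


def naive_schedule():
    """One MMA pass per kernel point; every thread targets the same point."""
    steps = []
    for row in range(KERNEL_H):
        for col in range(KERNEL_W):
            step = {}
            for lane, tid in enumerate(THREAD_IDS):
                first = lane * VALUES_PER_THREAD
                # Channels beyond the real channel count are zero padding (None).
                step[tid] = [
                    (row, col, ch) if ch < CHANNELS else None
                    for ch in range(first, first + VALUES_PER_THREAD)
                ]
            steps.append(step)
    return steps


steps = naive_schedule()
useful = sum(
    value is not None
    for step in steps
    for slots in step.values()
    for value in slots
)
utilization = useful / (len(steps) * MMA_K)
print(len(steps), utilization)
```

Under these assumptions the model reproduces the figures in the text: 9 passes, with only T0 supplying real data in each pass.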

The steps of the image processing method may be executed by hardware, or by a processor running computer-executable code. In a possible implementation, the image processing method may be executed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory.

For ease of description, in one or more embodiments of this specification the image processing method is described with a graphics processing unit (GPU) as the execution subject. It can be understood that using the GPU as the execution subject is only an example and should not be understood as limiting the method.

FIG. 4 shows a flowchart of an image processing method according to an embodiment of the present disclosure. As shown in FIG. 4, the image processing method includes:

In step S11, image features in the target image are extracted to obtain a feature map for characterizing image features.

The expression of an image in computer technology can be a matrix composed of pixel values, so the analysis of the image can be the analysis of the matrix representing the pixel values of the image.

The image feature here may be used to represent the pixel value of the image, and the image feature may be a matrix composed of the pixel values of the image, or may be an image feature obtained after multiple convolution operations, which is not limited in the present disclosure.

In step S12, index positions of feature values in feature maps and/or convolution kernels are calculated by multiple first threads.

The convolution kernel can take the form of a matrix with three dimensions, "length × width × channel". The feature values in this matrix are the weights used during convolution; with these weights, the convolution operation extracts the desired image features and suppresses other image features. The feature map is likewise represented as a matrix with the three dimensions "length × width × channel", and the feature values in this matrix represent the pixel values of the image.

When data is stored in the storage space, there will be an index position, which is used to represent the storage location of the data in the storage space. In the embodiment of the present disclosure, the feature value in the storage space can be read directly according to the index position to perform convolution calculation, without expanding the feature map into a matrix and then performing convolution calculation, which can save memory resources.
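To give a sense of the memory at stake, the sketch below compares the element count of a feature map with that of an explicitly expanded matrix (an im2col-style expansion, named here for illustration). The 224×224×3 feature map and 7×7 kernel echo the sizes mentioned for Figure 1, but the stride of 2 and padding of 3 are assumptions chosen for the example, as are the function names.

```python
def conv_output_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def expanded_elements(height, width, channels, kernel, stride, padding):
    """Element count if every sliding window were copied into a matrix row."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    return out_h * out_w * kernel * kernel * channels


original = 224 * 224 * 3
expanded = expanded_elements(224, 224, 3, kernel=7, stride=2, padding=3)
print(original, expanded, expanded / original)
```

Reading feature values directly by index position avoids materializing the larger expanded matrix.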

For large-image, few-channel convolution, the number of channels is small. To make full use of computing resources, multiple first threads can read feature values sequentially along the length and width directions of the feature map and/or the convolution kernel, so that every thread is used to calculate index positions. A thread order can therefore be preset as the order in which the threads calculate index positions along the length and width directions of the feature map and/or the convolution kernel. The first threads can calculate the index positions of the feature values in parallel, with each first thread calculating the index positions of different feature values.

For example, taking the calculation of index positions in the convolution kernel: for four threads numbered 0, 1, 2, and 3, the numbers can serve as the thread order. Along the length and width directions of the convolution kernel, thread 0 then calculates the index position of the first feature value, thread 1 that of the second feature value, thread 2 that of the third feature value, and thread 3 that of the fourth feature value.

It should be noted that each time a thread reads a feature value, it can read the values on two channels. For example, for a convolution kernel of size 3×3 with 2 channels, following the thread order in step S12 and with each thread handling 2 channels, the MMA instruction needs to be executed only 3 times to cover all the feature values, which makes the calculation more efficient.
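The count of 3 MMA executions can be reproduced with a simplified Python model. This is one plausible reading of the scheme rather than the disclosed implementation: it assumes 4 threads per MMA instruction, each supplying the 2 channel values of one kernel point, with consecutive kernel points handed to consecutive threads. All names are hypothetical, and `None` marks a thread that has no point left and falls back to zero padding.

```python
KERNEL_H, KERNEL_W, CHANNELS = 3, 3, 2
MMA_K = 8
THREAD_IDS = ["T0", "T1", "T2", "T3"]   # each thread supplies CHANNELS K slots


def packed_schedule():
    """Give consecutive kernel points to consecutive threads in each MMA pass."""
    points = [(r, c) for r in range(KERNEL_H) for c in range(KERNEL_W)]
    steps = []
    for start in range(0, len(points), len(THREAD_IDS)):
        group = points[start:start + len(THREAD_IDS)]
        # Threads with no remaining point fall back to zero padding (None).
        group += [None] * (len(THREAD_IDS) - len(group))
        steps.append(dict(zip(THREAD_IDS, group)))
    return steps


steps = packed_schedule()
mma_count = len(steps)
useful = sum(CHANNELS for step in steps for p in step.values() if p is not None)
utilization = useful / (mma_count * MMA_K)
print(mma_count, utilization)
```

In this model only the last pass contains padding, so the share of K slots carrying real data rises from 25% to 75%.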

In step S13, multiple second threads are controlled to read feature maps and/or feature values in the convolution kernel according to the index positions, and perform convolution processing using the read feature values to obtain convolution features, The convolution feature is used to characterize the extraction result of the image feature.

After the index positions are determined, each second thread can, when performing calculations, read each feature value in the feature map and/or the convolution kernel directly according to its index position and use it as an operand of the convolution calculation instruction. The values required in each dimension of a single convolution calculation instruction can therefore be filled as fully as possible, reducing the need for zero padding.

After the feature values are read from the storage space, they can be used to perform convolution processing to obtain the convolution feature. The convolution processing can be based on matrix multiplication, which is especially suitable for large-image, few-channel convolution and offers high processing efficiency.

It should be noted that the first threads and the second threads may be the same threads or different threads. The terms "first" and "second" in the embodiments of the present disclosure serve only to distinguish the described objects and should not be understood as limiting their order or as indicating or implying relative importance.

In a possible implementation manner, performing convolution processing using the read feature values to obtain the convolution feature includes: arranging the read feature values along the channel (K) dimension of a matrix multiply-accumulate (MMA) instruction, and performing the matrix multiply-accumulate operation to obtain the convolution feature.

In matrix multiplication, the MMA instruction performs the multiply-accumulate operation. For the MMA instruction, the minimum size in the K (channel) direction is 8; that is, one MMA execution can process 8 feature values in the K dimension. When the index positions of the feature values of the feature map and/or the convolution kernel are known, the feature values for all 8 positions in the K dimension can be read in turn according to the index positions and used as the operands of the MMA instruction. This makes full use of the computing power of the second threads, reduces the waste of GPU resources, and improves the efficiency of the convolution operation.

In a possible implementation manner, calculating the index positions of feature values in the feature map and/or the convolution kernel by multiple first threads includes: determining the index position of each feature value in the storage space according to the position of the feature value within the feature map and/or the convolution kernel, and according to the data arrangement rule of the feature values of the feature map and/or the convolution kernel in the storage space.

Data stored in the storage space follows a certain arrangement rule. For feature matrices there are two common rules, the NCHW arrangement and the NHWC arrangement, where N is the number of feature maps, C the channel, H the height (length) of the matrix, and W the width. The NCHW arrangement lays out the values of the matrix in the priority order [N, C, H, W], and the NHWC arrangement lays them out in the priority order [N, H, W, C].

Taking the convolution kernel as an example, the channel, height, and width of each feature value are known, that is, the position of each feature value within the convolution kernel is known. Given the data arrangement rule of the kernel's feature values in the data storage space, the position of each feature value and the arrangement rule together determine the index position of each feature value in the storage space.

In the embodiments of the present disclosure, the index position of each feature value in the storage space is determined according to the position of the feature value within the feature map and/or the convolution kernel and the data arrangement rule of the feature values in the storage space. In this way, the index positions of the feature values that each thread is to read from the feature map and/or the convolution kernel can be calculated accurately.
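The mapping from a feature value's position to its index position can be sketched in Python for the two arrangement rules. The function names and the 3×3, 2-channel example sizes are illustrative; the formulas are the standard row-major offsets for the [N, C, H, W] and [N, H, W, C] orders, assuming densely packed storage with no extra strides.

```python
def index_nchw(n, c, h, w, C, H, W):
    """Linear offset when values are laid out in [N, C, H, W] order."""
    return ((n * C + c) * H + h) * W + w


def index_nhwc(n, c, h, w, C, H, W):
    """Linear offset when values are laid out in [N, H, W, C] order."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 2, 3, 3  # a 3x3 kernel with 2 channels

# Point F(0,1) on channel 1 of the first kernel.
print(index_nchw(0, 1, 0, 1, C, H, W))
print(index_nhwc(0, 1, 0, 1, C, H, W))
```

In the NHWC arrangement the two channel values of one point are adjacent in storage, whereas in NCHW they are H×W elements apart.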

In a possible implementation manner, the calculating, through multiple first threads, the index positions of feature values in the feature map and/or the convolution kernel includes: controlling at least two threads, according to the order of the identification IDs of the at least two threads, to respectively calculate the index positions of the feature values in the feature map and/or the convolution kernel.

Taking the calculation of the index positions of the feature values of the convolution kernel as an example, each thread in the GPU has an identification ID used to distinguish it from other threads. For example, the IDs may be T0, T1, T2, and T3, in which case the order of the thread IDs may be the ascending order [T0, T1, T2, T3]. It should be noted that this order is cyclic: after T3, the order starts again from T0, and [T0, T1, T2, T3] repeats until the index positions of all feature values in the convolution kernel have been calculated.

In the embodiment of the present disclosure, since each thread has its own ID, the threads are controlled to calculate the index positions of the feature values in the feature map and/or the convolution kernel in the order of their identification IDs, so that the threads calculate the index positions of the feature values in parallel, which improves the efficiency of the index position calculation.
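The cyclic ordering of thread IDs described above can be sketched as a round-robin assignment. The following Python sketch only models which thread would compute which index position; the function name assign_round_robin is hypothetical, and real GPU threads would perform their shares concurrently rather than in a loop.

```python
def assign_round_robin(num_positions, thread_ids):
    """Assign feature-value positions to threads in cyclic ID order."""
    assignment = {tid: [] for tid in thread_ids}
    for pos in range(num_positions):
        # After the last ID, the order wraps around to the first ID.
        tid = thread_ids[pos % len(thread_ids)]
        assignment[tid].append(pos)
    return assignment


# Nine positions of a 3x3 convolution kernel shared among four threads.
work = assign_round_robin(9, ["T0", "T1", "T2", "T3"])
for tid, positions in work.items():
    print(tid, positions)
```

With nine positions and four threads, T0 receives one more position than the others because the cycle restarts at T0 after T3.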

In a possible implementation manner, after the calculating, through multiple first threads, the index positions of feature values in the feature map and/or the convolution kernel, the method further includes: loading the values of the index positions into a shared memory (Shared Memory, SMem) for use by each thread.

Although the size of the shared memory is limited, it has high read/write speed and bandwidth. Since the values of the index positions (index values) occupy little memory, they can be stored in the SMem to improve the efficiency of reading and writing the index values, thereby improving the efficiency of the convolution operation and, in turn, of the image processing operation.

In a possible implementation manner, the controlling the multiple second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions includes: controlling the multiple second threads to read the feature values at the index positions into their respective registers; and the performing convolution processing using the read feature values to obtain convolution features includes: performing convolution processing using the feature values in the registers to obtain the convolution features.

When a Tensor Core is used, the MMA instruction is often used together with the ldmatrix instruction. Usually, the data of matrix A and matrix B required by the MMA instruction is read from the global memory (Global Memory, GMem) and placed in the SMem, and the ldmatrix instruction is then used to read the data of matrix A and matrix B into the registers of each thread according to a specific matrix shape. The advantage of placing the data in the SMem is that the data is shared between threads, which reduces the latency of reading data from the GMem. However, for matrix multiplication operations on large images with few channels, little convolution data needs to be shared, and caching the data in the SMem adds the latency of reading and writing the SMem.

Therefore, in the embodiment of the present disclosure, the data required by the MMA instruction is read directly into the registers of each thread, which makes full use of the characteristics of convolution on large images with few channels and reduces the latency of data reading.

In the embodiment of the present disclosure, in the process of performing the convolution operation, the matrix of the convolution operation can be divided into blocks: the feature matrix is divided into multiple feature blocks, and the multiple feature blocks are then processed in parallel to improve the efficiency of the convolution. Specifically, the possible feature block sizes in the thread block (TB), thread warp (Warp), and K dimensions can be determined, and multiple corresponding graphics processor kernel functions (kernels) can be generated, so that a suitable kernel can later be selected according to the size of the image features to perform convolution processing on them. The three dimensions are described in detail below.

In a possible implementation, after the image features in the target image are extracted, the method further includes: determining multiple first sizes of feature blocks that can be processed in a single thread warp (Warp), and generating, based on the first sizes, graphics processor kernel functions for feature block division; wherein the maximum value of the first size is determined according to the register capacity, the minimum value of the first size is the size of the smallest matrix unit computed by the matrix multiply-add operation instruction, and the values of the multiple first sizes are multiples of the minimum value.

In the GPU, the convolution operation is performed through issued matrix multiply-accumulate (MMA) instructions. For example, the smallest matrix unit processed by an MMA instruction is 16×8 in the M×N dimensions and 8 in the K dimension.

In a single Warp, the size of the smallest matrix unit that can be processed is the size of the smallest matrix unit that can be processed by the MMA instruction. Since the instruction operations are accumulated in powers of 2, the size of a feature block that can be processed by a single Warp is the size of the smallest matrix unit multiplied by a power of 2: N can be 8, 16, 32, or 64, and M can be 16, 32, 64, or 128. The resulting values of M×N are shown in Table 1.

Considering the limitation of the register capacity, the size of the feature block cannot be increased indefinitely, and its maximum value can be determined according to the register capacity. For example, a 128×64 Warp feature block cannot be held in the registers, so the maximum value of the first size of the feature block is 128×32 or 64×64, as shown in Table 1.

Table 1 Warp-level feature block size

In the embodiment of the present disclosure, all first sizes of the feature blocks can be traversed in the Warp dimension, and graphics processor kernel functions (kernels) can then be generated based on the first sizes. The kernels obtained in this way can be applied to the division of image features of various sizes and therefore have high universality.
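The traversal of Warp-level sizes can be sketched as follows in Python. The element budget MAX_ELEMENTS is an assumption inferred only from the statement that 128×64 does not fit in the registers while 128×32 and 64×64 do; the actual register capacity and the exact contents of Table 1 are not reproduced here, and the function name warp_tile_sizes is hypothetical.

```python
MMA_M, MMA_N = 16, 8       # smallest M x N unit of one MMA instruction (from the text)
MAX_ELEMENTS = 128 * 32    # assumed register budget, in M x N elements


def warp_tile_sizes(max_elements=MAX_ELEMENTS, steps=4):
    """List candidate (M, N) sizes: the MMA unit scaled by powers of 2."""
    sizes = []
    for i in range(steps):          # M = 16, 32, 64, 128
        for j in range(steps):      # N = 8, 16, 32, 64
            m, n = MMA_M << i, MMA_N << j
            if m * n <= max_elements:
                sizes.append((m, n))
    return sizes


candidates = warp_tile_sizes()
print(candidates)
```

Under this assumed budget, every combination of the listed M and N values is kept except 128×64, which is consistent with the maximum sizes stated above.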

In a possible implementation manner, after the image features in the target image are extracted, the method further includes: determining multiple second sizes of feature blocks that can be processed in a single thread block (TB), and generating, based on the second sizes, graphics processor kernel functions for feature block division; wherein the minimum value of the second size is the first size of a feature block that can be processed by a single thread warp, the second sizes are multiples of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in a TB.

A TB contains one or more Warps, so the minimum value of the second size is the first size of the feature block that can be processed by a single Warp. The second size can also be a multiple of the first size, where the multiple is a power of 2, for example 2 or 4. The specific values of M_TB and N_TB in Table 2 are multiples of the first size.

In some GPUs, a TB can have up to 1024 threads, that is, 32 Warps of 32 threads each. However, computation is usually more efficient when the number of threads in a TB is between 128 and 512, so the maximum number of Warps is often not used. To preserve GPU computing efficiency, at most 8 Warps are run in one TB, that is, the maximum value of M_TB×N_TB is 2×4 or 4×2, as shown in Table 2.

M_TB \ N_TB	1	2	4
1	1x1	1x2	1x4
2	2x1	2x2	2x4
4	4x1	4x2	\

Table 2 Feature block size at TB level. The numbers in the table represent multiples of the corresponding Warp-level feature block size.

In the embodiment of the present disclosure, all second sizes of the feature blocks can be traversed in the TB dimension, and graphics processor kernel functions (kernels) can then be generated based on the second sizes. The kernels obtained in this way can be applied to the division of image features of various sizes and therefore have high universality.

As mentioned above, in a single Warp, the size of the smallest matrix unit that can be processed is the size of the smallest matrix unit that can be processed by the MMA instruction. For example, the smallest matrix unit processed by an MMA instruction is 16×8 in the M×N dimensions and 8 in the K dimension. The possible sizes of the K dimension in a single TB therefore include k8, k16, k32, and so on, as shown in Table 3.
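The TB-level multiples of Table 2 can be enumerated with a short Python sketch. The limit of 8 Warps per TB comes from the text above; the function names tb_multiples and tb_tile, and the example Warp-level size, are assumptions for illustration.

```python
def tb_multiples(max_warps=8, factors=(1, 2, 4)):
    """List (M_TB, N_TB) multiples whose product does not exceed max_warps."""
    return [(m, n) for m in factors for n in factors if m * n <= max_warps]


def tb_tile(warp_tile, multiple):
    """Scale a Warp-level (M, N) size by a TB-level (M_TB, N_TB) multiple."""
    return (warp_tile[0] * multiple[0], warp_tile[1] * multiple[1])


multiples = tb_multiples()
print(multiples)
print(tb_tile((64, 32), (2, 4)))  # example Warp-level size chosen for illustration
```

The 4×4 combination is excluded because it would require 16 Warps, which matches the empty cell of Table 2.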

K	8	16	32

Table 3 Block size of K dimension

In the embodiment of the present disclosure, the third sizes of the blocks can be traversed in the K dimension of the feature block, and graphics processor kernel functions (kernels) can then be generated based on the various possible third sizes. The kernels obtained in this way can be applied to the division of image features of various sizes and therefore have high universality.

Please refer to FIG. 5, which is a schematic diagram of an application scenario of the image processing method of the embodiment of the present disclosure. In this application scenario, the index positions of the feature values in the 3×3 convolution kernel, namely F(0,0), F(0,1), F(0,2), F(1,0), F(1,1), F(1,2), F(2,0), F(2,1), and F(2,2), are pre-calculated and placed in the SMem. Threads T0, T1, T2, and T3 then read the pre-calculated indexes from the SMem in turn; here, only F(0,0), F(0,1), F(0,2), and F(1,0) are illustrated. Each thread reads the feature values of 2 channels at the same height and width position. As shown in FIG. 5, C0 represents channel 0 and C1 represents channel 1: T0 reads the feature values of the two channels at F(0,0), T1 reads those at F(0,1), T2 reads those at F(0,2), and T3 reads those at F(1,0), and each thread reads the corresponding convolution kernel data into its own registers. One MMA instruction can be executed by the four threads T0, T1, T2, and T3. After the four threads have read the data of the 8 channels shown in the figure, the MMA instruction is executed once. No invalid zero data needs to be padded for this execution, which improves the utilization of the Tensor Core.
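The way the kernel positions are distributed over the four threads across successive MMA executions can be sketched in Python as follows. This models only the bookkeeping described for FIG. 5 and is not GPU code; the function name mma_schedule is hypothetical.

```python
def mma_schedule(kernel_h=3, kernel_w=3, num_threads=4):
    """Group kernel positions F(r, c) into per-MMA steps, one position per thread."""
    positions = [(r, c) for r in range(kernel_h) for c in range(kernel_w)]
    steps = []
    for start in range(0, len(positions), num_threads):
        chunk = positions[start:start + num_threads]
        # Thread Tt handles the t-th position of this step (both channels).
        steps.append({f"T{t}": pos for t, pos in enumerate(chunk)})
    return steps


schedule = mma_schedule()
for i, step in enumerate(schedule):
    print(f"MMA {i}:", step)
```

For a 3×3 kernel and four threads, the first two steps are fully occupied and the last step uses only T0, which is the step that needs zero padding.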

It can be seen from FIG. 5 that, when 4 threads perform the image processing, the 4 threads need to execute 3 MMA instructions in total to traverse the 3×3 convolution kernel. The first execution processes the two channels at F(0,0), F(0,1), F(0,2), and F(1,0), a total of 8 channels of data, which fills the minimum K dimension of 8 of the MMA instruction. The second execution processes the two channels at F(1,1), F(1,2), F(2,0), and F(2,1), again a total of 8 channels of data, which fills the minimum K dimension of 8. The third execution processes the two channels at F(2,2); this is only 2 channels of data, which cannot fill the minimum K dimension of 8, so 6 zeros are padded to make up the difference. By contrast, in the operation method of FIG. 3, 6 zeros need to be padded for each execution of the MMA instruction and a total of 9 MMA instructions need to be executed, that is, a total of 54 zeros need to be padded. Compared with the technique in FIG. 3, the image processing method provided by the embodiment of the present disclosure therefore reduces the number of invalid padded zeros and improves the utilization of the Tensor Core when processing large images with few channels.
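The padding arithmetic in this comparison can be checked with a small Python sketch. Both function names are hypothetical, and the packed count assumes that the channel data of all positions can be laid out contiguously along the K dimension, as in the 2-channel example above.

```python
import math


def padded_zeros_per_position(num_positions, channels, k=8):
    """One MMA per kernel position; each pads its channels up to k (needs channels <= k)."""
    return num_positions, num_positions * (k - channels)


def padded_zeros_packed(num_positions, channels, k=8):
    """Channels of several positions share the K dimension of one MMA."""
    total = num_positions * channels
    num_mma = math.ceil(total / k)
    return num_mma, num_mma * k - total


baseline = padded_zeros_per_position(9, 2)   # (MMA count, padded zeros)
packed = padded_zeros_packed(9, 2)
print(baseline, packed)
```

For a 3×3 kernel with 2 channels, this gives 9 MMA executions with 54 padded zeros for the per-position approach and 3 MMA executions with 6 padded zeros for the packed approach, in agreement with the figures stated above.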

It can be understood that the above method embodiments mentioned in this disclosure can be combined with each other to form combined embodiments without departing from their principles and logic; due to space limitations, these are not described again in this disclosure. Those skilled in the art can understand that, in the above methods of the specific implementations, the specific execution order of the steps should be determined according to their functions and possible internal logic.

In addition, the present disclosure also provides image processing devices, electronic equipment, computer-readable storage media, and programs, all of which can be used to implement any image processing method provided in the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section; they are not repeated here.

FIG. 6 shows a block diagram of an image processing device according to an embodiment of the present disclosure. As shown in FIG. 6, the device 60 includes:

An extraction unit 61, configured to extract image features in the target image to obtain a feature map for characterizing image features;

An index calculation unit 62, configured to calculate the index position of the feature value in the feature map and/or the convolution kernel through a plurality of first threads;

A convolution processing unit 63, configured to control multiple second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and to perform convolution processing using the read feature values to obtain a convolution feature, the convolution feature being used to characterize the extraction result of the image feature.

In a possible implementation, the convolution processing unit is configured to arrange the read feature values in the channel K dimension of the matrix multiply-add (MMA) instruction, and to perform the matrix multiply-add operation to obtain the convolution features.

In a possible implementation manner, the device further includes:

In some embodiments, the functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the methods described in the above method embodiments. For their specific implementation and technical effects, refer to the descriptions of the above method embodiments; for brevity, they are not repeated here.

Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor. Computer readable storage media may be volatile or nonvolatile computer readable storage media.

An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

An embodiment of the present disclosure also provides a computer program product, including computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code. When the computer-readable code runs in a processor of an electronic device, the processor in the electronic device executes the above method.

Electronic devices may be provided as terminals, servers, or other forms of devices.

FIG. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to FIG. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802 .

The memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 can be implemented by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.

The power supply component 806 provides power to various components of the electronic device 800 . Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 800 .

The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 804 or sent via communication component 816 . In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.

Sensor component 814 includes one or more sensors for providing status assessments of various aspects of electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800. Sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. Sensor component 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.

In an exemplary embodiment, there is also provided a non-volatile computer-readable storage medium, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.

FIG. 8 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, electronic device 1900 may be provided as a server. Referring to FIG. 8 , electronic device 1900 includes processing component 1922 , which further includes one or more processors, and a memory resource represented by memory 1932 for storing instructions executable by processing component 1922 , such as application programs. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.

Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface-based operating system introduced by Apple Inc. (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.

The present disclosure can be a system, method and/or computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.

A computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVD), memory sticks, floppy disks, mechanically encoded devices such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or electrical signals transmitted through wires.

Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .

Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, field programmable gate array (FPGA), or programmable logic array (PLA), can be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer readable program instructions to implement various aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine such that when executed by the processor of the computer or other programmable data processing apparatus , producing an apparatus for realizing the functions/actions specified in one or more blocks in the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices and/or other devices to work in a specific way, so that the computer-readable medium storing instructions includes An article of manufacture comprising instructions for implementing various aspects of the functions/acts specified in one or more blocks in flowcharts and/or block diagrams.

It is also possible to load computer-readable program instructions into a computer, other programmable data processing device, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process , so that instructions executed on computers, other programmable data processing devices, or other devices implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that includes one or more executable instructions. In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or may be implemented by a combination of dedicated hardware and computer instructions.

The computer program product can be specifically realized by means of hardware, software or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).

Having described various embodiments of the present disclosure above, the foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principle of each embodiment, practical application or improvement of technology in the market, or to enable other ordinary skilled in the art to understand each embodiment disclosed herein.

Claims

An image processing method, characterized in that, comprising:

Extracting the image features in the target image to obtain a feature map used to characterize the image features;

Computing index positions of feature values in feature maps and/or convolution kernels by multiple first threads;

controlling a plurality of second threads to read feature maps and/or feature values in convolution kernels according to the index positions, and perform convolution processing using the read feature values to obtain convolution features, and the convolution The feature is used to characterize the extraction result of the image feature.
2. The method according to claim 1, wherein performing convolution processing using the read feature values to obtain the convolution feature comprises:

arranging the read feature values along the channel (K) dimension of a matrix multiply-add (MMA) instruction, and performing the matrix multiply-add operation to obtain the convolution feature.
3. The method according to claim 1 or 2, wherein calculating, by the plurality of first threads, the index positions of the feature values in the feature map and/or the convolution kernel comprises:

controlling at least two threads, in the order of their thread identifiers (IDs), to respectively calculate the index positions of the feature values in the feature map and/or the convolution kernel.
4. The method according to any one of claims 1-3, wherein, after the index positions of the feature values in the feature map and/or the convolution kernel are calculated by the plurality of first threads, the method further comprises:

loading the feature values at the index positions into shared memory (SMem) for use by the threads.
5. The method according to any one of claims 1-4, wherein controlling the plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions comprises:

controlling the plurality of second threads to read the feature values at the index positions into their respective registers;

and wherein performing convolution processing using the read feature values to obtain the convolution feature comprises:

performing convolution processing using the feature values in the registers to obtain the convolution feature.
6. The method according to any one of claims 1-5, wherein, after the image features are extracted from the target image, the method further comprises:

determining a plurality of first sizes of feature blocks that can be processed by a single thread warp (Warp), and generating, based on the first sizes, a graphics processor kernel function for dividing feature blocks; wherein the maximum value of the first size is determined according to the register capacity, the minimum value of the first size is the size of the smallest matrix unit computed by the matrix multiply-add (MMA) instruction, and each of the first sizes is a multiple of the minimum value.
7. The method according to any one of claims 1-6, wherein, after the image features are extracted from the target image, the method further comprises:

determining a plurality of second sizes of feature blocks that can be processed by a single thread block (TB), and generating, based on the second sizes, a graphics processor kernel function for dividing feature blocks; wherein the minimum value of the second size is the first size of a feature block that can be processed by a single warp, the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit on the number of threads in one TB.
8. An image processing device, characterized by comprising:

an extraction unit, configured to extract image features from a target image to obtain a feature map representing the image features;

an index calculation unit, configured to calculate, by a plurality of first threads, index positions of feature values in the feature map and/or a convolution kernel; and

a convolution processing unit, configured to control a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and to perform convolution processing using the read feature values to obtain a convolution feature, wherein the convolution feature represents an extraction result of the image features.
9. An electronic device, characterized by comprising:

a processor; and

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1-7.
10. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1-7.
11. A computer program product, comprising computer-readable code, or a non-volatile computer-readable storage medium carrying computer-readable code, wherein, when the computer-readable code runs in a processor of an electronic device, the processor in the electronic device implements the method according to any one of claims 1-7.
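
Illustrative sketch (not part of the claims). The split between index calculation and feature-value reading recited in the method claims can be pictured with a small CPU-only model. The following is a minimal sketch under stated assumptions, not the disclosed GPU implementation: plain Python loops stand in for the first and second threads, the stride is fixed at 1, the data layouts are chosen for simplicity, and every function name (compute_indices, gather_and_mma, direct_conv) is hypothetical.

```python
# Sketch only: plain Python loops stand in for GPU threads.
# All function names and data layouts here are hypothetical.

def compute_indices(C, H, W, R, S, pad):
    """Phase 1 ("first threads"): one flat feature-map offset per (output pixel, k).

    k runs over the channel/kernel-window dimension K = C*R*S.
    An offset of -1 marks a zero-padding position. Stride is assumed to be 1.
    """
    out_h = H + 2 * pad - R + 1
    out_w = W + 2 * pad - S + 1
    table = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = []
            for c in range(C):
                for r in range(R):
                    for s in range(S):
                        y, x = oy + r - pad, ox + s - pad
                        inside = 0 <= y < H and 0 <= x < W
                        row.append((c * H + y) * W + x if inside else -1)
            table.append(row)
    return table, out_h, out_w


def gather_and_mma(feature, kernels, table):
    """Phase 2 ("second threads"): read by precomputed offset, accumulate over K."""
    out = []
    for row in table:
        # Values fetched by index; this list plays the role of the registers.
        values = [feature[i] if i >= 0 else 0.0 for i in row]
        out.append([sum(v * w for v, w in zip(values, kern)) for kern in kernels])
    return out  # shape: [out_h * out_w][number of kernels]


def direct_conv(feature, kernels, C, H, W, R, S, pad):
    """Reference: index arithmetic interleaved with the multiply-add."""
    out_h = H + 2 * pad - R + 1
    out_w = W + 2 * pad - S + 1
    out = []
    for oy in range(out_h):
        for ox in range(out_w):
            pixel = []
            for kern in kernels:
                acc = 0.0
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            y, x = oy + r - pad, ox + s - pad
                            if 0 <= y < H and 0 <= x < W:
                                acc += (feature[(c * H + y) * W + x]
                                        * kern[(c * R + r) * S + s])
                pixel.append(acc)
            out.append(pixel)
    return out


# Small integer-valued example so that floating-point sums are exact.
C, H, W, R, S, PAD = 2, 4, 4, 3, 3, 1
feature = [float(i % 7) for i in range(C * H * W)]
kernels = [[float((i + n) % 3 - 1) for i in range(C * R * S)] for n in range(2)]

table, out_h, out_w = compute_indices(C, H, W, R, S, PAD)
result = gather_and_mma(feature, kernels, table)
reference = direct_conv(feature, kernels, C, H, W, R, S, PAD)
```

In this model the index table depends only on the tensor shapes, so it is built once and the inner multiply-add loop performs no address arithmetic. That is the kind of saving the claimed thread split appears to target, though the sketch makes no claim about actual GPU performance.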