CN113378863A - Image processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113378863A
Authority
CN
China
Prior art keywords
convolution
feature
threads
read
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110779002.5A
Other languages
Chinese (zh)
Other versions
CN113378863B (en)
Inventor
刘宇玺
Current Assignee
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd filed Critical Shanghai Sensetime Technology Development Co Ltd
Priority to CN202110779002.5A priority Critical patent/CN113378863B/en
Publication of CN113378863A publication Critical patent/CN113378863A/en
Priority to PCT/CN2022/078439 priority patent/WO2023279740A1/en
Application granted granted Critical
Publication of CN113378863B publication Critical patent/CN113378863B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining


Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium. The method includes: extracting image features from a target image to obtain a feature map representing the image features; calculating, by a plurality of first threads, index positions of feature values in the feature map and/or a convolution kernel; and controlling a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and performing convolution processing using the read feature values to obtain convolution features, where the convolution features represent the extraction result of the image features. Embodiments of the disclosure can improve the efficiency of convolution operations in image processing.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Graphics Processing Units (GPUs) are widely used as hardware accelerators in the field of high-performance computing. In recent years, GPUs have also become ubiquitous in Artificial Intelligence (AI), particularly in deep learning, where the massive amounts of data processed during training and inference are accelerated by the GPU.
The features of an image can be represented in matrix form, where each value in the matrix represents a pixel at the corresponding position in the image, and convolution over the matrix can be realized through matrix multiply-accumulate operations. However, in the related art, there is room to improve the efficiency of the convolution operation.
Disclosure of Invention
The present disclosure proposes an image processing technical solution.
According to an aspect of the present disclosure, there is provided an image processing method including:
extracting image features in the target image to obtain a feature map for representing the image features;
calculating index positions of feature values in the feature map and/or the convolution kernel through a plurality of first threads;
and controlling a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and performing convolution processing using the read feature values to obtain convolution features, where the convolution features are used to represent the extraction result of the image features.
In a possible implementation manner, the performing convolution processing by using the read feature values to obtain convolution features includes:
and arranging the read feature values along the channel (K) dimension of a matrix multiply-add (MMA) instruction, and performing the matrix multiply-add operation to obtain the convolution features.
In one possible implementation, the calculating, by a plurality of first threads, index positions of feature values in the feature map and/or the convolution kernel includes:
and controlling each thread to calculate the index position of a feature value in the feature map and/or the convolution kernel in order of each thread's identifier (ID).
In one possible implementation, in the computing the index position of the feature value in the feature map and/or convolution kernel by the plurality of first threads, the method further includes:
and loading the feature value at the index position into the shared memory (SMem) for each thread to use.
In one possible implementation, the controlling the plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions includes:
controlling the plurality of second threads to read the feature values at the index positions into their respective registers;
and the performing convolution processing using the read feature values to obtain convolution features includes:
performing convolution processing using the feature values in the registers to obtain the convolution features.
In one possible implementation manner, after the extracting the image feature in the target image, the method further includes:
determining multiple first sizes of feature blocks that can be processed in a single thread bundle (Warp), and generating a GPU kernel function for feature-block division based on the first sizes; the maximum of the first size is determined by the register capacity, the minimum of the first size is the size of the smallest matrix unit computed by a matrix multiply-add instruction, and each of the multiple first sizes is a multiple of that minimum.
In one possible implementation manner, after the extracting the image feature in the target image, the method further includes:
determining multiple second sizes of feature blocks that can be processed in a single thread block (TB), and generating a GPU kernel function for feature-block division based on the second sizes; the minimum of the second size is a first size of a feature block that a single thread bundle (Warp) can process, each second size is a multiple of a first size, and the maximum of the second size is determined by the shared-memory capacity and the upper limit on the number of threads in a TB.
According to an aspect of the present disclosure, there is provided an image processing apparatus including:
the extraction unit is used for extracting image features in the target image to obtain a feature map for representing the image features;
an index calculation unit, configured to calculate, by a plurality of first threads, index positions of feature values in the feature map and/or a convolution kernel;
and a convolution processing unit, configured to control a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and to perform convolution processing using the read feature values to obtain convolution features, where the convolution features are used to represent the extraction result of the image features.
In a possible implementation, the convolution processing unit is configured to arrange the read feature values along the channel (K) dimension of a matrix multiply-add (MMA) instruction and to perform the matrix multiply-add operation to obtain the convolution features.
In a possible implementation, the index calculation unit is configured to control each thread to calculate the index position of a feature value in the feature map and/or the convolution kernel in order of each thread's identifier (ID).
In one possible implementation, the apparatus further includes:
a loading unit, configured to load the feature value at the index position into the shared memory (SMem) for each thread to use.
In a possible implementation, the convolution processing unit is configured to control the plurality of second threads to read the feature values at the index positions into their respective registers, and to perform convolution processing using the feature values in the registers to obtain the convolution features.
In one possible implementation, the apparatus further includes:
a first size determination unit, configured to determine multiple first sizes of feature blocks that can be processed in a single thread bundle (Warp), and to generate a GPU kernel function for feature-block division based on the first sizes; the maximum of the first size is determined by the register capacity, the minimum of the first size is the size of the smallest matrix unit computed by a matrix multiply-add instruction, and each of the multiple first sizes is a multiple of that minimum.
In one possible implementation, the apparatus further includes:
a second size determination unit, configured to determine multiple second sizes of feature blocks that can be processed in a single thread block (TB), and to generate a GPU kernel function for feature-block division based on the second sizes; the minimum of the second size is a first size of a feature block that a single thread bundle (Warp) can process, each second size is a multiple of a first size, and the maximum of the second size is determined by the shared-memory capacity and the upper limit on the number of threads in a TB.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the disclosed embodiments, index positions of feature values in the feature map and/or convolution kernel are calculated by a plurality of first threads, and a plurality of second threads are controlled to read the feature values in the feature map and/or the convolution kernel according to the index positions and to perform convolution processing with the read feature values to obtain convolution features. For matrix-multiplication-based convolution of the large-image, few-channel type, the small channel count leaves computing resources underused. By precomputing the index positions of the feature values through the plurality of first threads, the plurality of second threads can read feature values at known index positions for computation, instead of all threads computing indexes that point to the same point of the feature map and/or convolution kernel in each channel within a single computation. This reduces the zero padding inserted when data cannot be read because index positions are unknown, so every thread is fully used for the convolution computation, GPU resource waste is reduced, and the efficiency of the convolution operation is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a schematic diagram of feature map dimensions according to an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a convolution process of a matrix multiplication operation according to an embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of a convolution operation read eigenvalue process according to an embodiment of the present disclosure.
Fig. 4 shows a flow chart of an image processing method according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of an application scenario of an image processing method according to an embodiment of the present disclosure.
Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
Fig. 7 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Fig. 8 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
As described in the Background, the efficiency of the convolution operation in the related art needs improvement. To improve it, the convolution operation may be performed on a tensor computation unit (Tensor Core). The Tensor Core is a matrix multiply-accumulate computing unit that can complete multiple multiply-accumulate operations in one cycle and achieve very high computational performance; for an A100 GPU, the int8-precision performance of the Tensor Core can reach 624 TOPS. The growth of chip computing power mainly accelerates high-frequency, compute-intensive operators such as convolution and matrix multiplication, but it also raises the difficulty of using that computing power efficiently, i.e., how to implement convolution operators efficiently on the Tensor Core.
A typical Convolutional Neural Network (CNN) is composed of an input layer (Input), convolutional layers (Conv), activation functions (ReLU), fully connected layers (FC), and the like. Low-level features are gradually extracted from the original image data through the network, finally forming high-level semantic features.
A convolution operation extracts image information by performing multiply-add operations between an image's feature map (Feature Map) and a convolution kernel (Filter Kernel). Convolution is the most time-consuming operation in a neural network; most of the time overhead of a deep learning model is spent in convolution operators, so their performance has an important influence on overall program performance.
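The multiply-add process described above can be sketched as a direct convolution in plain Python. This is an illustrative sketch only, not the patent's GPU implementation; it assumes a single-channel feature map, stride 1, and no padding:

```python
def conv2d(fmap, kernel):
    """Direct convolution: slide the kernel over the feature map and
    multiply-accumulate at each position (single channel, stride 1)."""
    H, W = len(fmap), len(fmap[0])
    Kh, Kw = len(kernel), len(kernel[0])
    out = [[0.0] * (W - Kw + 1) for _ in range(H - Kh + 1)]
    for i in range(H - Kh + 1):
        for j in range(W - Kw + 1):
            acc = 0.0
            for r in range(Kh):
                for c in range(Kw):
                    acc += fmap[i + r][j + c] * kernel[r][c]
            out[i][j] = acc
    return out
```

Each output value is one multiply-accumulate reduction, which is exactly the kind of operation a Tensor Core accelerates in bulk.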
According to the size of the feature map (Feature Map), convolution types can be classified into a large-image few-channel type and a small-image many-channel type, as shown in Fig. 1. The three dimensions "length × width × channels" of a feature map are marked above it, such as 224 × 224 × 3 or 56 × 56 × 64. The convolution kernel size "length × width" is labeled below the kernel, such as 7 × 7 or 1 × 1.
Large-image few-channel type: this type of convolution typically occurs at the beginning of a neural network. It is characterized by a feature map that is large in length and width (i.e., a large image), such as 224, but has few channels, such as 3. This convolution type is memory-access intensive, and the computing capability of the Tensor Core cannot be fully exploited for it.
Small-image many-channel type: this type of convolution occurs in the middle and final stages of a neural network. It is characterized by a feature map with small length and width (i.e., a small image), such as 56/28/14, but many channels, such as 64/128/256. This convolution type is compute intensive and is well suited to exploiting the computing capability of the Tensor Core.
The convolution algorithm currently adopted on the Tensor Core is mainly the implicit matrix multiplication algorithm. Matrix multiplication can be expressed as the product of a matrix A of size M × K and a matrix B of size K × N: the k-th element of each row of A is multiplied by the k-th element of the corresponding column of B and the products are accumulated, yielding an M × N matrix C.
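The M × K by K × N product described above can be written as a minimal reference implementation; a sketch for illustration, not the implicit matrix multiplication kernel itself:

```python
def matmul(A, B):
    """C[m][n] accumulates A[m][k] * B[k][n] over the shared K dimension."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                C[m][n] += A[m][k] * B[k][n]
    return C
```

The inner k-loop is the multiply-accumulate that the MMA instruction executes on a small fixed-size tile.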
The programming model of a GPU often has three layers: the grid (Grid), the thread block (Thread Block, TB), and the thread (Thread). Thread blocks are the basic unit of task allocation, and enough thread blocks ensure that the GPU's hardware computing units are fully used. In the disclosed embodiments, the matrix multiplication uses a blocking (tiling) technique: the whole large matrix multiplication task is divided into many small matrix multiplication tasks, each of which is assigned to a different thread block for execution. For example, the matrix multiplication divides the task over the matrix C (of dimension M × N), and each thread block computes a small matrix block of Mtile × Ntile, where Mtile is the feature-block size in the M dimension and Ntile is the feature-block size in the N dimension. For example, when M = N = 1024, the M and N dimensions can be partitioned with Mtile = Ntile = 128, which yields (1024/128) × (1024/128) = 64 feature blocks in total. For a T4 GPU, at least 40 thread blocks are needed to occupy all of the GPU's computing units, the SMs (Streaming Multiprocessors).
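The block-count arithmetic above (64 tiles for M = N = 1024 with 128 × 128 tiles) can be checked with a small helper; the name `tile_count` is chosen here for illustration and is not from the patent:

```python
def tile_count(M, N, Mtile, Ntile):
    """Number of Mtile x Ntile feature blocks covering an M x N matrix C.
    Ceiling division handles sizes that do not divide evenly."""
    return ((M + Mtile - 1) // Mtile) * ((N + Ntile - 1) // Ntile)

blocks = tile_count(1024, 1024, 128, 128)  # one thread block per tile
```

With 64 blocks against the at-least-40 threshold the text gives for a T4, every SM receives at least one thread block of work.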
However, after a large-image few-channel convolution is converted to matrix multiplication, the matrix sizes are as shown in Fig. 2: the M dimension of matrix A and the N dimension of matrix B are very large, and the resulting feature blocks are sufficient to utilize the GPU's hardware resources, while the K dimension is small. On the GPU, convolution is executed by issuing Matrix Multiply and Accumulate (MMA) instructions; for example, the smallest matrix unit processed by one MMA instruction is 16 × 8 in the M × N dimensions and 8 in the K dimension. With a small K dimension, K can even be smaller than the K extent of a single MMA instruction.
Here, take the MMA instruction of a T4 GPU at FP16 precision as an example, whose M × N × K size is 16 × 8 × 8. When the number of channels of the convolution kernel is less than 8, 0s are padded at the end of the channel dimension so that it is filled to a multiple of 8 before being sent to the Tensor Core for execution. The more 0s padded at the end of a channel, the more redundant, wasted operations there are, and the lower the utilization of the Tensor Core.
Fig. 3 illustrates the problem that arises when a convolution kernel of size 3 × 3 with 2 channels is executed by an MMA instruction with K = 8. Because the GPU uses the Single Instruction Multiple Threads (SIMT) programming model, different threads belonging to the same thread bundle (Warp) execute the same instruction, so the indexes calculated by threads T0, T1, T2, and T3 all point to the first point F(0,0) of the convolution kernel in each channel. However, since the convolution kernel has only 2 channels while the K extent of the MMA instruction is 8, six 0s must be supplied in the K direction where no value can be read; since each thread reads 2 data items, only thread T0 reads real data, and the data read by T1, T2, and T3 are all invalid padding 0s. This leaves the Tensor Core utilization at only 25%, a serious waste of resources. Moreover, the sliding-window operation (Sliding Window) of convolution traverses all points of the kernel in order from F(0,0) and F(0,1) through F(2,2), taking 9 steps to traverse a 3 × 3 kernel.
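The padding waste described above can be quantified with a short sketch (an illustrative helper, not part of the patent):

```python
def k_padding(channels, mma_k=8):
    """Zeros appended in the K direction to fill one MMA K-slice, and the
    resulting fraction of useful (non-padding) values."""
    pad = (-channels) % mma_k
    utilization = channels / (channels + pad) if pad else 1.0
    return pad, utilization

# A 3x3 kernel with 2 channels: 6 zeros per K-slice and 25% utilization,
# while the sliding window still needs 3 * 3 = 9 traversal steps.
```

At 8 or more channels (a multiple of the K extent) no padding is needed and utilization returns to 100%, which is why the small-image many-channel type suits the Tensor Core so well.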
In the disclosed embodiments, index positions of feature values in the feature map and/or convolution kernel are calculated by a plurality of first threads, and a plurality of second threads are controlled to read the feature values in the feature map and/or the convolution kernel according to the index positions and to perform convolution processing with the read feature values to obtain convolution features. For matrix-multiplication-based convolution of the large-image, few-channel type, the small channel count leaves computing resources underused. By precomputing the index positions of the feature values through the plurality of first threads, the plurality of second threads can read feature values at known index positions for computation, instead of all threads computing indexes that point to the same point of the feature map and/or convolution kernel in each channel within a single computation. This reduces the zero padding inserted when data cannot be read because index positions are unknown, so every thread is fully used for the convolution computation, GPU resource waste is reduced, and the efficiency of the convolution operation is improved.
In one possible implementation, the image processing method may be performed by an electronic device such as a terminal device or a server, the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like, and the method may be implemented by a processor calling a computer readable instruction stored in a memory.
For convenience of description, in one or more embodiments of this specification the execution subject of the image processing method may be a Graphics Processing Unit (GPU), and the implementation of the method is described below taking the GPU as the execution subject. It should be understood that taking the GPU as the execution subject is only an exemplary illustration and should not be construed as limiting the method.
Fig. 4 illustrates a flowchart of an image processing method according to an embodiment of the present disclosure. As illustrated in Fig. 4, the method includes:
in step S11, image features in the target image are extracted, and a feature map for characterizing the image features is obtained.
In computing, an image may be represented as a matrix of pixel values, so analyzing the image amounts to analyzing the matrix that characterizes those pixel values.
The image feature may be used to represent pixel values of an image, and the image feature may be a matrix composed of the pixel values of the image, or may be an image feature obtained after a plurality of convolution operations, which is not limited in this disclosure.
In step S12, the index positions of the feature values in the feature map and/or convolution kernel are calculated by a plurality of first threads.
A convolution kernel may be represented as a matrix with the three dimensions "length × width × channels"; the feature values in this matrix are the weights used during convolution, through which the convolution operation can extract the desired image features and suppress others. The feature map is likewise represented as a matrix with length, width, and channel dimensions, whose feature values represent pixel values of the image.
When data is stored in a storage space, it has an index position characterizing where it is stored. In the embodiments of the disclosure, feature values can be read from the storage space directly by index position for the convolution computation, without first expanding the feature map into a matrix, which saves memory resources.
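Reading a feature value directly by its index position presumes a known memory layout. As an illustrative sketch (the patent does not fix a layout; a contiguous row-major length × width × channel arrangement is assumed here):

```python
def flat_index(h, w, c, W, C):
    """Offset of element (h, w, c) in a contiguously stored feature map
    laid out row-major as (length, width, channel)."""
    return (h * W + w) * C + c
```

A thread holding such a precomputed index can load the value with a single memory access, with no im2col-style expansion of the feature map into a matrix.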
For convolutions with a large image and a small number of channels, in order to fully utilize computing resources despite the small channel count, a plurality of first threads may read feature values sequentially along the length and width directions of the feature map and/or convolution kernel, so that every thread participates in the index position computation. A thread order may therefore be preset as the order in which threads compute index positions along the length and width directions of the feature map and/or convolution kernel; the first threads compute index positions in parallel, each computing the index positions of different feature values.
For example, taking the calculation of the index position of the convolution kernel as an example, for 4 threads numbered 0,1, 2, and 3, the numbers can be taken as the thread order, and then thread No. 0 calculates the index position of the 1 st feature value in the length and width directions of the convolution kernel, thread No. 1 calculates the index position of the 2 nd feature value, thread No. 2 calculates the index position of the 3 rd feature value, and thread No. 3 calculates the index position of the 4 th feature value.
It should be noted that each time a thread reads a feature value, it can read the values on 2 channels. For example, for a convolution kernel with a length and width of 3 × 3 and 2 channels, when each thread handles 2 channels in the order of step S12, all feature values can be computed by executing the MMA instruction only 3 times, which is highly efficient.
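The arithmetic behind the "only 3 MMA executions" claim can be sketched as follows. This is an illustrative helper, not part of the patent's method; the function name and the default K dimension of 8 are assumptions taken from the surrounding text.

```python
import math

def mma_executions(kernel_h, kernel_w, channels, k_dim=8):
    """Number of MMA executions needed when feature values from all
    kernel positions are packed densely into the K dimension
    (k_dim values per instruction)."""
    total_values = kernel_h * kernel_w * channels
    return math.ceil(total_values / k_dim)

# A 3x3 kernel with 2 channels holds 18 values; packed 8 at a time
# into the K dimension, it needs ceil(18 / 8) = 3 MMA executions.
```

With 8 channels instead of 2, the same kernel would need 9 executions, which is why dense K-dimension packing matters most in the few-channel case.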
In step S13, the plurality of second threads are controlled to read feature values in the feature map and/or convolution kernel according to the index positions, and perform convolution processing using the read feature values to obtain convolution features, where the convolution features are used to characterize the extraction result of the image features.
After the index positions are determined, each second thread can, during calculation, directly read each feature value in the feature map and/or convolution kernel according to its index position. With the index position known, the second thread can read the feature value directly as a parameter of the convolution calculation instruction, so the values required in each dimension of the instruction can be packed as fully as possible, reducing the cases where zeros must be padded.
After the feature values are read from the storage space, convolution processing can be performed with them to obtain the convolution features. The specific convolution processing may be matrix-multiplication-based convolution, which is particularly well suited to large-image, few-channel convolutions and offers high processing efficiency.
It should be noted that the first thread and the second thread may be the same thread or different threads, and it is understood that "first" and "second" in the embodiments of the present disclosure are used for distinguishing the described objects, and should not be understood as other limitations to the order in which the objects are described, indicating or implying relative importance.
In the disclosed embodiments, index positions of feature values in the feature map and/or convolution kernel are calculated by a plurality of first threads, and a plurality of second threads are controlled to read the feature values in the feature map and/or convolution kernel according to the index positions and perform convolution processing with the read feature values to obtain the convolution features. For matrix-multiplication convolution of the large-image, few-channel type, the small channel count would otherwise leave computing resources idle. Pre-computing the index positions of the feature values through the plurality of first threads allows the plurality of second threads, with the index positions known, to read feature values from the feature map and/or convolution kernel for computation, rather than having the indices computed by all threads point, in a single computation, to the same position of the feature map and/or convolution kernel in every channel. This reduces the cases where zeros must be padded because data cannot be read without known index positions, so every thread is fully used for the convolution calculation, GPU resource waste is reduced, and the efficiency of the convolution operation is improved.
In a possible implementation manner, performing convolution processing with the read feature values to obtain convolution features includes: arranging the read feature values along the channel dimension K of a matrix multiply-add (MMA) instruction, and performing the matrix multiply-add operation to obtain the convolution features.
In matrix-multiplication convolution, the matrix multiplication is performed through the MMA instruction, whose minimum size along the channel dimension K is 8; that is, one MMA execution computes 8 feature values along the K dimension. Therefore, with the index positions of the feature map and/or convolution kernel feature values known, the feature values of 8 channels can be read sequentially according to the index positions as the values computed by the MMA instruction, so the computing capability of the second threads is fully utilized, GPU resource waste is reduced, and the efficiency of the convolution operation is improved.
In one possible implementation, calculating the index position of the feature value in the feature map and/or convolution kernel by a plurality of first threads includes: and determining the index position of each characteristic value in the storage space according to the position of the characteristic value in the characteristic diagram and/or the convolution kernel and the data arrangement rule of each characteristic value in the characteristic diagram and/or the convolution kernel in the storage space.
When data is stored in a storage space, certain arrangement rules are followed. For a feature matrix, there are two common arrangement rules, NCHW and NHWC, where N represents the number of feature maps, C the channel, H the height (length) of the matrix, and W the width. Under the NCHW rule the values of the matrix are arranged in the priority order [N, C, H, W]; under the NHWC rule they are arranged in the priority order [N, H, W, C].
Taking the convolution kernel as an example, the channel, the height, and the width of each feature value of the convolution kernel are known, that is, the position of the feature value in the convolution kernel is known, so that, given the data arrangement rule of the feature value of the convolution kernel in the data storage space, the index position of each feature value in the storage space can be determined according to the position of each feature value in the convolution kernel and the data arrangement rule in the data storage space.
In the embodiment of the present disclosure, the index position of each feature value in the storage space is determined according to the position of the feature value in the feature map and/or the convolution kernel and the data arrangement rule of each feature value in the feature map and/or the convolution kernel in the storage space. This makes it possible to accurately calculate the index position of the feature value in the feature map and/or convolution kernel to be read by each thread.
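The mapping from a position in the feature map or convolution kernel to its flat index in the storage space can be sketched for the two arrangement rules above. This is a minimal illustration of the standard NCHW/NHWC index arithmetic; the function names are assumptions, not identifiers from the patent.

```python
def flat_index_nchw(n, c, h, w, C, H, W):
    # NCHW rule: values arranged in priority order [N, C, H, W],
    # so W varies fastest and N slowest.
    return ((n * C + c) * H + h) * W + w

def flat_index_nhwc(n, c, h, w, C, H, W):
    # NHWC rule: values arranged in priority order [N, H, W, C],
    # so C varies fastest - channel values at one (h, w) position
    # are adjacent in memory.
    return ((n * H + h) * W + w) * C + c
```

Under NHWC, the values of consecutive channels at the same length-width position are adjacent, which is the property a thread exploits when it reads the values of 2 channels per feature value.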
In one possible implementation, the calculating, by a plurality of first threads, index positions of feature values in the feature map and/or the convolution kernel includes: and controlling each thread to respectively calculate the index position of the characteristic value in the characteristic diagram and/or the convolution kernel according to the sequence of the identification ID of each thread.
Taking the calculation of the convolution kernel's feature values as an example, each thread in the GPU has an identification ID used to distinguish different threads. The IDs may be, for example, T0, T1, T2, and T3, and the thread order may be [T0, T1, T2, T3] with the numbers arranged from small to large. It should be noted that this order is cyclic: after T3 the sequence restarts from T0, and so on, until the index positions of all feature values in the convolution kernel have been calculated.
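The cyclic assignment described above can be sketched as a round-robin mapping from kernel positions (in row-major order over length × width) to thread IDs. This is an illustrative model, not code from the patent; the function name is an assumption.

```python
def assign_positions(num_threads, kernel_h, kernel_w):
    """Cyclic thread order: position i (row-major over length x width)
    is computed by thread i % num_threads, wrapping around after the
    last thread until every position is covered."""
    assignment = {t: [] for t in range(num_threads)}
    for pos in range(kernel_h * kernel_w):
        assignment[pos % num_threads].append((pos // kernel_w, pos % kernel_w))
    return assignment

# 4 threads over a 3x3 kernel: T0 handles positions (0,0), (1,1), (2,2);
# T3 handles (1,0) and (2,1), matching the F(1,0) read in fig. 5.
```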
In the embodiment of the present disclosure, since the threads themselves have IDs, the threads are controlled to calculate the index positions of the feature values in the feature map and/or the convolution kernel respectively according to the order of the identification IDs of the threads, so that the threads calculate the index positions of the feature values in parallel, thereby improving the efficiency of calculating the index positions.
In one possible implementation, the calculating, by the plurality of first threads, of the index positions of feature values in the feature map and/or convolution kernel further includes: loading the values of the index positions into a shared memory (SMem) for use by each thread.
Although the size of the shared memory is limited, it offers higher read/write speed and bandwidth. Considering that the values of the index positions (index values) occupy little memory, they can be stored in SMem to improve their read/write efficiency, thereby improving the efficiency of the convolution operation and, in turn, of the image processing.
In one possible implementation, the controlling the plurality of second threads to read the feature value in the convolution kernel according to the index position includes: controlling the plurality of second threads to read the characteristic values at the index positions into respective registers; performing convolution processing by using the read characteristic value to obtain convolution characteristics, including: and carrying out convolution processing by utilizing each characteristic value in the register to obtain convolution characteristics.
When the Tensor Core is used, the MMA instruction is often used in conjunction with the ldmatrix instruction. Generally, the data of matrices A and B required by the MMA instruction are read from Global Memory (GMem) into SMem, and the ldmatrix instruction then reads them into the registers of the respective threads according to a specific matrix shape. Putting data into SMem enables data sharing between threads and reduces the latency of reading from GMem; however, for large-image, few-channel matrix-multiplication convolution there is little data to share, and buffering in SMem adds one round of SMem read/write latency.
Therefore, in the embodiments of the present disclosure, the data required by the MMA instruction is read directly into the registers of the respective threads, which exploits the characteristics of large-image, few-channel convolution and reduces data-read latency.
In the embodiments of the present disclosure, the matrices involved in the convolution operation may be divided into blocks: the feature matrix is divided into a plurality of feature blocks, which are then operated on in parallel to improve convolution efficiency. The block sizes along three dimensions, the Warp level, the thread block (TB) level, and the K dimension, are described in detail below.
In one possible implementation manner, after the extracting of the image features in the target image, the method further includes: determining various first sizes of feature blocks that can be processed in a single thread bundle Warp, and generating a graphics processor core function for dividing feature blocks based on the first sizes; the maximum value of the first size is determined according to the capacity of a register, the minimum value of the first size is the size of the minimum matrix unit calculated by a matrix multiply-add operation instruction, and the values of the various first sizes are multiples of the minimum value.
In the GPU, convolution is performed by issuing Matrix Multiply and Accumulate (MMA) instructions. Illustratively, the minimum matrix unit handled by one MMA instruction is 16 × 8 in the M × N dimensions and 8 in the K dimension.
Within a single Warp, the minimum matrix unit that the Warp can process is the minimum matrix unit that the MMA instruction can process, and since the instruction's work accumulates in powers of two, the feature block that a single Warp can process is the minimum matrix unit scaled by a power of two: N may be 8, 16, 32, or 64, and M may be 16, 32, 64, or 128. The resulting M × N values are shown in Table 1.
Considering the limit of register capacity, the feature block size cannot grow without bound; its maximum is determined by the register capacity. As shown in Table 1, a 128 × 64 Warp feature block cannot be stored in the registers, so the maximum value of the first size of a feature block is 128 × 32 or 64 × 64.
| M\N | 8     | 16     | 32     | 64    |
|-----|-------|--------|--------|-------|
| 16  | 16x8  | 16x16  | 16x32  | 16x64 |
| 32  | 32x8  | 32x16  | 32x32  | 32x64 |
| 64  | 64x8  | 64x16  | 64x32  | 64x64 |
| 128 | 128x8 | 128x16 | 128x32 | \     |

Table 1. Feature block sizes at the Warp level
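The candidate Warp-level tile sizes of Table 1 can be enumerated programmatically. This is an illustrative sketch: the element-count cap of 4096 is an assumption chosen so that, as in Table 1, 64 × 64 and 128 × 32 fit in the registers but 128 × 64 does not.

```python
def warp_tile_sizes(min_m=16, min_n=8, max_m=128, max_n=64, max_elems=4096):
    """Enumerate candidate Warp-level tile sizes: power-of-two
    multiples of the minimum MMA matrix unit (16x8), capped by an
    assumed register-capacity limit of max_elems = M*N per Warp."""
    sizes = []
    m = min_m
    while m <= max_m:
        n = min_n
        while n <= max_n:
            if m * n <= max_elems:
                sizes.append((m, n))
            n *= 2
        m *= 2
    return sizes

# Yields the 15 entries of Table 1 (the 4x4 grid minus 128x64).
```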
In the embodiments of the present disclosure, all first sizes of feature blocks can be traversed in the Warp dimension, and the graphics processor core function (kernel) is then generated based on these first sizes, so the resulting kernel is applicable to partitioning image features of various sizes and has high generality.
In one possible implementation manner, after the extracting of the image features in the target image, the method further includes: determining various second sizes of feature blocks that can be processed in a single thread block TB, and generating a graphics processor core function for dividing feature blocks based on the second sizes; the minimum value of the second size is the first size of a feature block that can be processed by a single thread bundle Warp, the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in the TB.
One TB may include one or more Warps, so the minimum value of the second size is the first size of the feature block that a single Warp can process, and the second size may also be a multiple of the first size, where the multiple is a power of two, i.e., 1, 2, or 4. M_TB and N_TB in Table 2 are the multiples of the first size.
In some GPUs, a TB may have at most 1024 threads, i.e., 16 Warps; however, computation is most efficient when the thread count is in the range of 128 to 512, so all 16 Warps are not run. To ensure GPU computing efficiency, a TB runs at most 8 Warps, i.e., the maximum value of M_TB × N_TB is 2 × 4 or 4 × 2, as shown in Table 2.
| M_TB\N_TB | 1   | 2   | 4   |
|-----------|-----|-----|-----|
| 1         | 1x1 | 1x2 | 1x4 |
| 2         | 2x1 | 2x2 | 2x4 |
| 4         | 4x1 | 4x2 | \   |

Table 2. Feature block sizes at the TB level. The numbers indicate multiples of the corresponding Warp-level feature block size.
In the embodiments of the present disclosure, all second sizes of feature blocks can be traversed in the TB dimension, and the graphics processor core function (kernel) is then generated based on these second sizes, so the resulting kernel is applicable to partitioning image features of various sizes and has high generality.
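The admissible TB-level multiples of Table 2 can be enumerated from the 8-Warp cap described above. This is an illustrative sketch; the function name is an assumption.

```python
def tb_tile_multiples(max_warps=8):
    """TB-level multiples (M_TB, N_TB) of the Warp tile: each factor
    is a power of two in {1, 2, 4}, and their product (the number of
    Warps per TB) is capped at max_warps to keep the thread count in
    the efficient 128-512 range, which excludes the 4x4 case."""
    return [(m, n) for m in (1, 2, 4) for n in (1, 2, 4) if m * n <= max_warps]

# Yields the 8 entries of Table 2 (the 3x3 grid minus 4x4).
```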
As described above, within a single Warp the minimum matrix unit that can be processed is that of the MMA instruction; for example, the minimum matrix unit of one MMA instruction is 16 × 8 in the M × N dimensions and 8 in the K dimension. The possible sizes of the K dimension within a single TB therefore include K = 8, 16, and 32, as shown in Table 3.
| K | 8 | 16 | 32 |
|---|---|----|----|

Table 3. Partition sizes in the K dimension
In the embodiments of the present disclosure, the third sizes of the partitions can be traversed in the K dimension of the feature blocks, and the graphics processor core function (kernel) is then generated based on the various possible third sizes, so the resulting kernel is applicable to partitioning image features of various sizes and has high generality.
Please refer to fig. 5, a schematic diagram of an application scenario of the image processing method according to an embodiment of the present disclosure. In this scenario, the index positions of the feature values in a 3 × 3 convolution kernel are pre-calculated: F(0,0), F(0,1), F(0,2), F(1,0), F(1,1), F(1,2), F(2,0), F(2,1), F(2,2), and put into SMem. Threads T0, T1, T2, and T3 then read the pre-calculated indexes from SMem in turn. Fig. 5 shows only F(0,0), F(0,1), F(0,2), and F(1,0), read in one traversal, where each thread reads the feature values of a 2-layer channel at the same length-width position: C0 denotes channel layer 0 and C1 channel layer 1, T0 reads the two channel values at F(0,0), T1 those at F(0,1), T2 those at F(0,2), and T3 those at F(1,0), and each thread then reads the corresponding convolution kernel data into its own register. One MMA instruction is executed jointly by the four threads T0-T3: after the four threads have together read the data of the 8 channels shown in the figure, the MMA instruction is executed once without padding invalid zero data, improving the utilization of the Tensor Core.
As can be seen from fig. 5, with 4 threads the MMA task needs to be executed 3 times to traverse the 3 × 3 convolution kernel. The first execution processes the two channels each of F(0,0), F(0,1), F(0,2), and F(1,0), 8 channels of data in total, filling the minimum K dimension of 8 of the MMA instruction; the second processes the two channels each of F(1,1), F(1,2), F(2,0), and F(2,1), again filling the K dimension of 8; the third processes the two channels of F(2,2), and since 2 channels of data cannot fill the minimum K dimension of 8, 6 zeros are padded. By contrast, in the operation manner of fig. 3, 6 zeros must be padded each time the MMA instruction is executed, and 9 MMA instructions are executed in total, i.e., 54 zeros are padded altogether. Clearly, compared with the technique of fig. 3, the image processing method provided by the embodiments of the present disclosure reduces the number of invalid padded zeros when processing large-image, few-channel images, and improves the utilization of the Tensor Core.
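The zero-padding comparison above can be checked with a short sketch. This is an illustrative model of the two execution manners, not code from the patent; the function name and parameters are assumptions.

```python
import math

def padded_zeros(kernel_h, kernel_w, channels, k_dim=8, packed=True):
    """Zeros padded into the K dimension over a full kernel traversal.
    packed=True: values from different spatial positions fill the K
    dimension together (the pre-computed-index method of fig. 5);
    packed=False: each MMA covers a single spatial position, padding
    k_dim - channels zeros every time (the manner of fig. 3)."""
    total = kernel_h * kernel_w * channels
    if packed:
        return math.ceil(total / k_dim) * k_dim - total
    return kernel_h * kernel_w * (k_dim - channels)

# 3x3 kernel, 2 channels: 6 zeros packed vs. 54 zeros unpacked.
```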
It is understood that the above method embodiments of the present disclosure can be combined with one another to form combined embodiments without departing from principle and logic; due to space limitations, details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the image processing methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the method section, which are not repeated here for brevity.
Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure, as shown in fig. 6, the apparatus 60 including:
the extraction unit 61 is configured to extract image features in the target image to obtain a feature map for representing the image features;
an index calculation unit 62 for calculating index positions of feature values in the feature map and/or the convolution kernel by a plurality of first threads;
and the convolution processing unit 63 is configured to control the plurality of second threads to read feature values in the feature map and/or the convolution kernel according to the index positions, and perform convolution processing by using the read feature values to obtain convolution features, where the convolution features are used to represent extraction results of the image features.
In a possible implementation manner, the convolution processing unit is configured to arrange the read feature values along the channel dimension K of the matrix multiply-add (MMA) instruction and perform the matrix multiply-add operation to obtain the convolution features.
In a possible implementation manner, the index calculation unit is configured to control each thread to calculate an index position of a feature value in the feature map and/or the convolution kernel respectively according to an order of the identification IDs of each thread.
In one possible implementation, the apparatus further includes:
and the loading unit is configured to load the values of the index positions into the shared memory SMem for use by each thread.
In a possible implementation manner, the convolution processing unit is configured to control the plurality of second threads to read the feature values at the index positions into respective registers, and perform convolution processing by using the feature values in the registers to obtain convolution features.
In one possible implementation, the apparatus further includes:
a first size determination unit, configured to determine multiple first sizes of feature blocks that can be processed in a single thread bundle Warp, and generate a graphics processor core function for feature block division based on the first sizes; the maximum value of the first size is determined according to the capacity of a register, the minimum value of the first size is the size of the minimum matrix unit calculated by a matrix multiply-add operation instruction, and the values of the various first sizes are multiples of the minimum value.
In one possible implementation, the apparatus further includes:
a second size determination unit, configured to determine multiple second sizes of feature blocks that can be processed in a single thread block TB, and generate a graphics processor core function for feature block division based on the second sizes; the minimum value of the second size is the first size of a feature block that can be processed by a single thread bundle Warp, the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in the TB.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 800 in accordance with an embodiment of the disclosure. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.
Referring to fig. 7, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 8 illustrates a block diagram of an electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 8, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical-user-interface-based operating system of Apple Inc. (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device, such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An image processing method, comprising:
extracting image features in the target image to obtain a feature map for representing the image features;
calculating index positions of feature values in the feature map and/or the convolution kernel through a plurality of first threads;
and controlling a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and performing convolution processing by using the read feature values to obtain convolution features, wherein the convolution features are used for representing an extraction result of the image features.
2. The method according to claim 1, wherein performing convolution processing by using the read feature values to obtain the convolution features comprises:
arranging the read feature values in the K-channel dimension of a matrix multiply-add (MMA) instruction, and performing a matrix multiply-add operation to obtain the convolution features.
3. The method of claim 1 or 2, wherein computing the index positions of the feature values in the feature map and/or the convolution kernel through the plurality of first threads comprises:
controlling each first thread to calculate the index positions of the feature values in the feature map and/or the convolution kernel in the order of each thread's identification (ID).
4. The method of any of claims 1-3, wherein, when computing the index positions of the feature values in the feature map and/or the convolution kernel through the plurality of first threads, the method further comprises:
loading the feature values at the index positions into a shared memory (SMem) for use by each thread.
5. The method of any of claims 1-4, wherein controlling the plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions comprises:
controlling the plurality of second threads to read the feature values at the index positions into respective registers;
and performing convolution processing by using the read feature values to obtain the convolution features comprises:
performing convolution processing by using the feature values in the registers to obtain the convolution features.
6. The method according to any one of claims 1-5, wherein, after extracting the image features in the target image, the method further comprises:
determining multiple first sizes of feature blocks that can be processed in a single thread bundle (Warp), and generating graphics processor kernel functions for dividing the feature blocks based on the first sizes; wherein the maximum value of the first size is determined according to the capacity of the registers, the minimum value of the first size is the size of the minimum matrix unit computed by a matrix multiply-add instruction, and each of the multiple first sizes is a multiple of the minimum value.
7. The method according to any one of claims 1-6, wherein, after extracting the image features in the target image, the method further comprises:
determining multiple second sizes of feature blocks that can be processed in a single thread block (TB), and generating graphics processor kernel functions for dividing the feature blocks based on the second sizes; wherein the minimum value of the second size is the first size of a feature block that can be processed by a single thread bundle (Warp), the second size is a multiple of the first size, and the maximum value of the second size is determined according to the capacity of the shared memory and the upper limit of the number of threads in the TB.
8. An image processing apparatus, comprising:
an extraction unit configured to extract image features in a target image to obtain a feature map representing the image features;
an index calculation unit configured to calculate index positions of feature values in the feature map and/or a convolution kernel through a plurality of first threads;
and a convolution processing unit configured to control a plurality of second threads to read the feature values in the feature map and/or the convolution kernel according to the index positions, and to perform convolution processing by using the read feature values to obtain convolution features, wherein the convolution features are used for representing an extraction result of the image features.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 7.
10. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 7.
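Claims 1-5 describe a two-phase, gather-then-multiply convolution: a first group of threads precomputes index positions, and a second group reads the feature values at those positions and reduces them against the kernel with matrix multiply-add. A minimal CPU sketch of that data flow is given below in plain Python; all names are hypothetical, and the sequential loops merely stand in for the GPU thread bundles, shared memory, and MMA instructions of the claims — this is an illustration of the dependency structure, not the patented implementation.

```python
# Illustrative sketch of the claimed two-phase scheme (hypothetical names).
# Phase 1 models the "first threads", phase 2 the "second threads".

def compute_index_positions(h, w, kh, kw):
    """Phase 1: for every output pixel and every kernel tap, precompute the
    flat index of the input feature value to read (valid convolution,
    stride 1). On a GPU these indices could be staged in shared memory."""
    oh, ow = h - kh + 1, w - kw + 1
    indices = []
    for oy in range(oh):
        for ox in range(ow):
            row = []
            for ky in range(kh):
                for kx in range(kw):
                    row.append((oy + ky) * w + (ox + kx))
            indices.append(row)
    return indices, oh, ow

def gather_and_multiply(feature_map, kernel, h, w, kh, kw):
    """Phase 2: read the feature values at the precomputed index positions
    and accumulate them against the flattened kernel, mimicking the
    gather-plus-multiply-add reduction over the K dimension."""
    indices, oh, ow = compute_index_positions(h, w, kh, kw)
    flat_kernel = [kernel[ky][kx] for ky in range(kh) for kx in range(kw)]
    out = []
    for row in indices:
        acc = 0.0
        for idx, kval in zip(row, flat_kernel):
            acc += feature_map[idx] * kval  # multiply-add per kernel tap
        out.append(acc)
    return out, oh, ow
```

Separating index computation from the reads mirrors the benefit the claims attribute to the first threads: the address arithmetic is done once and can be shared (on a GPU, via SMem as in claim 4), so the reading threads only follow ready-made offsets.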
CN202110779002.5A 2021-07-09 2021-07-09 Image processing method and device, electronic equipment and storage medium Active CN113378863B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110779002.5A CN113378863B (en) 2021-07-09 2021-07-09 Image processing method and device, electronic equipment and storage medium
PCT/CN2022/078439 WO2023279740A1 (en) 2021-07-09 2022-02-28 Image processing method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110779002.5A CN113378863B (en) 2021-07-09 2021-07-09 Image processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113378863A 2021-09-10
CN113378863B 2023-12-19

Family

ID=77581607

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110779002.5A Active CN113378863B (en) 2021-07-09 2021-07-09 Image processing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113378863B (en)
WO (1) WO2023279740A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023279740A1 (en) * 2021-07-09 2023-01-12 上海商汤智能科技有限公司 Image processing method and apparatus, and electronic device and storage medium
WO2024120244A1 (en) * 2022-12-08 2024-06-13 杭州阿里云飞天信息技术有限公司 Tensor processing method, device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105809114A (en) * 2016-02-29 2016-07-27 深圳市智美达科技股份有限公司 Face detection method and apparatus
CN110458280A (en) * 2019-07-15 2019-11-15 武汉魅瞳科技有限公司 A kind of convolutional neural networks accelerated method and system suitable for mobile terminal
CN110738316A (en) * 2018-07-20 2020-01-31 北京三星通信技术研究有限公司 Operation method and device based on neural network and electronic equipment
CN111831254A (en) * 2019-04-15 2020-10-27 阿里巴巴集团控股有限公司 Image processing acceleration method, image processing model storage method and corresponding device
CN111951268A (en) * 2020-08-11 2020-11-17 长沙大端信息科技有限公司 Parallel segmentation method and device for brain ultrasonic images

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3987433A4 (en) * 2019-06-21 2023-03-22 INTEL Corporation Generic modular sparse three-dimensional (3d) convolution design utilizing sparse 3d group convolution
CN110348574B (en) * 2019-07-17 2022-02-15 哈尔滨理工大学 ZYNQ-based universal convolutional neural network acceleration structure and design method
CN113378863B (en) * 2021-07-09 2023-12-19 上海商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN113378862B (en) * 2021-07-09 2023-12-19 上海商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023279740A1 (en) 2023-01-12
CN113378863B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN113378862B (en) Image processing method and device, electronic equipment and storage medium
CN109697734B (en) Pose estimation method and device, electronic equipment and storage medium
JP7125541B2 (en) Video restoration method and apparatus, electronics, and storage media
CN110837761B (en) Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN113378863B (en) Image processing method and device, electronic equipment and storage medium
CN111597029B (en) Data processing method and device, electronic equipment and storage medium
US20200294249A1 (en) Network module and distribution method and apparatus, electronic device, and storage medium
CN109145970B (en) Image-based question and answer processing method and device, electronic equipment and storage medium
KR20210090238A (en) Video processing method and apparatus, electronic device, and storage medium
CN114968594B (en) Task processing method, device, electronic equipment and storage medium
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN110543849A (en) detector configuration method and device, electronic equipment and storage medium
CN111694768B (en) Operation method, device and related product
CN112269595A (en) Image processing method, image processing device, computer equipment and storage medium
CN109635926B (en) Attention feature acquisition method and device for neural network and storage medium
CN111488964A (en) Image processing method and device and neural network training method and device
CN112988194B (en) Program optimization method and device based on equipment information, electronic equipment and storage medium
CN111695686A (en) Operation method, device and related product
CN111367669B (en) Method, device and medium for determining optimal operation path
CN111626398B (en) Operation method, device and related product
CN114005124A (en) Sampling method and device, electronic equipment and storage medium
CN111860796B (en) Operation method, device and related product
CN113159275A (en) Network training method, image processing method, device, equipment and storage medium
CN111311672A (en) Method and device for detecting gravity center of object, electronic equipment and storage medium
CN112734015B (en) Network generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40054562

Country of ref document: HK

GR01 Patent grant