CN109885407B - Data processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN109885407B
Authority
CN
China
Prior art keywords
data
data block
block
processed
thread
Prior art date
Legal status
Active
Application number
CN201910164371.6A
Other languages
Chinese (zh)
Other versions
CN109885407A (en)
Inventor
李秀红
梁云
颜深根
张衡
贾连成
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN201910164371.6A
Publication of CN109885407A
Application granted
Publication of CN109885407B
Legal status: Active

Landscapes

  • Image Processing (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

The embodiments of the disclosure disclose a data processing method and apparatus, an electronic device, and a storage medium. The method includes: determining, based on the size of the data to be processed and the size of a convolution kernel, a data reuse relationship of at least one second data block in a conversion result matrix corresponding to the data to be processed with respect to a first data block in the data to be processed; performing conversion processing on the data to be processed based on that data reuse relationship to obtain the conversion result matrix; and performing a matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.

Description

Data processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to deep learning technologies, and in particular, to a data processing method and apparatus, an electronic device, and a storage medium.
Background
Deep learning networks are widely applied in fields such as audio and image processing. The convolution operation is one of the most central operators in deep learning; it can effectively extract feature information and is of great importance in a deep learning network. Convolution operators are mainly implemented by two classes of algorithms: the first class consists of general algorithms based on matrix multiplication, and the second class is based on fast algorithms, which can improve the speed of the convolution operation; however, the speed of convolution operations based on fast algorithms still needs to be further improved.
Disclosure of Invention
The embodiment of the disclosure provides a data processing technology.
According to an aspect of the embodiments of the present disclosure, there is provided a data processing method, including:
determining a data reuse relation of at least one second data block in a conversion result matrix corresponding to the data to be processed to a first data block in the data to be processed based on the size of the data to be processed and the size of a convolution kernel, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
performing conversion processing on the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to a first data block in the data to be processed to obtain the conversion result matrix;
and performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, in any of the above method embodiments of the present disclosure, the first data block includes a row of data of the image in one channel.
Optionally, in any one of the method embodiments of the present disclosure, the second data block is a data block with a size of R × P, where R is a number of columns of the convolution kernel, and P is a number of times of horizontal sliding in each channel when the convolution kernel is used to perform convolution operation on the data to be processed.
Optionally, in any one of the method embodiments of the present disclosure, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block in the channel to which the first data block belongs.
Optionally, in any one of the method embodiments of the present disclosure, if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, where Q is the number of vertical slides of the convolution kernel in each channel, the number of second data blocks corresponding to the first data block is R;
and if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, where S is the number of rows of the convolution kernel.
Optionally, in any one of the above method embodiments of the present disclosure, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Optionally, in any method embodiment of the present disclosure, the converting, based on a data reuse relationship of at least one second data block in the conversion result matrix to a first data block in the to-be-processed data, the to-be-processed data to obtain the conversion result matrix includes:
allocating a thread block to each first data block in the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to the first data block in the data to be processed;
reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block, and writing the read first data block into at least one second data block which has reuse relation to the first data block in the conversion result matrix.
Optionally, in any one of the method embodiments of the present disclosure, the allocating a thread block to each first data block in the to-be-processed data based on a data reuse relationship between a plurality of second data blocks in the conversion result matrix and the first data block in the to-be-processed data includes:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in a plurality of thread blocks;
and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
Optionally, in any one of the method embodiments of the present disclosure, the determining, based on a number of each thread block in the multiple thread blocks, a corresponding first data block of each thread block in the to-be-processed data includes:
and determining an image corresponding to the first thread block in the data to be processed, a channel in the image and a row corresponding to the first data block in the channel based on the serial number of the first thread block in the thread blocks, the number of channels C contained in each image in the data to be processed and the number of thread blocks required by each channel.
Optionally, in any of the method embodiments of the present disclosure above, the method further includes:
storing the read first data block into a shared memory by using the thread block allocated to the first data block;
the writing the read first data block into at least one second data block in a conversion result matrix having a reuse relationship with the first data block comprises:
reading at least a portion of the first data block stored in the shared memory, and writing the read data into a second data block in the conversion result matrix.
Optionally, in any of the above method embodiments of the present disclosure, each of the thread blocks includes T threads;
before storing the read first data block into the shared memory by using the thread block allocated to the first data block, the method further includes:
and determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
Optionally, in any one of the method embodiments of the present disclosure, the reading, by using a thread block allocated to the first data block, the first data block from the data to be processed, and writing the read first data block into at least one second data block having a reuse relationship with the first data block in the conversion result matrix includes:
and performing sliding reading on the first data block through the thread block allocated to the first data block, and writing the data read each time into one row of each second data block corresponding to the first data block.
According to another aspect of the embodiments of the present disclosure, there is provided a data processing method, including:
reading each first data block in a plurality of first data blocks in data to be processed, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix to obtain the conversion result matrix, wherein the at least one second data block reuses the first data block;
and performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, in any of the above method embodiments of the present disclosure, the first data block includes a row of data of the image in one channel.
Optionally, in any one of the method embodiments of the present disclosure, the second data block is a data block with a size of R × P, where R is a number of columns of the convolution kernel, and P is a number of times of horizontal sliding in each channel when the convolution kernel is used to perform convolution operation on the data to be processed.
Optionally, in any one of the method embodiments of the present disclosure, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block in the channel to which the first data block belongs.
Optionally, in any one of the method embodiments of the present disclosure, if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, where Q is the number of vertical slides of the convolution kernel in each channel, the number of second data blocks corresponding to the first data block is R;
and if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, where S is the number of rows of the convolution kernel.
Optionally, in any one of the above method embodiments of the present disclosure, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Optionally, in any one of the method embodiments of the present disclosure, before writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix, and obtaining the conversion result matrix, the method further includes:
determining a data reuse relationship between the at least one second data block and the first data block in the conversion result matrix based on the size of the data to be processed and the size of the convolution kernel;
the writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix includes:
writing the first data block into the at least one second data block in the conversion result matrix based on a data reuse relationship of the at least one second data block in the conversion result matrix with the first data block.
Optionally, in any one of the method embodiments of the present disclosure, the reading each first data block in a plurality of first data blocks in the data to be processed includes:
allocating a thread block to each first data block in the data to be processed;
and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
Optionally, in any one of the method embodiments of the present disclosure, the allocating a thread block to each first data block in the to-be-processed data includes:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in a plurality of thread blocks;
and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
Optionally, in any one of the method embodiments of the present disclosure, the determining, based on a number of each thread block in the multiple thread blocks, a corresponding first data block of each thread block in the to-be-processed data includes:
and determining an image corresponding to the first thread block in the data to be processed, a channel in the image and a row corresponding to the first data block in the channel based on the serial number of the first thread block in the thread blocks, the number of channels C contained in each image in the data to be processed and the number of thread blocks required by each channel.
Optionally, in any of the method embodiments of the present disclosure above, the method further includes:
storing the read first data block into a shared memory by using the thread block allocated to the first data block;
the writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix includes:
reading at least a portion of the first data block stored in the shared memory, and writing the read data into a second data block in the conversion result matrix.
Optionally, in any of the above method embodiments of the present disclosure, each of the thread blocks includes T threads;
before storing the read first data block into the shared memory by using the thread block allocated to the first data block, the method further includes:
and determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
Optionally, in any one of the method embodiments of the present disclosure, the writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix includes:
and writing the data read by sliding the first data block each time into one row of each second data block corresponding to the first data block.
According to still another aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:
a data reuse relation determining unit, configured to determine a data reuse relation between at least one second data block in a conversion result matrix corresponding to data to be processed and a first data block in the data to be processed based on a size of the data to be processed and a size of a convolution kernel, where the data to be processed includes N images, each of the images includes C channels, and at least one of N and C is an integer greater than 1;
the data conversion unit is used for performing conversion processing on the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to a first data block in the data to be processed to obtain the conversion result matrix;
and the result calculation unit is used for performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, in any of the above apparatus embodiments of the present disclosure, the first data block includes a row of data of the image in one channel.
Optionally, in an embodiment of any one of the apparatuses in the present disclosure, the second data block is a data block with a size of R × P, where R is a number of columns of the convolution kernel, and P is a number of times of horizontal sliding in each channel when the convolution kernel is used to perform convolution operation on the data to be processed.
Optionally, in any apparatus embodiment of the present disclosure, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block in the channel to which the first data block belongs.
Optionally, in any apparatus embodiment of the present disclosure, if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, where Q is the number of vertical slides of the convolution kernel in each channel, the number of second data blocks corresponding to the first data block is R;
and if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, where S is the number of rows of the convolution kernel.
Optionally, in any one of the apparatus embodiments of the present disclosure, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Optionally, in an embodiment of the apparatus of the present disclosure, the data conversion unit is configured to allocate a thread block to each first data block in the to-be-processed data based on a data reuse relationship of at least one second data block in the conversion result matrix to the first data block in the to-be-processed data; reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block, and writing the read first data block into at least one second data block which has reuse relation to the first data block in the conversion result matrix.
Optionally, in any apparatus embodiment of the present disclosure above, when allocating a thread block to each first data block in the to-be-processed data based on a data reuse relationship of a plurality of second data blocks in the conversion result matrix to the first data block in the to-be-processed data, the data conversion unit is configured to determine, based on a number of each thread block in a plurality of thread blocks, a corresponding first data block in the to-be-processed data of each thread block; and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
Optionally, in any apparatus embodiment of the present disclosure, when determining, based on a number of each thread block in a plurality of thread blocks, a corresponding first data block of each thread block in the to-be-processed data, the data conversion unit is configured to determine, based on the number of the first thread block in the plurality of thread blocks, a number of channels C included in each image in the to-be-processed data, and a number of thread blocks required by each channel, an image corresponding to the first thread block in the to-be-processed data, a channel in the image, and a row corresponding to the first data block in the channel.
Optionally, in any one of the apparatus embodiments of the present disclosure above, the apparatus further includes:
a data reading unit, configured to store the read first data block in a shared memory by using a thread block allocated to the first data block;
the data conversion unit is configured to read at least a portion of the first data block stored in the shared memory, and write the read data into a second data block in the conversion result matrix.
Optionally, in any of the above apparatus embodiments of the present disclosure, each of the thread blocks includes T threads;
the data reading unit is further configured to determine, based on the size W of the shared memory and the number T of threads, data of the first data block read by each thread of the T threads.
Optionally, in an embodiment of any one of the apparatuses in the present disclosure, when reading the first data block from the to-be-processed data by using the thread block allocated to the first data block, and writing the read first data block into at least one second data block in the conversion result matrix, where the second data block has a reuse relationship with the first data block, the data conversion unit is specifically configured to perform sliding reading on the first data block by using the thread block allocated to the first data block, and write the data read each time into one row of each second data block corresponding to the first data block.
According to still another aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including:
the data reading unit is used for reading each first data block in a plurality of first data blocks in data to be processed, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
a data conversion unit, configured to write the first data block into at least one second data block corresponding to the first data block in a conversion result matrix, to obtain the conversion result matrix, where the at least one second data block reuses the first data block;
and the result calculation unit is used for performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, in any of the above apparatus embodiments of the present disclosure, the first data block includes a row of data of the image in one channel.
Optionally, in an embodiment of any one of the apparatuses in the present disclosure, the second data block is a data block with a size of R × P, where R is a number of columns of the convolution kernel, and P is a number of times of horizontal sliding in each channel when the convolution kernel is used to perform convolution operation on the data to be processed.
Optionally, in any apparatus embodiment of the present disclosure, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block in the channel to which the first data block belongs.
Optionally, in any apparatus embodiment of the present disclosure, if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, where Q is the number of vertical slides of the convolution kernel in each channel, the number of second data blocks corresponding to the first data block is R;
and if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, where S is the number of rows of the convolution kernel.
Optionally, in any one of the apparatus embodiments of the present disclosure, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Optionally, in any one of the apparatus embodiments of the present disclosure, the apparatus further includes:
a data reuse relation determining unit, configured to determine a data reuse relation between the at least one second data block and the first data block in the conversion result matrix based on a size of data to be processed and a size of the convolution kernel;
the data conversion unit is configured to write the first data block into the at least one second data block in the conversion result matrix based on a data reuse relationship between the at least one second data block in the conversion result matrix and the first data block.
Optionally, in an embodiment of any one of the apparatus of the present disclosure above, the data reading unit is configured to allocate a thread block to each first data block in the to-be-processed data; and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
Optionally, in any apparatus embodiment of the present disclosure above, when allocating a thread block to each first data block in the to-be-processed data, the data reading unit is configured to determine, based on a number of each thread block in a plurality of thread blocks, a corresponding first data block in the to-be-processed data of each thread block; and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
Optionally, in an embodiment of the apparatus of the present disclosure, when determining, based on a number of each thread block in a plurality of thread blocks, a corresponding first data block of each thread block in the to-be-processed data, the data reading unit is configured to determine, based on the number of the first thread block in the plurality of thread blocks, a number of channels C included in each image in the to-be-processed data, and a number of thread blocks required by each channel, an image corresponding to the first thread block in the to-be-processed data, a channel in the image, and a row corresponding to the first data block in the channel.
Optionally, in an embodiment of any one of the apparatuses of the present disclosure, the data reading unit is further configured to store the read first data block in a shared memory by using a thread block allocated to the first data block;
the data conversion unit is configured to read at least a portion of the first data block stored in the shared memory, and write the read data into a second data block in the conversion result matrix.
Optionally, in any of the above apparatus embodiments of the present disclosure, each of the thread blocks includes T threads;
the data reading unit is further configured to determine, based on the size W of the shared memory and the number T of threads, data of the first data block read by each thread of the T threads.
Optionally, in an embodiment of any one of the apparatuses in the present disclosure, the data conversion unit is configured to write the data read by sliding the first data block each time into one row of each second data block corresponding to the first data block.
According to a further aspect of the embodiments of the present disclosure, there is provided an electronic device including a processor including the data processing apparatus as described in any one of the above.
According to still another aspect of the embodiments of the present disclosure, there is provided an electronic device, including: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the data processing method as described in any one of the above.
According to still another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium for storing computer-readable instructions which, when executed, perform the operations of the data processing method as described in any one of the above.
According to a further aspect of the embodiments of the present disclosure, there is provided a computer program product including computer readable code, wherein when the computer readable code runs on a device, a processor in the device executes instructions for implementing the data processing method according to any one of the above.
According to yet another aspect of the embodiments of the present disclosure, there is provided another computer program product for storing computer readable instructions, which when executed, cause a computer to perform the operations of the data processing method in any one of the above possible implementations.
In an alternative embodiment the computer program product is embodied as a computer storage medium, and in another alternative embodiment the computer program product is embodied as a software product, such as an SDK or the like.
According to the embodiment of the present disclosure, another data processing method and apparatus, an electronic device, a computer storage medium, and a computer program product are also provided, wherein a data reuse relationship of at least one second data block in a conversion result matrix corresponding to data to be processed with respect to a first data block in the data to be processed is determined based on a size of the data to be processed and a size of a convolution kernel, and the data to be processed is converted based on the data reuse relationship of the at least one second data block in the conversion result matrix with respect to the first data block in the data to be processed, so as to obtain a conversion result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Based on the data processing method and apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure, the data reuse relationship of at least one second data block in the conversion result matrix corresponding to the data to be processed with respect to a first data block in the data to be processed is determined based on the size of the data to be processed and the size of the convolution kernel, and the data to be processed is converted based on that data reuse relationship to obtain the conversion result matrix; owing to the data reuse relationship, the speed of obtaining the conversion result matrix is increased and the difficulty of converting the convolution calculation into a matrix multiplication is reduced; the conversion result matrix is then multiplied with the convolution kernel to obtain the convolution result of the data to be processed.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of convolution operation.
Fig. 2 is a schematic diagram of convolution calculation based on matrix multiplication.
Fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure.
FIG. 4 is an exemplary diagram of obtaining a result matrix based on data to be processed according to an embodiment of the disclosure.
Fig. 5 is a schematic diagram of a data reuse structure of a data processing method according to an embodiment of the present disclosure.
Fig. 6 is another schematic flow chart of the data processing method according to an embodiment of the disclosure.
FIG. 7 is a schematic diagram of a graphics processor.
Fig. 8 is a schematic diagram illustrating a thread block reading data in a first data block in a data processing method according to an embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 10 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.
Fig. 12 is a schematic structural diagram of an electronic device suitable for implementing a terminal device or a server according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
First, a simple description is made of the convolution operation related to the embodiment of the present disclosure, so as to better understand the technical solution of the embodiment of the present disclosure.
Given an input feature tensor (Input Feature), typically a C × H × W tensor, where C is the number of channels and H × W is the height and width of the feature matrix; and a series of convolution kernels, typically expressed as a K × C × R × S tensor, where K is the number of convolution kernels, C is the number of channels of each convolution kernel (consistent with the number of channels of the input feature tensor), and R × S is the height and width of the convolution kernels.
The convolution definition is shown in fig. 1, which is a schematic diagram of the convolution operation. The input feature tensor in this example is 3 × 5 × 5, the convolution kernel is 1 × 3 × 3 × 3, and the convolution operation consists of two steps. First, each channel of the convolution kernel is placed over the top left corner of the corresponding channel of the feature tensor, as shown in fig. 1, and the feature tensor of the corresponding region and the convolution kernel are respectively unfolded to form two vectors (an input feature vector and a convolution kernel vector). Second, a vector multiplication operation is performed, and the result is one element of the output feature. Then the convolution kernel is slid and the two operations are repeated. In this example, three positions can be slid in each of the lateral and longitudinal directions, so a 3 × 3 output feature matrix can be generated. Using the same approach for each convolution kernel, each kernel yields a 3 × 3 output feature matrix, so when the number of convolution kernels is K, the finally produced output feature tensor is K × 3 × 3.
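For reference, the sliding-window definition above can be written directly as nested loops. The following is a minimal host-side sketch in CUDA C++ (the function name, the single-image layout, stride 1, and the absence of padding are illustrative assumptions, not part of the patent; S denotes kernel rows and R kernel columns, as in the later description):

```cuda
// Direct convolution per the sliding-window definition above.
// Assumed layouts: input in[C][H][W], kernels k[K][C][S][R],
// output out[K][Q][P] with P = W - R + 1 and Q = H - S + 1.
void direct_conv(const float* in, const float* k, float* out,
                 int C, int H, int W, int K, int S, int R) {
    int P = W - R + 1, Q = H - S + 1;
    for (int kk = 0; kk < K; ++kk)
        for (int q = 0; q < Q; ++q)          // vertical slide
            for (int p = 0; p < P; ++p) {    // horizontal slide
                float acc = 0.f;
                for (int c = 0; c < C; ++c)  // vector multiplication over one window
                    for (int s = 0; s < S; ++s)
                        for (int r = 0; r < R; ++r)
                            acc += in[(c * H + q + s) * W + p + r]
                                 * k[((kk * C + c) * S + s) * R + r];
                out[(kk * Q + q) * P + p] = acc;
            }
}
```

With the 3 × 5 × 5 input and 1 × 3 × 3 × 3 kernel of fig. 1, P = Q = 3 and the output is 1 × 3 × 3, matching the example.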
There are two main classes of algorithms for implementing convolution operators: the first class is based on matrix multiplication, and the second class is based on fast algorithms. The first class is a general approach suitable for all types of convolution; the second class can only target specific scenarios, such as a convolution kernel size of 3 × 3 with a convolution kernel sliding step of 1.
Fig. 2 is a schematic diagram of convolution calculation based on matrix multiplication. As shown in fig. 2, the convolution kernels can be regarded as a matrix with K rows and C × R × S columns; the K × C × R × S four-dimensional tensor and the matrix with K rows and C × R × S columns differ only in how the data are logically interpreted, not in physical storage, so no tensor-to-matrix conversion is needed. In this example, the convolution kernels form the A matrix, of size 64 × 27. For the input feature tensor, the data corresponding to each sliding window can be expanded into one column of the matrix B_{K×N}, so the input feature tensor can be converted into a matrix with C × R × S rows and P × Q columns, where P and Q are the number of times the convolution kernel can be slid in the lateral and longitudinal directions, respectively. In this example, the sliding window can be slid 222 × 222 = 49284 times along the horizontal and vertical directions, so the input feature tensor needs to be converted into a B matrix of size 27 × 49284. The conversion process from the input feature tensor to the input feature matrix is referred to as the im2col operation.
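The im2col expansion described above can be sketched as follows for a single image (a naive illustration under assumed row-major layouts; real implementations avoid this element-by-element copying):

```cuda
// Naive im2col: expand in[C][H][W] into B with C*S*R rows and P*Q columns,
// so that the convolution becomes the GEMM C[K x (P*Q)] = A[K x (C*S*R)] * B.
void im2col(const float* in, float* B, int C, int H, int W, int S, int R) {
    int P = W - R + 1, Q = H - S + 1;
    for (int c = 0; c < C; ++c)
        for (int s = 0; s < S; ++s)
            for (int r = 0; r < R; ++r)           // B row = (c*S + s)*R + r
                for (int q = 0; q < Q; ++q)
                    for (int p = 0; p < P; ++p)   // B col = q*P + p
                        B[(((c * S + s) * R + r) * Q + q) * P + p] =
                            in[(c * H + q + s) * W + p + r];
}
```

Note that each input element is copied into up to R × S positions here; the data reuse relation exploited by the disclosed embodiments removes exactly this redundancy.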
In the course of researching convolution operations based on fast algorithms, the inventors discovered a data reuse pattern in the fast algorithm and, based on the discovered data reuse pattern, propose a new fast convolution operation, thereby improving the performance of the convolution operation.
It should be understood that the embodiment of the present disclosure is mainly applied to im2col fast convolution operation, but may also be applied to other similar fast convolution operations, which is not limited by the embodiment of the present disclosure. In addition, the embodiment of the disclosure can be applied to processing systems such as a GPU.
Fig. 3 is a schematic flow chart of a data processing method according to an embodiment of the present disclosure.
Step 310, determining a data reuse relationship of at least one second data block in the conversion result matrix corresponding to the data to be processed to the first data block in the data to be processed, based on the size of the data to be processed and the size of the convolution kernel.
The data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1.
In the process of implementing the present disclosure, the inventors found that, in converting an input data (e.g., image or feature) tensor into an input data matrix, there is a correspondence between the second data blocks in the obtained conversion result matrix and the first data blocks in the data to be processed, and each first data block corresponds to at least one second data block.
Step 320, performing conversion processing on the data to be processed based on the data reuse relationship of the at least one second data block in the conversion result matrix to the first data block in the data to be processed, so as to obtain a conversion result matrix.
The data to be processed is converted through the data reuse relationship; optionally, the data in each first data block can be written into the at least one second data block corresponding to that first data block. Compared with computing and writing each position of the conversion result matrix independently from the original data, writing whole corresponding data blocks in the disclosed embodiments increases the data-entry speed, reduces the amount of calculation, and realizes fast conversion of the data to be processed.
And 330, performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
The present disclosure is directed to convolution algorithms based on matrix multiplication. In the definition of convolution, the computation of each element of the output feature tensor is a vector multiplication. Thus, the input feature tensor can be converted into a corresponding matrix by data rearrangement, and the convolution operation is further converted into a matrix multiplication C = A × B, where A_{M×K}, B_{K×N} and C_{M×N} are matrices.
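For completeness, the final multiplication of step 330 has the following shape; a minimal sketch (in practice a tuned GEMM library would be used, and the names are illustrative; here M is the number of convolution kernels, the inner dimension is C·R·S, and N is P·Q per image):

```cuda
// Naive GEMM completing the convolution: Cmat[M x N] = A[M x K] * B[K x N].
void matmul(const float* A, const float* B, float* Cmat, int M, int K, int N) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            Cmat[i * N + j] = acc;
        }
}
```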
Based on the data processing method provided by the embodiments of the present disclosure, the data reuse relationship of a plurality of second data blocks in the conversion result matrix corresponding to the data to be processed with respect to a first data block in the data to be processed is determined based on the size of the data to be processed and the size of the convolution kernel, and the data to be processed is converted based on that data reuse relationship to obtain the conversion result matrix; owing to the data reuse relationship, the speed of obtaining the conversion result matrix is increased and the difficulty of converting the convolution calculation into a matrix multiplication is reduced; the conversion result matrix is then multiplied with the convolution kernel to obtain the convolution result of the data to be processed.
In one or more alternative embodiments, the first data block includes a row of data of the image in one channel.
In the embodiments of the disclosure, each first data block corresponds to one row of one image in one channel, and in one channel an image contains as many first data blocks as it has rows. For example, if the image size is H × M × C, where C is the number of channels, M is the height, and H is the width, and every 1 × H run of data is taken as one row, then each first data block contains H data elements, the image contains M first data blocks in one channel, and the image contains M × C first data blocks in total. The embodiments of the disclosure let one row of an image in one channel correspond to one first data block because the inventors found, through research, that one row of data of an image in one channel corresponds to identifiable data blocks in the conversion result matrix.
In one or more alternative embodiments, the second data block is a data block of size R × P, where R is the number of columns of the convolution kernel and P is the number of horizontal slides in each channel when the convolution kernel is used to perform the convolution operation on the data to be processed.
FIG. 4 is an exemplary diagram of obtaining a result matrix based on data to be processed according to an embodiment of the disclosure. As shown in fig. 4, the data to be processed includes two images, each image includes two channels, the size of the image in each channel is 6 (rows) × 6 (columns), and the convolution kernel size in this embodiment is 3 (rows) × 3 (columns). Through research, the inventors found that the following rules exist between the data to be processed and the conversion result matrix. First, the data in one channel of an input image corresponds, in the converted conversion result matrix, to a data block with S × R rows and P × Q columns, where S is the number of convolution kernel rows, R the number of convolution kernel columns, P the number of lateral slides, and Q the number of longitudinal slides. An image includes C channels, and the (S × R) × (P × Q) data blocks corresponding to the channels are arranged in the vertical direction, forming a matrix with C × S × R rows and P × Q columns. When the data to be processed includes N images, the matrices are spread along the horizontal direction, forming a conversion result matrix with C × S × R rows and P × Q × N columns.
Second, the inventors of the present application found through research that one row of data (H pixels) of an image in one channel corresponds to a data block of size P × R = 4 × 3 (the number of horizontal slides × the number of convolution kernel columns) in the converted matrix. Therefore, each row in the input feature tensor needs to be read only once to complete a data block of size P × R, avoiding repeated reads of the video memory.
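Under the layout described by the two rules above, the R × P second data block for image n, channel c, kernel row s, and vertical slide q can be located as follows (a sketch under the assumed row-major layout of the C × S × R by P × Q × N conversion result matrix; not text from the patent):

```cuda
// Offsets of the R x P second data block indexed by (image n, channel c,
// kernel row s, vertical slide q) inside the (C*S*R) x (P*Q*N) result matrix.
struct BlockPos { int row; int col; };

BlockPos second_block_pos(int n, int c, int s, int q, int S, int R, int P, int Q) {
    BlockPos pos;
    pos.row = (c * S + s) * R;   // R consecutive rows (kernel columns)
    pos.col = (n * Q + q) * P;   // P consecutive columns (horizontal slides)
    return pos;
}
```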
In addition to the above two points, the inventors found during research that each first data block corresponds to at least one second data block within the (S × R) × (P × Q) data block of one channel of an image; optionally, the number of second data blocks corresponding to a first data block depends on the row number of the first data block in the channel to which it belongs.
Optionally, if the row number L corresponding to the first data block is less than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of second data blocks corresponding to the first data block is R;
and if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L.
In the embodiments of the present disclosure, the number of second data blocks corresponding to each first data block is determined, and first data blocks in different rows correspond to different numbers of second data blocks. Specifically, the row number L of the first data block handled by thread block b may be determined based on the following formula (1):

L = (b % ((Q + S - 1) × C)) % (Q + S - 1)    formula (1)

where L represents the row number corresponding to the current first data block, R represents the number of columns of the convolution kernel, b represents the thread block number, and % represents the modulo operation. When L < R, the number (Task) of second data blocks corresponding to the first data block is L; when R ≤ L ≤ Q, Task = R; and when L > Q, Task = Q + S - 1 - L. Owing to the data reuse relationship, each row of data of an image in each channel of the data to be processed needs to be read only once to complete multiple second data blocks of size P × R, which avoids repeated reads of the video memory.
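Transcribed literally, formula (1) and the three cases read as follows (a sketch only; note that for the square kernels in the patent's examples R = S, and the cases are taken exactly as stated above):

```cuda
// Number of second data blocks (Task) reused by the first data block in row L,
// a literal transcription of formula (1) and the three cases above.
// b is the thread block number; each channel has Q + S - 1 rows, each image C channels.
int task_count(int b, int C, int Q, int S, int R) {
    int L = (b % ((Q + S - 1) * C)) % (Q + S - 1);  // formula (1)
    if (L < R)       return L;
    else if (L <= Q) return R;
    else             return Q + S - 1 - L;
}
```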
When the at least one second data block corresponding to one first data block is a plurality of second data blocks, optionally, the plurality of second data blocks corresponding to the first data block includes at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Fig. 5 is a schematic diagram of a data reuse structure of a data processing method according to an embodiment of the present disclosure. As shown in fig. 5, in one example, the second data blocks of size R × P lying along the same diagonal correspond to one first data block, and for two adjacent second data blocks corresponding to the same first data block, the right neighbor of one is the lower neighbor of the other. In this manner, each first data block corresponds to a series of R × P second data blocks along a diagonal. The number of rows (i.e., the number of first data blocks) of an image in each input channel of the data to be processed is Q + S - 1.
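Equivalently, with 0-indexed rows, the diagonal of fig. 5 can be enumerated directly: the first data block in row l is reused by the second data blocks (kernel row s, vertical slide q) with s + q = l (an illustrative sketch; visit_block is a hypothetical callback, and the 0-indexing is an assumption):

```cuda
// Enumerate the second data blocks (kernel row s, vertical slide q) reusing the
// first data block in row l: the diagonal s + q = l of fig. 5 (0-indexed rows).
template <typename Visit>
void for_each_reusing_block(int l, int S, int Q, Visit visit_block) {
    for (int s = 0; s < S; ++s) {
        int q = l - s;
        if (q >= 0 && q < Q)
            visit_block(s, q);  // hypothetical callback, one call per reusing block
    }
}
```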
Fig. 6 is another schematic flow chart of the data processing method according to an embodiment of the disclosure. As shown in fig. 6:
step 610, determining a data reuse relationship of a plurality of second data blocks in a conversion result matrix corresponding to the data to be processed to a first data block in the data to be processed based on the size of the data to be processed and the size of the convolution kernel.
The data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1.
Optionally, in the embodiment of the present disclosure, the specific operation of step 610 may refer to step 310 in the above implementation, and is not described herein again.
Step 620, based on the data reuse relation of at least one second data block in the conversion result matrix to the first data block in the data to be processed, allocating a thread block to each first data block in the data to be processed.
Optionally, the data processing provided by the embodiments of the present disclosure may be implemented on a processing device, for example a graphics processing unit (GPU); the task a GPU actually executes is an instance of a computing task called a compute grid (Grid). A compute grid contains up to hundreds of thousands of thread blocks, and each thread block contains hundreds of threads. To increase the processing speed for the data to be processed, the utilization of thread blocks and threads needs to be increased; that is, an appropriate number of thread blocks should be used for the matrix conversion of the data to be processed. Optionally, in the embodiments of the present disclosure one thread block is allocated to each first data block, so that one channel of one image requires (Q + S - 1) thread blocks, corresponding to the number of rows of the image in one channel, and processing the whole of the data to be processed requires (Q + S - 1) × C × N thread blocks.
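A hedged launch sketch of this allocation (the kernel im2col_by_rows is sketched at the end of this description; the thread count T = 128 and all names are illustrative assumptions):

```cuda
// One thread block per first data block: (Q + S - 1) rows per channel,
// C channels per image, N images.
__global__ void im2col_by_rows(const float* in, float* out, int N, int C,
                               int H, int W, int S, int R, int P, int Q);

void launch_im2col(const float* d_in, float* d_out,
                   int N, int C, int H, int W, int S, int R) {
    int P = W - R + 1, Q = H - S + 1;     // stride 1, no padding assumed
    int num_blocks = (Q + S - 1) * C * N; // one block per row; Q + S - 1 == H
    int T = 128;                          // threads per block, illustrative
    im2col_by_rows<<<num_blocks, T>>>(d_in, d_out, N, C, H, W, S, R, P, Q);
}
```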
Step 630, reading the first data block from the data to be processed by using the thread block allocated to the first data block, and writing the read first data block into at least one second data block having a reuse relationship with the first data block in the conversion result matrix.
In the processing device, the thread block allocated to each first data block reads that first data block from the data to be processed and, according to the correspondence between the second data blocks and the first data block, writes the read data into all second data blocks corresponding to the first data block; in this way, efficient data indexing is realized, and the data to be processed is quickly converted into the conversion result matrix through reading and writing.
And step 640, performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
The embodiment of the disclosure improves the processing speed of the data to be processed by allocating the thread blocks, improves the utilization rate of the threads in the processing equipment, and reduces the number of idle threads.
Optionally, step 620 comprises:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in the thread blocks;
and each thread block in the plurality of thread blocks is allocated to the first data block corresponding to each thread block.
When a thread block is allocated to each first data block, optionally, the number of the thread block is used to calculate which first data block in the data to be processed the task allocated to that thread block corresponds to. Optionally, determining the first data block corresponding to each thread block in the data to be processed based on the number of the thread block may include: determining, based on the number b of the first thread block among the plurality of thread blocks, the number of channels C contained in each image of the data to be processed, and the number of thread blocks required by each channel, the image corresponding to the thread block in the data to be processed, the channel in that image, and the row corresponding to the first data block in that channel. The number of thread blocks required by each channel can be expressed as Q + S - 1. Specifically, if the thread block number is b, the corresponding image number is b / ((Q + S - 1) × C), where / denotes the integer quotient. For example, if b = 15, the data to be processed includes 2 images of 2 channels each, each channel includes 6 rows of data, Q = 4, and S = 3, then b / ((Q + S - 1) × C) = 15 / 12 = 1; since image numbering starts from 0, the thread block corresponds to image number 1, i.e., the second image in the data to be processed. Next, the channel within that image is determined as (b % ((Q + S - 1) × C)) / (Q + S - 1), where % denotes the modulo operation; in the same example, b % ((Q + S - 1) × C) = 15 % 12 = 3 and 3 / 6 = 0, so the thread block corresponds to channel number 0, i.e., the first channel. Finally, the row number of the image corresponding to the thread block in that channel is (b % ((Q + S - 1) × C)) % (Q + S - 1); in the example, 3 % 6 = 3, i.e., the thread block corresponds to row number 3 in channel number 0, which, since row numbering starts from 0, is the 4th row.
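The decomposition just described, with the b = 15 worked example, can be sketched as follows (names are illustrative):

```cuda
// Decompose thread block number b into (image n, channel c, row l),
// following the quotient/modulo scheme described above.
// Worked example from the text: b = 15, C = 2, Q = 4, S = 3 (6 rows per channel):
//   n = 15 / 12 = 1       -> second image
//   c = (15 % 12) / 6 = 0 -> first channel
//   l = (15 % 12) % 6 = 3 -> fourth row
__device__ void decode_block(int b, int C, int Q, int S,
                             int* n, int* c, int* l) {
    int rows = Q + S - 1;         // first data blocks per channel
    *n = b / (rows * C);          // image index
    *c = (b % (rows * C)) / rows; // channel index within the image
    *l = (b % (rows * C)) % rows; // row index within the channel
}
```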
In one or more optional embodiments, the data processing method provided in the embodiments of the present disclosure further includes:
storing the read first data block into a shared memory by using a thread block allocated to the first data block;
writing the read first data block into at least one second data block having a reuse relationship with the first data block in the conversion result matrix, including:
reading at least a portion of a first data block stored in the shared memory and writing the read data to a second data block in the conversion result matrix.
When data processing is implemented on a processing device, owing to the characteristics of that device and in order to reduce latency and accelerate the convolution, the embodiment of the disclosure stores the data to be processed into at least one shared memory through the thread block allocated to each first data block; during computation, the thread block reads the data directly from the shared memory, which improves processing efficiency.
FIG. 7 is a diagram illustrating an image processor (GPU). As shown in fig. 7, the GPU is composed of a plurality of streaming multiprocessors and one video memory. Each streaming multiprocessor internally comprises a plurality of vector computing units and a shared memory, and the latency of the computing units reading the shared memory is far lower than that of reading the video memory. Therefore, to improve computational efficiency, proper use of the shared memory is important. A thread block may request a portion of shared memory, which is visible to all threads in that block; the task of one thread block is therefore treated as a basic task unit. Because the size of the shared memory on a streaming multiprocessor is limited, the larger the shared memory requested by each thread block, the fewer thread blocks one streaming multiprocessor can execute simultaneously, and, generally, the worse the performance. In the embodiment of the present disclosure, by allocating one thread block to each first data block, the number of thread blocks that can execute simultaneously is increased, and the performance of the processing device is improved. A minimal shared-memory sketch follows.
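As a rough illustration of the per-block shared-memory staging just described: the kernel name, the float element type, and the strided copy pattern below are assumptions, not the patent's implementation (the contiguous per-thread split the text describes next is a variant of this copy).

// A minimal sketch: one thread block stages one first data block
// (one W-pixel image row) in dynamically sized shared memory.
extern __shared__ float row_buf[];  // sized at launch: W * sizeof(float)

__global__ void stage_row(const float *src, int W) {
    // Coalesced, strided copy: thread t loads elements t, t+T, t+2T, ...
    for (int i = threadIdx.x; i < W; i += blockDim.x)
        row_buf[i] = src[blockIdx.x * W + i];
    __syncthreads();  // the row is now visible to every thread in the block
    // ... subsequent reads hit shared memory instead of video memory ...
}
// Launch sketch: stage_row<<<numBlocks, T, W * sizeof(float)>>>(d_src, W);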
Optionally, each thread block comprises T threads;
before the first data block is read from the data to be processed by using the thread block allocated to the first data block, the method further comprises the following steps:
and determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
When the size of the shared memory is W, one row of data of one channel of the image is read into the shared memory by the T threads of the thread block, each thread being responsible for reading W/T pixels. When W is not exactly divisible by T, let tmp denote the remainder of W divided by T, which in this case is not 0; the first tmp threads are then set to read W/T + 1 pixels each (W/T here denoting the integer quotient of W divided by T), and each of the remaining T - tmp threads reads W/T pixels, as sketched below.
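A minimal sketch of this uneven split, with illustrative names (the patent gives no code for it):

// Thread t of T copies its contiguous share of a W-pixel row into dst;
// the first tmp = W % T threads take one extra pixel each.
__device__ void copy_share(const float *src, float *dst, int W, int t, int T) {
    int base  = W / T;                           // integer quotient
    int tmp   = W % T;                           // remainder
    int count = base + (t < tmp ? 1 : 0);        // pixels this thread reads
    int start = t * base + (t < tmp ? t : tmp);  // offset of its contiguous chunk
    for (int i = 0; i < count; ++i)
        dst[start + i] = src[start + i];
}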
In one or more alternative embodiments, step 630 includes:
and performing sliding reading on the first data block through the thread block allocated to the first data block, and writing the data read each time into one row in each second data block corresponding to the first data block.
Optionally, for each shared memory, the feature data stored therein in vector form is read in sliding windows of length P with step 1, and the data read each time is written into one row of a second data block; once the feature data stored in all the shared memories has been read out into the result matrix, the conversion used to obtain the convolution calculation result of the feature set to be processed is complete. Fig. 8 is a schematic diagram illustrating a thread block reading data in a first data block in a data processing method according to an embodiment of the present disclosure. As shown in fig. 8, there is a correspondence between the data in each first data block and the data in its corresponding second data blocks, and the data of a first data block is written into the second data blocks by sliding reads, thereby implementing the conversion of the data to be processed. Each first data block corresponds to one thread block, and each thread block comprises T threads, so each thread is responsible for writing (P × R)/T data, as sketched below.
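A sketch of the sliding write under an assumed row-major layout for the conversion result matrix; the names and the leading-dimension parameter ld are illustrative, not the patent's kernel.

// row_buf holds one first data block in shared memory; dst points at one
// R x P second data block inside the conversion result matrix, whose leading
// dimension is ld. The P-pixel window starting at offset r (step 1) becomes
// row r of the block, so the T threads of the block write roughly
// (R * P) / T elements each.
__device__ void write_second_block(const float *row_buf, float *dst,
                                   int R, int P, int ld) {
    for (int idx = threadIdx.x; idx < R * P; idx += blockDim.x) {
        int r = idx / P;                   // which sliding window / block row
        int p = idx % P;                   // position inside the window
        dst[r * ld + p] = row_buf[r + p];  // window of length P, step 1
    }
}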
In the embodiment of the present disclosure, a plurality of thread blocks execute in parallel on the GPU to increase the speed of the convolution calculation. Designing a parallel algorithm on the GPU requires consideration of three aspects: first, how to split the complete task into independent subtasks (in the embodiment of the present disclosure, how the first data blocks are determined); second, how to assign the subtasks to GPU thread blocks (in the embodiment of the present disclosure, how the number of thread blocks is determined); and third, the implementation details of each subtask (in the embodiment of the present disclosure, each thread block stores the data of one first data block into the shared memory and then reads the data from the shared memory and writes it into all second data blocks corresponding to that first data block).
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 9, the apparatus of this embodiment includes:
the data reuse relationship determining unit 91 is configured to determine, based on the size of the to-be-processed data and the size of the convolution kernel, a data reuse relationship between at least one second data block in the conversion result matrix corresponding to the to-be-processed data and the first data block in the to-be-processed data.
The data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1.
The data conversion unit 92 is configured to perform conversion processing on the data to be processed based on the data reuse relationship of the first data block in the data to be processed in the at least one second data block in the conversion result matrix, so as to obtain a conversion result matrix.
And the result calculating unit 93 is configured to perform matrix multiplication on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Based on the data processing device provided by the above embodiment of the present disclosure, based on the size of the data to be processed and the size of the convolution kernel, the data reuse relationship of the plurality of second data blocks in the conversion result matrix corresponding to the data to be processed with respect to the first data block in the data to be processed is determined, and based on the data reuse relationship of the plurality of second data blocks in the conversion result matrix with respect to the first data block in the data to be processed, the data to be processed is converted to obtain the conversion result matrix, and due to the existence of the data reuse relationship, the speed of obtaining the conversion result matrix is increased, and the difficulty of converting convolution calculation into matrix multiplication is reduced; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, the first data block comprises one line of data of the image in one channel.
Optionally, the second data block is a data block with a size R × P, where R is the number of columns of the convolution kernel, and P is the number of horizontal sliding in each channel when the convolution kernel is used to perform a convolution operation on the data to be processed.
Optionally, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block corresponding to the channel to which the first data block belongs.
Optionally, if the number of rows L corresponding to the first data block is less than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the line number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of the second data blocks corresponding to the first data block is R;
and if the line number L corresponding to the first data block is larger than Q, the number of the second data blocks corresponding to the first data block is Q + S-1-L.
Optionally, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
In one or more optional embodiments, the data conversion unit 92 is configured to allocate a thread block to each first data block in the data to be processed based on a data reuse relationship of at least one second data block in the conversion result matrix to the first data block in the data to be processed; and reading the first data block from the data to be processed by utilizing the thread block allocated for the first data block, and writing the read first data block into at least one second data block which has reuse relation to the first data block in the conversion result matrix.
Optionally, the data conversion unit 92 is configured to, when allocating a thread block to each first data block in the data to be processed based on the data reuse relationship between the plurality of second data blocks in the conversion result matrix and the first data block in the data to be processed, determine, based on the number of each thread block in the plurality of thread blocks, a corresponding first data block in the data to be processed of each thread block; and each thread block in the plurality of thread blocks is allocated to the first data block corresponding to each thread block.
Optionally, when determining, based on the number of each thread block in the plurality of thread blocks, a corresponding first data block in the data to be processed of each thread block, the data conversion unit 92 is configured to determine, based on the number of the first thread block in the plurality of thread blocks, the number of channels C included in each image in the data to be processed, and the number of thread blocks required by each channel, an image corresponding to the first thread block in the data to be processed, a channel in the image, and a row corresponding to the first data block in the channel.
Optionally, the apparatus provided in the embodiment of the present disclosure further includes:
the data reading unit is used for storing the read first data block into the shared memory by using the thread block allocated to the first data block;
the data conversion unit 92 is configured to read at least a portion of the first data block stored in the shared memory, and write the read data into the second data block in the conversion result matrix.
Optionally, each thread block comprises T threads;
and the data reading unit is further used for determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
Optionally, when the data conversion unit reads the first data block from the data to be processed by using the thread block allocated to the first data block and writes the read first data block into at least one second data block having a reuse relationship with the first data block in the conversion result matrix, the data conversion unit is specifically configured to perform sliding reading on the first data block by using the thread block allocated to the first data block and write the data read each time into one row of each second data block corresponding to the first data block.
Fig. 10 is a schematic flowchart of a data processing method according to an embodiment of the present disclosure.
Step 1010, reading each first data block of a plurality of first data blocks in the data to be processed.
The data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1.
In the embodiment of the present disclosure, data to be processed is decomposed into a plurality of first data blocks, optionally, any part of an image in one channel is taken as one first data block, for example, one line of the image in one channel is taken as one first data block, or parts of the image in multiple lines or one line in one channel are taken as one first data block, and the size of the specific first data block is not limited in the embodiment of the present disclosure.
Step 1020, writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix to obtain the conversion result matrix.
Wherein the at least one second data block reuses the first data block.
In the embodiment of the present disclosure, each first data block corresponds to at least one second data block, and through such a one-to-many correspondence relationship, the obtaining process of the conversion result matrix is simplified, for example, data in each first data block may be written into each second data block of the at least one second data block corresponding thereto, so that conversion efficiency is improved.
And step 1030, performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
The present disclosure is directed to convolution algorithms based on matrix multiplication. In the definition of convolution, the computation for the elements in the output feature tensor is a vector multiplication. Thus, the input feature tensor can be converted into a corresponding matrix by data rearrangement, and the convolution operation is further converted into a matrix multiplication C = A × B, where A is an M × K matrix, B is a K × N matrix, and C is an M × N matrix.
Based on the data processing method provided by the above embodiment of the present disclosure, each of a plurality of first data blocks in data to be processed is read, and the first data block is written into at least one second data block corresponding to the first data block in a conversion result matrix to obtain a conversion result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
In one or more alternative embodiments, the first data block includes a line of data of the image in one pass.
In an embodiment of the disclosure, each first data block corresponds to one row of one image in one channel, so an image contains, in each channel, a number of first data blocks equal to its number of rows. For example, if the image size is H × M × C, where C is the number of channels, M is the height, and H is the width, and every 1 × H run of data is taken as one row, then each first data block includes H data, the image includes M first data blocks in one channel, and the image includes M × C first data blocks in total. The embodiment of the present disclosure takes one row of the image in one channel as a first data block because the inventors found, through research, that one row of data of an image in one channel corresponds to a fixed set of data blocks in the conversion result matrix.
In one or more alternative embodiments, the second data block is a data block with a size R × P, where R is the number of columns of the convolution kernel, and P is the number of horizontal slips in each channel when the convolution kernel is used to perform the convolution operation on the data to be processed.
The inventors found through research that the following regularities exist between the data to be processed and the conversion result matrix. First, the data in one channel of an input image corresponds, in the converted conversion result matrix, to a data block of size (S × R) × (P × Q), where S is the number of convolution kernel rows, R the number of convolution kernel columns, P the number of lateral shifts, and Q the number of longitudinal shifts. An image includes C channels, and the data blocks corresponding to the C channels are arranged in the vertical direction, forming a matrix with C × S × R rows and P × Q columns. When the data to be processed includes N images, the matrix is extended in the horizontal direction, forming a conversion result matrix with C × S × R rows and P × Q × N columns, as summarized below.
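In LaTeX form, the sizes just described read as follows (with A denoting the conversion result matrix; the notation is ours, not the patent's):

\begin{aligned}
\text{one channel} &\;\to\; (S\,R)\times(P\,Q),\\
C \text{ channels stacked vertically} &\;\to\; (C\,S\,R)\times(P\,Q),\\
N \text{ images tiled horizontally} &\;\to\; A \in \mathbb{R}^{(C\,S\,R)\times(P\,Q\,N)}.
\end{aligned}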
Secondly, the inventors found through research that one row of data (H pixels) of an image in one channel corresponds, in the converted matrix, to data blocks of size P × R (number of lateral shifts × number of convolution kernel columns), e.g. 4 × 3 in the running example where P = 4 and R = 3. Therefore, each row of the input feature tensor need be read only once to complete a data block of size P × R, avoiding repeated reads of the video memory.
In addition to the above two points, the inventors found in the research process that each first data block corresponds to at least one second data block in the corresponding S × R × P × Q data blocks in one channel of the image, and optionally, the number of the at least one second data block corresponding to the first data block depends on the corresponding row number of the first data block in the channel to which the first data block belongs.
Optionally, if the number of rows L corresponding to the first data block is less than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the line number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of the second data blocks corresponding to the first data block is R;
and if the line number L corresponding to the first data block is larger than Q, the number of the second data blocks corresponding to the first data block is Q + S-1-L.
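The three cases above can be stated as a single piecewise formula (a LaTeX restatement, presumably the formula (1) referenced below), where L is the row number of the first data block within its channel and n(L) is the number of corresponding second data blocks:

n(L) =
\begin{cases}
L, & L < R,\\
R, & R \le L \le Q,\\
Q + S - 1 - L, & L > Q.
\end{cases}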
In the embodiment of the present disclosure, the number of the second data blocks corresponding to each first data block is determined as above; this number differs for first data blocks in different rows, and specifically may be determined based on the above formula (1). Due to the data reuse relationship, each row of data of the image in each channel of the data to be processed need be read only once to complete a plurality of second data blocks of size P × R, thus avoiding repeated reads of the video memory.
When the at least one second data block corresponding to one first data block is a plurality of second data blocks, optionally, the plurality of second data blocks corresponding to the first data block includes at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
As shown in fig. 5, in one example, the second data blocks of size R × P lying along the same diagonal correspond to one first data block, and of two adjacent second data blocks corresponding to the same first data block, the right neighbor of one is the lower neighbor of the other. In this manner, each first data block corresponds to a series of second data blocks of size R × P along a diagonal. The number of rows of the image to be processed in each input channel of the data to be processed is Q + S - 1.
In one or more optional embodiments, the data processing method provided in the embodiments of the present disclosure further includes:
determining the data reuse relation between at least one second data block and the first data block in the conversion result matrix based on the size of the data to be processed and the size of the convolution kernel;
writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix, including:
and writing the first data block into at least one second data block in the conversion result matrix based on the data reuse relation between the at least one second data block in the conversion result matrix and the first data block.
In the process of implementing the present disclosure, the inventors found that in the process of converting data to be processed (e.g., an image or a feature) into a conversion result matrix, a corresponding relationship between a second data block and a first data block in the data to be processed exists in the obtained conversion result matrix, and each first data block corresponds to at least one second data block.
Optionally, step 1010 includes:
allocating a thread block for each first data block in the data to be processed;
and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
In the embodiment of the present disclosure, the correspondence between the first data block and the second data block is determined based on the obtained data reuse relationship, and the first data block is read by a thread block. Optionally, the data processing provided in the embodiment of the present disclosure may be implemented on a processing device; taking an image processor (GPU) as an example, a task executed on the GPU is actually an instance of a computing task called a computing grid (Grid). A computational grid contains hundreds of thousands of thread blocks, and each thread block contains hundreds of threads. To increase the processing speed of the data to be processed, the utilization rates of the thread blocks and threads must be raised, i.e., the matrix conversion of the data to be processed must be performed with an appropriate number of thread blocks. Optionally, in the embodiments of the present disclosure one thread block is allocated to each first data block; since the number of rows of the image to be processed in one channel is Q + S - 1, one channel of one image requires (Q + S - 1) thread blocks, and processing the entire data to be processed requires (Q + S - 1) × C × N thread blocks. A launch sketch follows.
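A hypothetical launch sketch of this allocation; convert_kernel and all names are assumptions, and only the grid-size arithmetic comes from the text.

// Kernel assumed defined elsewhere; its body would stage one first data
// block in shared memory and write its second data blocks.
__global__ void convert_kernel(const float *src, float *dst,
                               int C, int Q, int S, int W);

// One thread block per first data block, i.e. per image row per channel.
void launch_convert(const float *d_src, float *d_dst,
                    int N, int C, int Q, int S, int W, int T) {
    int rowsPerChannel = Q + S - 1;          // first data blocks per channel
    int numBlocks = rowsPerChannel * C * N;  // one thread block each
    size_t shmem = W * sizeof(float);        // shared memory for one image row
    convert_kernel<<<numBlocks, T, shmem>>>(d_src, d_dst, C, Q, S, W);
}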
Optionally, allocating a thread block to each first data block in the data to be processed includes:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in the thread blocks;
and each thread block in the plurality of thread blocks is allocated to the first data block corresponding to each thread block.
When a thread block is allocated to each first data block, optionally, the number of the thread block determines which first data block in the data to be processed its task corresponds to. Optionally, determining, based on the number of each thread block in the plurality of thread blocks, the corresponding first data block of each thread block in the data to be processed may include: determining, based on the number of the first thread block in the plurality of thread blocks, the number of channels C contained in each image in the data to be processed, and the number of thread blocks required by each channel, the image corresponding to the first thread block in the data to be processed, the channel in the image, and the row corresponding to the first data block in that channel. The number of thread blocks required by each channel can be expressed as (Q + S - 1). Specifically, if the thread block number is b, the corresponding image number is the integer quotient b / ((Q + S - 1) × C); the channel number within that image is (b % ((Q + S - 1) × C)) / (Q + S - 1), where % denotes the modulo operation, i.e., the remainder of b modulo ((Q + S - 1) × C) is divided by (Q + S - 1); and the row number within one channel is (b % ((Q + S - 1) × C)) % (Q + S - 1).
Optionally, the method further comprises:
and storing the read first data block into the shared memory by using the thread block allocated to the first data block.
Writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix, including:
reading at least a portion of a first data block stored in the shared memory and writing the read data to a second data block in the conversion result matrix.
When data processing is implemented on a processing device, owing to the characteristics of that device and in order to reduce latency and accelerate the convolution, the embodiment of the disclosure stores the data to be processed into at least one shared memory through the thread block allocated to each first data block; during computation, the thread block reads the data directly from the shared memory, which improves processing efficiency.
Optionally, each thread block comprises T threads;
before storing the read first data block into the shared memory by using the thread block allocated for the first data block, the method further includes:
and determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
When the size of the shared memory is W, one row of data of one channel of the image is read into the shared memory by the T threads of the thread block, each thread being responsible for reading W/T pixels. When W is not exactly divisible by T, let tmp denote the remainder of W divided by T, which in this case is not 0; the first tmp threads are then set to read W/T + 1 pixels each (W/T here denoting the integer quotient of W divided by T), and each of the remaining T - tmp threads reads W/T pixels.
Optionally, writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix includes:
and writing the data read by sliding the first data block each time into one row in each second data block corresponding to the first data block.
Optionally, for each shared memory, the feature data stored therein in vector form is read in sliding windows of length P with step 1, and the data read each time is written into one row of a second data block; once the feature data stored in all the shared memories has been read out into the result matrix, the conversion used to obtain the convolution calculation result of the feature set to be processed is complete.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 11 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. The apparatus of this embodiment may be used to implement the method embodiments of the present disclosure described above. As shown in fig. 11, the apparatus of this embodiment includes:
the data reading unit 1101 is configured to read each of a plurality of first data blocks in the data to be processed.
The data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1.
The data conversion unit 1102 is configured to write the first data block into at least one second data block corresponding to the first data block in the conversion result matrix, so as to obtain a conversion result matrix.
Wherein the at least one second data block reuses the first data block.
And the result calculating unit 1103 is configured to perform matrix multiplication on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Based on the data processing device provided by the above embodiment of the present disclosure, a first data block in data to be processed is read, and the first data block is written into at least one second data block corresponding to the first data block in a conversion result matrix, so as to obtain the conversion result matrix, and because of the correspondence between the first data block and the second data block, the speed of obtaining the conversion result matrix is increased, and the difficulty of converting convolution calculation into matrix multiplication is reduced; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Optionally, the first data block comprises one line of data of the image in one channel.
Optionally, the second data block is a data block with a size R × P, where R is the number of columns of the convolution kernel, and P is the number of horizontal sliding in each channel when the convolution kernel is used to perform a convolution operation on the data to be processed.
Optionally, the number of the at least one second data block corresponding to the first data block depends on the number of rows of the first data block corresponding to the channel to which the first data block belongs.
Optionally, if the number of rows L corresponding to the first data block is less than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the line number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of the second data blocks corresponding to the first data block is R;
and if the line number L corresponding to the first data block is larger than Q, the number of the second data blocks corresponding to the first data block is Q + S-1-L.
Optionally, the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block include at least one data block group, where each data block group includes two adjacent second data blocks in the plurality of second data blocks, and a right neighbor data block of one second data block in the data block group is a lower neighbor of another second data block in the data block group.
Optionally, the data processing apparatus provided in the embodiment of the present disclosure further includes:
the data reuse relation determining unit is used for determining the data reuse relation between at least one second data block and the first data block in the conversion result matrix based on the size of the data to be processed and the size of the convolution kernel;
a data conversion unit 1102, configured to write the first data block into at least one second data block in the conversion result matrix based on a data reuse relationship between the at least one second data block in the conversion result matrix and the first data block.
Optionally, the data reading unit 1101 is configured to allocate a thread block to each first data block in the data to be processed; and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
Optionally, the data reading unit 1101, when allocating a thread block to each first data block in the data to be processed, is configured to determine, based on the number of each thread block in the multiple thread blocks, a corresponding first data block in the data to be processed of each thread block; and each thread block in the plurality of thread blocks is allocated to the first data block corresponding to each thread block.
Optionally, when determining, based on the number of each thread block in the plurality of thread blocks, a corresponding first data block in the data to be processed of each thread block, the data reading unit 1101 is configured to determine, based on the number of the first thread block in the plurality of thread blocks, the number of channels C included in each image in the data to be processed, and the number of thread blocks required by each channel, an image corresponding to the first thread block in the data to be processed, a channel in the image, and a row corresponding to the first data block in the channel.
Optionally, the data reading unit 1101 is further configured to store the read first data block in the shared memory by using a thread block allocated to the first data block;
the data conversion unit 1102 is configured to read at least a portion of a first data block stored in the shared memory, and write the read data into a second data block in the conversion result matrix.
Optionally, each thread block comprises T threads;
the data reading unit 1101 is further configured to determine, based on the size W of the shared memory and the number T of threads, data of the first data block read by each thread of the T threads.
Optionally, the data conversion unit 1102 is configured to write the data read each time the first data block is slid into one row in each second data block corresponding to the first data block.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including a processor, where the processor includes the data processing apparatus according to any one of the above embodiments.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic device including: a memory for storing executable instructions;
and a processor for communicating with the memory to execute the executable instructions to perform the operations of the data processing method as provided by any of the above embodiments.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided for storing computer-readable instructions, which when executed perform the operations of the data processing method provided in any one of the above embodiments.
According to another aspect of the embodiments of the present disclosure, there is provided a computer program product including computer readable code, when the computer readable code runs on a device, a processor in the device executes instructions for implementing the data processing method provided by any one of the above embodiments.
According to yet another aspect of the embodiments of the present disclosure, another computer program product is provided for storing computer readable instructions which, when executed, cause a computer to perform the operations of the data processing method provided by any of the above embodiments.
The computer program product may be embodied in hardware, software, or a combination thereof. In one alternative, the computer program product is embodied as a computer storage medium; in another alternative, it is embodied as a software product, such as a Software Development Kit (SDK).
The embodiment of the disclosure further provides a data processing method and device, an electronic device, a computer storage medium, and a computer program product, wherein based on the size of the data to be processed and the size of the convolution kernel, a data reuse relationship of at least one second data block in a conversion result matrix corresponding to the data to be processed to a first data block in the data to be processed is determined, and based on the data reuse relationship of the at least one second data block in the conversion result matrix to the first data block in the data to be processed, the data to be processed is converted to obtain a conversion result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
It is to be understood that the terms "first," "second," and the like in the embodiments of the present disclosure are used for distinguishing and not limiting the embodiments of the present disclosure.
It is also understood that in the present disclosure, "plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in this disclosure is generally to be construed as one or more, unless explicitly stated otherwise or indicated to the contrary hereinafter.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
The embodiment of the disclosure also provides an electronic device, which may be a mobile terminal, a Personal Computer (PC), a tablet computer, a server, and the like. Referring now to fig. 12, shown is a schematic diagram of an electronic device 1200 suitable for use in implementing a terminal device or server of an embodiment of the disclosure. As shown in fig. 12, the electronic device 1200 includes one or more processors, communication sections, and the like, for example: one or more Central Processing Units (CPU) 1201, and/or one or more image processors (GPU) 1213, etc., which may perform various appropriate actions and processes according to executable instructions stored in a Read Only Memory (ROM) 1202 or loaded from a storage portion 1208 into a Random Access Memory (RAM) 1203. The communication portion 1212 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (InfiniBand) network card.
The processor may communicate with the read-only memory 1202 and/or the random access memory 1203 to execute an executable instruction, connect with the communication unit 1212 through the bus 1204, and communicate with other target devices through the communication unit 1212, so as to complete operations corresponding to any method provided by the embodiments of the present disclosure, for example, determine, based on the size of the data to be processed and the size of the convolution kernel, a data reuse relationship of at least one second data block in a transformation result matrix corresponding to the data to be processed with respect to a first data block in the data to be processed, perform transformation processing on the data to be processed based on the data reuse relationship of the at least one second data block in the transformation result matrix with respect to the first data block in the data to be processed, and obtain a transformation result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed. Or reading each first data block in a plurality of first data blocks in the data to be processed, and writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix to obtain a conversion result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed.
Further, the RAM 1203 may also store various programs and data necessary for the operation of the device. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. When RAM 1203 is present, ROM 1202 is an optional module: RAM 1203 stores executable instructions, or executable instructions are written into ROM 1202 at runtime, and the executable instructions cause the central processing unit 1201 to perform the operations corresponding to the above-described methods. An input/output (I/O) interface 1205 is also connected to the bus 1204. The communication unit 1212 may be integrated, or may be provided as a plurality of sub-modules (e.g., a plurality of IB network cards) connected to the bus link.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output portion 1207 including a display device such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the internet. A driver 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1210 as necessary, so that a computer program read out therefrom is mounted into the storage section 1208 as necessary.
It should be noted that the architecture shown in fig. 12 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 12 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, GPU1213 and CPU1201 may be separately provided or GPU1213 may be integrated on CPU1201, the communication part may be separately provided, or may be integrated on CPU1201 or GPU1213, etc. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product including a computer program tangibly embodied on a machine-readable medium, where the computer program includes a program code for executing a method shown in the flowchart, where the program code may include instructions corresponding to executing steps of the method provided in the embodiment of the present disclosure, for example, determining a data reuse relationship of at least one second data block in a transformation result matrix corresponding to data to be processed with respect to a first data block in the data to be processed based on a size of the data to be processed and a size of a convolution kernel, and performing transformation processing on the data to be processed based on the data reuse relationship of the at least one second data block in the transformation result matrix with respect to the first data block in the data to be processed to obtain a transformation result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed. Or reading each first data block in a plurality of first data blocks in the data to be processed, and writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix to obtain a conversion result matrix; and carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The operations of the above-described functions defined in the method of the present disclosure are performed when the computer program is executed by a Central Processing Unit (CPU) 1201.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
The description of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or to limit the disclosure to the form disclosed. Many modifications and variations will be apparent to practitioners skilled in this art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure and its various embodiments with the various modifications suited to the particular use contemplated.

Claims (37)

1. A data processing method, comprising:
determining a data reuse relation of at least one second data block in a conversion result matrix corresponding to the data to be processed to a first data block in the data to be processed based on the size of the data to be processed and the size of a convolution kernel, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
performing conversion processing on the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to a first data block in the data to be processed to obtain the conversion result matrix;
performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed;
the first data block comprises a line of data of the image in one channel;
the second data block is a data block with a size of R multiplied by P, wherein R is the number of columns of the convolution kernel, and P is the number of times of transverse sliding in each channel when the convolution kernel is used for performing convolution operation on the data to be processed;
the number of the at least one second data block corresponding to the first data block depends on the corresponding row number of the first data block in the channel to which the first data block belongs;
if the number of rows L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the line number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of the second data blocks corresponding to the first data block is R; Q is the number of times of longitudinal sliding in each channel when the convolution kernel is used for performing convolution operation on the data to be processed;
if the line number L corresponding to the first data block is larger than Q, the number of the second data blocks corresponding to the first data block is Q + S-1-L; wherein S is the number of rows of the convolution kernel.
2. The method of claim 1, wherein the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block includes at least one data block group, wherein each data block group includes two adjacent second data blocks of the plurality of second data blocks, and wherein a right neighbor data block of one second data block of the data block group is a lower neighbor of another second data block of the data block group.
3. The method of claim 1, wherein the converting the data to be processed based on the data reuse relationship of at least one second data block in the conversion result matrix to a first data block in the data to be processed to obtain the conversion result matrix comprises:
allocating a thread block to each first data block in the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to the first data block in the data to be processed;
reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block, and writing the read first data block into at least one second data block which has reuse relation to the first data block in the conversion result matrix.
4. The method of claim 3, wherein the allocating a thread block for each first data block in the to-be-processed data based on a data reuse relationship of at least one second data block in the conversion result matrix to the first data block in the to-be-processed data comprises:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in a plurality of thread blocks;
and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
5. The method of claim 4, wherein determining the corresponding first data block of each thread block in the data to be processed based on the number of each thread block in the plurality of thread blocks comprises:
and determining an image corresponding to the first thread block in the data to be processed, a channel in the image and a row corresponding to the first data block in the channel based on the serial number of the first thread block in the thread blocks, the number of channels C contained in each image in the data to be processed and the number of thread blocks required by each channel.
6. The method of claim 3, further comprising:
storing the read first data block into a shared memory by using the thread block allocated to the first data block;
the writing the read first data block into at least one second data block in a conversion result matrix having a reuse relationship with the first data block comprises:
reading at least a portion of the first data block stored in the shared memory, and writing the read data into a second data block in the conversion result matrix.
7. The method of claim 6, wherein each of the thread blocks comprises T threads;
before storing the read first data block into the shared memory by using the thread block allocated to the first data block, the method further includes:
and determining the data of the first data block read by each thread in the T threads based on the size W of the shared memory and the number T of the threads.
8. The method of any one of claims 3 to 7, wherein the reading the first data block from the data to be processed by using the thread block allocated to the first data block and writing the read first data block into at least one second data block having a reuse relationship with respect to the first data block in the transformation result matrix comprises:
and performing sliding reading on the first data block through the thread block allocated to the first data block, and writing the data read each time into one row of each second data block corresponding to the first data block.
9. A data processing method, comprising:
reading each first data block in a plurality of first data blocks in data to be processed, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix to obtain the conversion result matrix, wherein the at least one second data block reuses the first data block;
performing matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed;
the first data block comprises a line of data of the image in one channel;
the second data block is a data block with a size of R multiplied by P, wherein R is the number of columns of the convolution kernel, and P is the number of times of transverse sliding in each channel when the convolution kernel is used for performing convolution operation on the data to be processed;
the number of the at least one second data block corresponding to the first data block depends on the corresponding row number of the first data block in the channel to which the first data block belongs;
if the number of rows L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the line number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of the second data blocks corresponding to the first data block is R; Q is the number of times of longitudinal sliding in each channel when the convolution kernel is used for performing convolution operation on the data to be processed;
if the line number L corresponding to the first data block is larger than Q, the number of the second data blocks corresponding to the first data block is Q + S-1-L; wherein S is the number of rows of the convolution kernel.
10. The method of claim 9, wherein the at least one second data block is a plurality of second data blocks, and the plurality of second data blocks corresponding to the first data block includes at least one data block group, wherein each data block group includes two adjacent second data blocks of the plurality of second data blocks, and wherein a right neighbor data block of one second data block of the data block group is a lower neighbor of another second data block of the data block group.
11. The method of claim 9, wherein before writing the first data block into at least one second data block corresponding to the first data block in a transformation result matrix, obtaining the transformation result matrix, the method further comprises:
determining a data reuse relationship between the at least one second data block and the first data block in the conversion result matrix based on the size of the data to be processed and the size of the convolution kernel;
the writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix includes:
writing the first data block into the at least one second data block in the conversion result matrix based on a data reuse relationship of the at least one second data block in the conversion result matrix with the first data block.
12. The method of claim 9, wherein reading each of the plurality of first data blocks in the data to be processed comprises:
allocating a thread block to each first data block in the data to be processed;
and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
13. The method of claim 12, wherein said allocating a thread block to each of the first data blocks in the data to be processed comprises:
determining a first data block corresponding to each thread block in the data to be processed based on the number of each thread block in a plurality of thread blocks;
and allocating each thread block in the plurality of thread blocks to a first data block corresponding to each thread block.
14. The method of claim 13, wherein determining the first data block in the data to be processed corresponding to each thread block based on the number of each thread block in the plurality of thread blocks comprises:
determining the image in the data to be processed corresponding to a first thread block, the channel in the image, and the row in the channel corresponding to the first data block, based on the number of the first thread block among the plurality of thread blocks, the number of channels C contained in each image in the data to be processed, and the number of thread blocks required by each channel.
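Claims 12 to 14 map naturally onto a CUDA launch with one thread block per (image, channel, row) triple. A hedged sketch of the block-number decomposition in claim 14, assuming a 1-D grid of N*C*H blocks, H thread blocks per channel, and NCHW storage (none of which the claims fix):

```cuda
// One thread block per first data block (one image row in one channel).
// Launch sketch: im2col_rows<<<N * C * H, T, smemBytes>>>(src, dst, C, H, W);
__global__ void im2col_rows(const float* src, float* dst, int C, int H, int W) {
    int blocksPerChannel = H;                       // one block per row
    int blocksPerImage   = C * blocksPerChannel;
    int n   = blockIdx.x / blocksPerImage;          // image corresponding to this block
    int c   = (blockIdx.x % blocksPerImage) / blocksPerChannel;  // channel in the image
    int row = blockIdx.x % blocksPerChannel;        // row of the first data block
    const float* rowPtr = src + ((n * C + c) * H + row) * W;     // NCHW offset
    // ... stage rowPtr into shared memory and scatter it into the second
    //     data blocks that reuse it (see the sketches after claims 16 and 17).
    (void)dst; (void)rowPtr;                        // body elided in this sketch
}
```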
15. The method of claim 12, further comprising:
storing the read first data block into a shared memory by using the thread block allocated to the first data block;
the writing the first data block into at least one second data block corresponding to the first data block in a conversion result matrix includes:
reading at least a portion of the first data block stored in the shared memory, and writing the read data into a second data block in the conversion result matrix.
16. The method of claim 15, wherein each of the thread blocks comprises T threads;
before storing the read first data block into the shared memory by using the thread block allocated to the first data block, the method further includes:
determining, based on the size W of the shared memory and the number of threads T, the portion of the first data block to be read by each thread of the T threads.
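Claims 15 and 16 stage the row through shared memory, with each thread's share derived from the shared-memory size W and the thread count T. A minimal sketch, assuming W here counts the elements of one staged row and T = blockDim.x (the claims leave both conventions open):

```cuda
extern __shared__ float rowBuf[];   // holds one first data block (one image row)

// Cooperative, coalesced copy of a W-element row into shared memory:
// each of the T threads copies ceil(W / T) elements, strided by T (claim 16).
__device__ void stage_row(const float* rowPtr, int W) {
    int T = blockDim.x;
    int perThread = (W + T - 1) / T;        // per-thread share from W and T
    for (int i = 0; i < perThread; ++i) {
        int idx = threadIdx.x + i * T;      // stride-T access keeps loads coalesced
        if (idx < W) rowBuf[idx] = rowPtr[idx];
    }
    __syncthreads();                         // make the staged row visible block-wide
}
```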
17. The method according to any one of claims 9 to 16, wherein writing the first data block into at least one second data block corresponding to the first data block in the conversion result matrix comprises:
writing the data read in each slide over the first data block into one row of each second data block corresponding to the first data block.
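Claim 17's scatter step can be sketched by committing to one plausible layout of the conversion result matrix: (C*S*R) rows by (Q*P) columns, with the R × P second data block for (channel c, kernel row s, vertical slide q) at row band (c*S + s)*R and column band q*P. The patent does not specify this layout, so the sketch below is an assumption, not the claimed implementation:

```cuda
// Scatter one staged image row into every second data block that reuses it.
// Assumes stride 1, 0-indexed rows, and the matrix layout described above.
__device__ void scatter_row(const float* rowBuf, float* dst,
                            int c, int row, int R, int S, int P, int Q) {
    int ld = Q * P;                               // leading dimension of the matrix
    int qLo = max(0, row - S + 1);                // first vertical slide covering 'row'
    int qHi = min(Q - 1, row);                    // last vertical slide covering 'row'
    for (int q = qLo; q <= qHi; ++q) {
        int s = row - q;                          // kernel row that hits 'row' at slide q
        for (int r = 0; r < R; ++r) {             // the R slides over the row (claim 17)
            int matRow = (c * S + s) * R + r;
            for (int p = threadIdx.x; p < P; p += blockDim.x) {
                dst[matRow * ld + q * P + p] = rowBuf[r + p];   // one row per slide
            }
        }
    }
}
```

Once the transform is written out, the convolution of claim 9 reduces to a single matrix multiplication of this (C*S*R) x (Q*P) matrix with the flattened convolution kernel, which on a GPU would typically be handed to a GEMM routine such as cuBLAS.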
18. A data processing apparatus, comprising:
a data reuse relation determining unit, configured to determine a data reuse relation between at least one second data block in a conversion result matrix corresponding to data to be processed and a first data block in the data to be processed based on a size of the data to be processed and a size of a convolution kernel, where the data to be processed includes N images, each of the images includes C channels, and at least one of N and C is an integer greater than 1;
the data conversion unit is used for performing conversion processing on the data to be processed based on the data reuse relation of at least one second data block in the conversion result matrix to a first data block in the data to be processed to obtain the conversion result matrix;
the result calculation unit is used for carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed;
the first data block comprises one row of data of the image in one channel;
the second data block is a data block of size R × P, wherein R is the number of columns of the convolution kernel, and P is the number of horizontal slides per channel when a convolution operation is performed on the data to be processed by using the convolution kernel;
the number of the at least one second data block corresponding to the first data block depends on the row number of the first data block in the channel to which the first data block belongs;
if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of second data blocks corresponding to the first data block is R, wherein Q is the number of vertical slides per channel when the convolution operation is performed on the data to be processed by using the convolution kernel;
if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, wherein S is the number of rows of the convolution kernel.
19. The apparatus of claim 18, wherein the at least one second data block is a plurality of second data blocks, the plurality of second data blocks corresponding to the first data block includes at least one data block group, each data block group includes two adjacent second data blocks of the plurality of second data blocks, and the data block to the right of one second data block in the data block group is the data block below the other second data block in the group.
20. The apparatus of claim 18, wherein the data conversion unit is configured to allocate a thread block to each first data block in the data to be processed based on a data reuse relationship between at least one second data block in the conversion result matrix and the first data block in the data to be processed; read the first data block from the data to be processed by using the thread block allocated to the first data block; and write the read first data block into at least one second data block in the conversion result matrix that reuses the first data block.
21. The apparatus of claim 20, wherein, when allocating a thread block to each first data block in the data to be processed based on the data reuse relationship between the plurality of second data blocks in the conversion result matrix and the first data block in the data to be processed, the data conversion unit is configured to determine, based on the number of each thread block in a plurality of thread blocks, the first data block in the data to be processed corresponding to that thread block, and to allocate each thread block in the plurality of thread blocks to the first data block corresponding to that thread block.
22. The apparatus according to claim 21, wherein, when determining the first data block in the data to be processed corresponding to each thread block based on the number of each thread block in the plurality of thread blocks, the data conversion unit is configured to determine the image in the data to be processed corresponding to a first thread block, the channel in the image, and the row in the channel corresponding to the first data block, based on the number of the first thread block among the plurality of thread blocks, the number of channels C contained in each image in the data to be processed, and the number of thread blocks required by each channel.
23. The apparatus of claim 20, further comprising:
a data reading unit, configured to store the read first data block in a shared memory by using a thread block allocated to the first data block;
the data conversion unit is configured to read at least a portion of the first data block stored in the shared memory, and write the read data into a second data block in the conversion result matrix.
24. The apparatus of claim 23, wherein each of the thread blocks comprises T threads;
the data reading unit is further configured to determine, based on the size W of the shared memory and the number of threads T, the portion of the first data block to be read by each thread of the T threads.
25. The apparatus according to any one of claims 20 to 24, wherein, when reading the first data block from the data to be processed by using the thread block allocated to the first data block and writing the read first data block into at least one second data block in the conversion result matrix that reuses the first data block, the data conversion unit is configured to perform a sliding read over the first data block by using the thread block allocated to the first data block, and to write the data read in each slide into one row of each second data block corresponding to the first data block.
26. A data processing apparatus, comprising:
the data reading unit is used for reading each first data block in a plurality of first data blocks in data to be processed, wherein the data to be processed comprises N images, each image comprises C channels, and at least one of N and C is an integer greater than 1;
a data conversion unit, configured to write the first data block into at least one second data block corresponding to the first data block in a conversion result matrix, to obtain the conversion result matrix, where the at least one second data block reuses the first data block;
the result calculation unit is used for carrying out matrix multiplication operation on the conversion result matrix and the convolution kernel to obtain a convolution result of the data to be processed;
the first data block comprises one row of data of the image in one channel;
the second data block is a data block of size R × P, wherein R is the number of columns of the convolution kernel, and P is the number of horizontal slides per channel when a convolution operation is performed on the data to be processed by using the convolution kernel;
the number of the at least one second data block corresponding to the first data block depends on the row number of the first data block in the channel to which the first data block belongs;
if the row number L corresponding to the first data block is smaller than the number of columns R of the convolution kernel, the number of second data blocks corresponding to the first data block is L;
if the row number L corresponding to the first data block is greater than or equal to R and less than or equal to Q, the number of second data blocks corresponding to the first data block is R, wherein Q is the number of vertical slides per channel when the convolution operation is performed on the data to be processed by using the convolution kernel;
if the row number L corresponding to the first data block is greater than Q, the number of second data blocks corresponding to the first data block is Q + S - 1 - L, wherein S is the number of rows of the convolution kernel.
27. The apparatus of claim 26, wherein the at least one second data block is a plurality of second data blocks, the plurality of second data blocks corresponding to the first data block includes at least one data block group, each data block group includes two adjacent second data blocks of the plurality of second data blocks, and the data block to the right of one second data block in the data block group is the data block below the other second data block in the group.
28. The apparatus of claim 26, further comprising:
a data reuse relation determining unit, configured to determine a data reuse relation between the at least one second data block and the first data block in the conversion result matrix based on a size of data to be processed and a size of the convolution kernel;
the data conversion unit is configured to write the first data block into the at least one second data block in the conversion result matrix based on a data reuse relationship between the at least one second data block in the conversion result matrix and the first data block.
29. The apparatus according to claim 26, wherein the data reading unit is configured to allocate a thread block to each of the first data blocks in the data to be processed; and reading the first data block from the data to be processed by utilizing the thread block allocated to the first data block.
30. The apparatus according to claim 29, wherein, when allocating a thread block to each first data block in the data to be processed, the data reading unit is configured to determine, based on the number of each thread block in a plurality of thread blocks, the first data block in the data to be processed corresponding to that thread block, and to allocate each thread block in the plurality of thread blocks to the first data block corresponding to that thread block.
31. The apparatus according to claim 30, wherein, when determining the first data block in the data to be processed corresponding to each thread block based on the number of each thread block in the plurality of thread blocks, the data reading unit is configured to determine the image in the data to be processed corresponding to a first thread block, the channel in the image, and the row in the channel corresponding to the first data block, based on the number of the first thread block among the plurality of thread blocks, the number of channels C contained in each image in the data to be processed, and the number of thread blocks required by each channel.
32. The apparatus according to claim 29, wherein the data reading unit is further configured to store the read first data block in a shared memory by using a thread block allocated to the first data block;
the data conversion unit is configured to read at least a portion of the first data block stored in the shared memory, and write the read data into a second data block in the conversion result matrix.
33. The apparatus of claim 32, wherein each of said thread blocks comprises T threads;
the data reading unit is further configured to determine, based on the size W of the shared memory and the number of threads T, the portion of the first data block to be read by each thread of the T threads.
34. The apparatus according to any one of claims 26 to 33, wherein the data conversion unit is configured to write the data read in each slide over the first data block into one row of each second data block corresponding to the first data block.
35. An electronic device, comprising a processor including the data processing apparatus of any one of claims 18 to 34.
36. An electronic device, comprising: a memory for storing executable instructions;
and a processor in communication with the memory for executing the executable instructions to perform the operations of the data processing method of any one of claims 1 to 17.
37. A computer-readable storage medium storing computer-readable instructions that, when executed, perform the operations of the data processing method of any of claims 1 to 17.
CN201910164371.6A 2019-03-05 2019-03-05 Data processing method and device, electronic equipment and storage medium Active CN109885407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910164371.6A CN109885407B (en) 2019-03-05 2019-03-05 Data processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109885407A CN109885407A (en) 2019-06-14
CN109885407B (en) 2021-09-21

Family

ID=66930763

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910164371.6A Active CN109885407B (en) 2019-03-05 2019-03-05 Data processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109885407B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11861485B2 (en) * 2019-11-22 2024-01-02 Baidu Usa Llc Data format transform method to improve AI engine MAC utilization
CN111125617A (en) * 2019-12-23 2020-05-08 中科寒武纪科技股份有限公司 Data processing method, data processing device, computer equipment and storage medium
CN111597029B (en) * 2020-05-20 2024-03-22 上海商汤智能科技有限公司 Data processing method and device, electronic equipment and storage medium
CN112927124A (en) * 2021-03-31 2021-06-08 成都商汤科技有限公司 Data processing method, device, equipment and storage medium
CN115878072A (en) * 2021-09-26 2023-03-31 中科寒武纪科技股份有限公司 Computing device and method for executing binary operation of multidimensional data and related products
CN116304750B (en) * 2023-05-19 2023-08-18 北京算能科技有限公司 Data processing method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8422821B2 (en) * 2008-12-19 2013-04-16 International Business Machines Corporation Selectively transforming a multi-dimensional array
US8340458B2 (en) * 2011-05-06 2012-12-25 Siemens Medical Solutions Usa, Inc. Systems and methods for processing image pixels in a nuclear medicine imaging system
CN104077233B (en) * 2014-06-18 2017-04-05 百度在线网络技术(北京)有限公司 Multichannel convolutive layer treating method and apparatus
CN107729989B (en) * 2017-07-20 2020-12-29 安徽寒武纪信息科技有限公司 Device and method for executing artificial neural network forward operation
CN107993186B (en) * 2017-12-14 2021-05-25 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108171327A (en) * 2017-12-25 2018-06-15 郑州云海信息技术有限公司 A kind of matrix method for transformation, device and medium based on convolution algorithm

Similar Documents

Publication Publication Date Title
CN109885407B (en) Data processing method and device, electronic equipment and storage medium
US8854383B2 (en) Pixel value compaction for graphics processing
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
Kong et al. Accelerating MATLAB image processing toolbox functions on GPUs
TW201942808A (en) Deep learning accelerator and method for accelerating deep learning operations
JP6713036B2 (en) Method and apparatus for performing a convolution operation on folded feature data
US20170206089A1 (en) Information processing apparatus and computational method
CN106251392A (en) For the method and apparatus performing to interweave
CN114026569A (en) Extended convolution using systolic arrays
WO2018214769A1 (en) Image processing method, device and system
KR20100112162A (en) Methods for fast and memory efficient implementation of transforms
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN112991142B (en) Matrix operation method, device, equipment and storage medium for image data
CN112162854A (en) Method, system and medium for scheduling calculation tasks between CPU-GPU
JP6907700B2 (en) Information processing device, multi-thread matrix operation method, and multi-thread matrix operation program
Steinbach et al. Accelerating batch processing of spatial raster analysis using GPU
JP2022550170A (en) Method, device, medium and equipment for computer to realize computation of tensor data
Wu et al. Image autoregressive interpolation model using GPU-parallel optimization
CN111898081A (en) Convolution operation method and convolution operation device
KR101688435B1 (en) Apparatus and Method of Generating Integral Image using Block Structure
US10387997B2 (en) Information processing device, information processing method, and storage medium
CN111884658A (en) Data decompression method, data compression method and convolution operation device
Wu et al. From coarse-to fine-grained implementation of edge-directed interpolation using a GPU
CN108765259B (en) Hyperspectral image RATGP and ROSP parallel optimization method based on GPU
CN117473212B (en) GPU acceleration method, device, equipment and storage medium of NTT algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant