CN115114575B - Vector processor-oriented image-to-matrix row conversion method, device and medium - Google Patents
- Publication number
- CN115114575B (application number CN202211043942.9A)
- Authority
- CN
- China
- Prior art keywords
- vector
- matrix
- target image
- image
- vector processor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/14—Handling requests for interconnection or transfer
- G06F13/20—Handling requests for interconnection or transfer for access to input/output bus
- G06F13/28—Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Image Processing (AREA)
Abstract
The application discloses a vector processor-oriented image-to-matrix row conversion method, device and medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring a target image and storing the target image into the DDR; calling a DMA operation to load the target image from the DDR to the AM space; performing vectorized image-to-matrix row conversion (im2row) on the target image by using a Load component and a Store component of the vector processor, and acquiring a converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. Because the im2row operation is performed inside the vector processor and the DMA operation transfers data in the form of a multi-dimensional array, the memory bandwidth of the vector processor can be used effectively and the performance of the im2row operation can be improved.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a medium for converting an image into a matrix row for a vector processor.
Background
The image-to-matrix row conversion operation (im2row) is one of the important operations for implementing general convolution in deep learning. Efficiently converting an image of any size into matrix rows on a vector processor can effectively improve general convolution performance on the vector processor and greatly expand the convolution operations and neural network types that the vector processor supports.
Image-to-matrix row conversion is typically a memory-intensive operation, so optimizing memory access performance is critical to speeding up im2row. On a vector processor there are generally two implementation methods: the first performs the conversion directly through scalar operations on the vector processor; the second uses the direct memory access (DMA) operation of the vector processor to perform the conversion directly on the double data rate synchronous dynamic random access memory (DDR) external to the vector processor. The performance of a DMA operation is directly related to the size of the block it transfers; generally speaking, the larger the block, the more efficiently the DMA component can exploit the memory bandwidth of the vector processor. The first method is equivalent to a DMA operation with a block size of 1 element. For the second method, the block size is typically 1 element or L elements, where L represents the data width that the vector processing unit processes in parallel. In both common implementations the DMA block size is therefore small, so the memory access performance of the vector processor cannot be used effectively.
Therefore, the problem to be solved by those skilled in the art is how to implement im2row efficiently so as to improve the performance of the image-to-matrix row conversion operation on the vector processor.
Disclosure of Invention
The application aims to provide an image-to-matrix row conversion method, an image-to-matrix row conversion device and a medium for a vector processor, which are used for improving the image-to-matrix row conversion operation performance of the vector processor.
In order to solve the above technical problem, the present application provides a method for converting an image into a matrix row for a vector processor, including:
acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space;
performing vectorization conversion processing of an image to a matrix row on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and invoking the direct memory access operation to store the converted matrix from the array memory space back into the double-rate synchronous dynamic random access memory.
Preferably, the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
and calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space.
Preferably, the blocking the target image so as to obtain each first block includes:
blocking the target image according to the array memory space parameters so as to obtain each first block; the array memory space parameters at least comprise index parameters corresponding to all dimensions.
Preferably, the performing of vectorization conversion processing of an image to a matrix row on the target image by using the vector reading component and the vector writing component of the vector processor and acquiring a converted matrix comprises:
blocking the first partition in the array memory space to obtain second partitions;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a set of vector registers using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
and acquiring the converted matrix.
Preferably, said blocking said first partition in said array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
Preferably, the width of the second partition is less than or equal to the total number of vector registers.
Preferably, before the vector writing unit of the vector processor stores sub-matrix data corresponding to each of the second blocks in the vector register set in a corresponding position to form the converted matrix, the method further includes:
under the condition that the load index corresponding to the current dimension is smaller than the size of the current dimension, copying in sequence the sub-matrix data stored in the last vector registers of the vector register set, as many registers as the width overlap of two adjacent second partitions, to the same number of vector registers at the front of the vector register set;
sequentially loading the sub-matrix data corresponding to each second partition into the last m vector registers in the vector register set by using the vector reading component of the vector processor, and entering the step of storing the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position by using the vector writing component of the vector processor to form the converted matrix; wherein m is the difference between the width of the second partition and the width overlap of two adjacent second partitions.
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
an acquisition module for acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space;
the conversion processing module is used for carrying out vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and the storage module is used for calling the direct memory access operation to store the converted matrix from the array memory space back to the double-rate synchronous dynamic random access memory.
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the image-to-matrix row conversion method facing the vector processor when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above image-to-matrix row conversion method for a vector processor.
The vector processor-oriented image-to-matrix row conversion method provided by the application comprises the following steps: acquiring a target image and storing the target image into the DDR; calling a DMA operation to load the target image from the DDR to the AM space; performing vectorization conversion processing of an image to a matrix row on the target image by using a Load component and a Store component of the vector processor and acquiring a converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. In the prior methods, the im2row operation is performed directly on the DDR through scalar operations or DMA operations, and the DMA block size is one element or L elements. In the method provided by the application, the im2row operation is performed inside the vector processor and the DMA operation transfers data between the DDR and the AM in the form of a multi-dimensional array. The transfer block size of the DMA operation is thereby significantly increased and the number of DMA operations is greatly reduced, so that the memory bandwidth of the vector processor can be used effectively and the performance of the im2row operation is significantly improved.
In addition, the application also provides an image-to-matrix row conversion device facing the vector processor and a computer readable storage medium, which correspond to the above mentioned method for converting the image facing the vector processor into the matrix row, and the effects are the same.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a block diagram of a vector processor according to an embodiment of the present application;
Fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application;
Fig. 3 is a schematic diagram of the blocking provided in an embodiment of the present application;
Fig. 4 is a flowchart of a method for implementing, by the Load/Store components of a vector processor, the vectorization conversion processing from an input image to matrix rows in the AM space according to an embodiment of the present application;
Fig. 5 is a flowchart of a method for completing the conversion provided by an embodiment of the present application;
Fig. 6 is a flowchart of a method for updating the data of a vector register set according to an embodiment of the present application;
Fig. 7 is a block diagram of an image-to-matrix row conversion apparatus for a vector processor according to an embodiment of the present application;
Fig. 8 is a block diagram of an image-to-matrix row conversion apparatus for a vector processor according to another embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a vector processor-oriented image-to-matrix row conversion method, device and medium, which are used for improving the performance of the image-to-matrix row conversion operation on the vector processor.
In order that those skilled in the art may better understand the solution of the present application, the present application is described in further detail below with reference to the accompanying drawings. Fig. 1 is a block diagram of a vector processor according to an embodiment of the present application. As shown in Fig. 1, the vector processor includes a Scalar Processing Unit (SPU) that performs scalar operations, a Vector Processing Unit (VPU) that performs vector operations, a Direct Memory Access (DMA) component that is responsible for data transfer, and the like. The SPU is composed of a Scalar Processing Element (SPE) and a Scalar Memory (SM). The VPU is composed of J Vector Processing Elements (VPEs) and an Array Memory (AM). The J VPEs operate cooperatively in a Single Instruction Multiple Data (SIMD) manner; individual VPEs can be switched off and on, but data interaction between VPEs is not supported. A single VPE can process one 8-byte datum (e.g., FP64, Int64) or two 4-byte data (e.g., FP32, Int32) at a time. The DMA component is responsible for data transfer between the SM and the DDR and between the AM and the DDR, and its minimum operation granularity is also 8 bytes. Fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application. As shown in Fig. 2, the method includes:
S1: acquiring a target image and storing the target image into the DDR.
The target image is an image on which convolution calculation needs to be performed. The input feature map (input image) data matrix of the convolution calculation is stored in the DDR in a multi-dimensional data layout whose dimensions include: the batch size N; the height and width of the input feature map; the data width L processed in parallel by the vector processing unit (L = J when the data type is FP64 or Int64, and L = 2J when the data type is FP32 or Int32); and the number of channel blocks of the input feature map, from which, together with L, the number of channels of the input feature map follows. The convolution kernel used in the convolution calculation has a height and a width, the stride is S, and a padding size is applied. The final result matrix of the im2row operation is likewise stored in the DDR in its own data layout, which includes the height and width of the output feature map of the convolution calculation. These are computed by formulas (1) and (2) respectively, in which the floor brackets denote rounding down.
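Formulas (1) and (2) compute the output height and width of the convolution. A minimal sketch of the commonly used form of that calculation is given below, assuming symmetric padding on both sides of each axis; the function and parameter names are hypothetical and are not taken from the patent.

```python
def conv_output_size(in_size: int, kernel: int, stride: int, pad: int) -> int:
    """Output extent along one axis: floor((in_size + 2*pad - kernel) / stride) + 1.

    Assumes the same padding `pad` on both sides of the axis (an assumption
    of this sketch, not something stated in the source text).
    """
    if stride <= 0:
        raise ValueError("stride must be positive")
    span = in_size + 2 * pad - kernel
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # Integer floor division implements the rounding down.
    return span // stride + 1
```

The same function would be applied once with the input height and kernel height and once with the input width and kernel width.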
S2: calling a DMA operation to load the target image from the DDR to the AM space.
In order to improve the memory access performance of the vector processor, the im2row operation that was previously performed off chip is moved on chip; to this end, the DMA operation is called to load the target image from the DDR to the AM space.
S3: performing vectorization conversion processing of an image to a matrix row on a target image by using a Load component and a Store component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor.
Since the Load component (i.e., vector read component) and the Store component (i.e., vector write component) of the vector processor can implement the vectorization conversion process of the image into matrix rows, im2row operation is performed on the AM space by using the Load component and the Store component, resulting in a converted matrix.
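For orientation, a scalar reference version of the image-to-matrix row conversion is sketched below. It is a simplified illustration only: it assumes a single-channel 2D image with no padding, and it uses neither the blocked data layout nor the vector Load/Store components described in this application. The function name is hypothetical.

```python
def im2row(image, k_h, k_w, stride=1):
    """Return one row per output position.

    Each row holds the k_h x k_w input patch under the kernel at that
    position, flattened in row-major order.
    """
    h, w = len(image), len(image[0])
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            y0, x0 = oy * stride, ox * stride
            rows.append([image[y0 + dy][x0 + dx]
                         for dy in range(k_h)
                         for dx in range(k_w)])
    return rows
```

With the rows laid out this way, a convolution reduces to multiplying the resulting matrix by the flattened kernel, which is why im2row matters for general convolution.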
S4: calling the DMA operation to store the converted matrix from the AM space back into the DDR.
In order to improve the memory access performance of the vector processor, the target image is loaded from the DDR to the AM space through the DMA operation; after the im2row operation is performed in the AM space, the converted matrix needs to be stored from the AM space back to the DDR. This store is likewise accomplished by calling the DMA operation.
The vector processor-oriented image-to-matrix row conversion method provided by this embodiment comprises the following steps: acquiring a target image and storing the target image into the DDR; calling a DMA operation to load the target image from the DDR to the AM space; performing vectorization conversion processing of an image to a matrix row on the target image by using a Load component and a Store component of the vector processor and acquiring a converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. In the prior methods, the im2row operation is performed directly on the DDR through scalar operations or DMA operations, and the DMA block size is one element or L elements. In the method provided by this embodiment, the im2row operation is performed inside the vector processor and the DMA operation transfers data between the DDR and the AM in the form of a multi-dimensional array. The transfer block size of the DMA operation is thereby significantly increased and the number of DMA operations is greatly reduced, so that the memory bandwidth of the vector processor can be used effectively and the performance of the im2row operation is significantly improved.
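The effect of the transfer block size on the number of DMA operations can be illustrated with a small calculation. This is a simplified sketch with a hypothetical function name; it only counts transfers and ignores per-transfer setup cost, strides and alignment on real hardware.

```python
import math


def dma_transfer_count(total_elements: int, block_elements: int) -> int:
    """Number of DMA operations needed to move total_elements in blocks."""
    if block_elements <= 0:
        raise ValueError("block_elements must be positive")
    return math.ceil(total_elements / block_elements)
```

For example, moving 4096 elements takes 4096 operations at one element per block, 256 operations at 16 elements per block, and 4 operations at 1024 elements per block.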
In implementation, in order to load the target image from the DDR to the AM space, preferably, the method includes the following steps:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first partition;
and calling DMA operation to load the sub-matrix data corresponding to each first partition from the DDR to the corresponding position of the AM space.
Since the AM space is limited, the target image needs to be loaded from the DDR to the AM space as a plurality of first blocks. The AM space has a fixed size. Each original block of the input feature map that is transferred into the AM space has its own label and size; after being transferred into the AM space it is expanded, and the expanded block again has its own label and size; the im2row result produced from it also has its own label and size. Because the AM space is limited, the total space occupied by the expanded input block and the im2row result must not exceed the AM space size. This constraint is expressed as an inequality in which t represents the number of bytes of memory occupied by each element in the convolution calculation; for example, t = 4 if the convolution calculation uses single-precision floating point numbers (FP32). From this constraint, the block sizes with which the im2row operation transfers data into the AM space are determined.
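A capacity check of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: a single-channel block with no padding, an element size of 4 bytes by default, and hypothetical names throughout. It is not the inequality of the application.

```python
def fits_in_am(input_elems: int, output_elems: int,
               elem_bytes: int, am_bytes: int) -> bool:
    """True if the input block plus its im2row result fit in the AM space."""
    return (input_elems + output_elems) * elem_bytes <= am_bytes


def largest_block_height(width: int, k_h: int, k_w: int, am_bytes: int,
                         elem_bytes: int = 4, stride: int = 1) -> int:
    """Largest input-block height whose block and im2row result fit in AM.

    Returns 0 if even the smallest block (height k_h) does not fit.
    """
    best = 0
    height = k_h
    while True:
        out_h = (height - k_h) // stride + 1
        out_w = (width - k_w) // stride + 1
        in_elems = height * width
        out_elems = out_h * out_w * k_h * k_w
        if not fits_in_am(in_elems, out_elems, elem_bytes, am_bytes):
            return best
        best = height
        height += 1
```

The search is linear for clarity; the point is only that the block size is the largest one for which both buffers fit in the AM space at once.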
The embodiment of the application provides a method for loading one block of the target image from the DDR to the AM space by calling the DMA operation. The steps of the method use an index in the N dimension and, in each of two further dimensions, take the block size as the minimum of two numbers. The method ends with:
S220: invoking a direct memory access (DMA) operation to load the sub-matrix data of the block, of the size so obtained, from its position in the DDR into the corresponding position in the on-chip AM space.
As an example with specific parameter values, Table 1 shows part of the data assumed to be held in the AM space at this point.
When the target image is loaded from the DDR to the AM space by calling the DMA operation as provided by this embodiment, the target image is divided into a plurality of first blocks, so that the entire target image can be loaded to the AM space block by block; through this blocking, the vector processor can load each block from the DDR to the AM space in order.
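Taking the minimum of two numbers when sizing a block is the usual way to handle the last, possibly shorter, block along a dimension. A minimal sketch of that pattern follows; the function name and the one-dimensional view are assumptions made for illustration.

```python
def block_ranges(dim_size: int, block_size: int):
    """Yield (start, length) pairs that cover [0, dim_size) in order.

    Every block has length block_size except possibly the last one,
    whose length is min(block_size, remaining elements).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, dim_size, block_size):
        yield start, min(block_size, dim_size - start)
```

A dimension of size 10 blocked by 4 gives blocks starting at 0, 4 and 8 with lengths 4, 4 and 2.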
In implementation, when loading the target image from the DDR to the AM space, in order to reasonably block the target image, it is preferable that the blocking the target image so as to obtain each first block includes:
blocking the target image according to the AM space parameters so as to obtain first blocks; the AM space parameters at least comprise index parameters corresponding to all dimensions.
When the target image is blocked, it can be blocked reasonably according to the parameters of the AM space, which at least include the index parameter corresponding to each dimension. In the method of the above embodiment for loading one block of the target image from the DDR to the AM space by calling the DMA operation, the block sizes in two of the dimensions are obtained from the index parameters corresponding to those dimensions.
The target image is loaded from DDR to the AM space according to the parameters on the AM space, so that the target image can be reasonably partitioned.
In implementation, in order to facilitate im2row operation on the first partition, it is preferable that performing vectorization conversion processing of an image of a target image into matrix rows using a Load component and a Store component of a vector processor and acquiring a converted matrix includes:
partitioning the first partition in an AM space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each second partition into a vector register set by using a Load component of the vector processor;
storing, by a Store component of the vector processor, sub-matrix data corresponding to each second partition in the vector register set at a corresponding location to form a converted matrix;
and acquiring the converted matrix.
The first block in the AM space is divided into blocks along one dimension. Each such block is called a line input window, WinInput, and has a given width; two adjacent line input windows overlap by a given width. Each line input window corresponds to a block of the convolution output in the corresponding dimension, whose size is determined by the window width. The window width must be less than or equal to the total number of available vector registers in the vector processor, and the corresponding output block size must be an integer, which is expressed as a remainder condition that must hold at the same time. The window width is determined from these constraints.
For a set of example parameter values, the resulting blocks are shown in Fig. 3, which is a schematic diagram of the blocking provided in an embodiment of the present application. The first line input window runs from a0 to a5, the second line input window runs from a4 to a9, and a4 and a5 form the overlap between the first and second line input windows. Fig. 4 is a flowchart of a processing method for implementing, by the Load/Store components of a vector processor, the vectorization conversion from an input image to matrix rows in the AM space according to an embodiment of the present application. As shown in Fig. 4, the method includes:
S34: using the Load component of the vector processor, sequentially loading the sub-matrix data of the given size at the given position of the block input feature map in the AM space into the vector register set.
For example, with specific parameter and index values, the data of the given size is loaded into the vector registers VR0, VR1, VR2, VR3, VR4 and VR5 in one pass; the result is shown in Table 2.
As can be seen from table 2, the data in the vector registers VR0, VR1, VR2, VR3, VR4, VR5 are a10, a11, a12, a13, a14, a15, respectively.
S35: storing the data in the vector register set to the corresponding position by using the Store component of the vector processor.
The embodiment of the application provides a method for storing the data in the vector register set to the corresponding position, the method comprising:
S3510: judging whether the index of this loop level is less than its limit; if yes, returning to step S3506; if not, executing step S3511;
S3512: judging whether the index of this loop level is less than its limit; if yes, returning to step S3503; if not, executing step S3513;
S3514: judging whether the value is greater than or equal to 0; if yes, jumping to step S3502; otherwise, executing step S3515;
S3516: judging whether the index of this loop level is less than its limit; if yes, executing step S3517; if not, jumping to step S3518;
S3517: updating the data of the vector register set and returning to step S3501;
S3519: judging whether the index of this loop level is less than its limit; if yes, going to step S33 of Fig. 4; if not, executing step S41;
S41: invoking a direct memory access (DMA) operation to store the sub-matrix data of the given size at the given position in the AM space to the corresponding position in the double-rate synchronous dynamic random access memory.
For example, with specific parameter and index values, after the data loaded into VR0, VR1, VR2, VR3, VR4 and VR5 are stored to the corresponding position, part of the resulting data is shown in Table 3. It should be noted that "-" in Table 3 indicates data irrelevant to this store operation.
Since im2row operation is performed on each partition in sequence in the present application, after the converted current matrix is stored from the AM space back to the DDR, it is necessary to determine whether processing is completed on each partition. Fig. 5 is a flowchart of a method for completing a conversion according to an embodiment of the present application. As shown in fig. 5, the method includes:
S42: completing the conversion of the current block;
S46: judging whether the index of this loop level is less than its limit; if yes, returning to step S204; otherwise, executing step S47;
S48: judging whether the index of this loop level is less than its limit; if yes, returning to step S203; otherwise, executing step S49;
S50: judging whether the index in the N dimension is less than N; if yes, returning to step S202; otherwise, executing step S51;
S51: the conversion is completed.
In the method provided by this embodiment for performing vectorization conversion processing of an image to a matrix row on the target image by using the Load component and the Store component of the vector processor and acquiring a converted matrix, the data of each first block is blocked again and the im2row operation is performed on each resulting block, so that the memory access performance of the vector processor is greatly improved.
In implementation, in order to be able to more accurately perform blocking again on the first partition, it is a preferred embodiment that blocking the first partition in the AM space so as to obtain the second partition includes:
and partitioning the first partition in the AM space according to the AM space parameters and the convolution calculation parameters so as to obtain a second partition.
The second partitions, and how the line input window corresponding to each second partition is obtained, have been described in the above embodiment; therefore, the method of partitioning the first partition again in the AM space according to the AM space parameters and the convolution calculation parameters is not described again here.
In this embodiment, the im2row operation is performed on the second partitions obtained by blocking each first partition again, which improves the memory access performance of the vector processor.
In implementation, since the number of vector registers that can be used in vector processing is limited, in order to enable the second partition to be processed by the available vector registers, it is preferred that the width of the second partition is less than or equal to the total number of vector registers.
The width of the second partition provided by the present embodiment is less than or equal to the total number of vector registers, so that the data of the second partition can be processed by the vector register bank.
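The division of a row into overlapping line input windows, together with the register-count limit on the window width, can be sketched as follows. The window width 6 and overlap 2 correspond to the a0 to a5 and a4 to a9 example described earlier; the function name and the register count used in the usage note are hypothetical.

```python
def line_input_windows(row_width: int, win_width: int,
                       overlap: int, num_vregs: int):
    """Return half-open (start, end) column ranges of the line input windows.

    Adjacent windows share `overlap` columns; the last window may be
    narrower if the row width is not reached exactly.
    """
    if win_width > num_vregs:
        raise ValueError("window width exceeds the number of vector registers")
    if not 0 <= overlap < win_width:
        raise ValueError("overlap must satisfy 0 <= overlap < win_width")
    step = win_width - overlap  # new columns contributed by each later window
    windows = []
    start = 0
    while start + overlap < row_width:
        end = min(start + win_width, row_width)
        windows.append((start, end))
        if end == row_width:
            break
        start += step
    return windows
```

With a row of 10 columns, a window width of 6, an overlap of 2 and an assumed 16 vector registers, the sketch yields the two windows [0, 6) and [4, 10), matching a0 to a5 and a4 to a9.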
When updating the data of the vector register, the preferred embodiment further includes, before storing, by the Store component of the vector processor, the sub-matrix data corresponding to each second partition in the vector register set in the corresponding position to form the converted matrix:
under the condition that the load index corresponding to the current dimension is smaller than the size of the current dimension, copying in sequence the sub-matrix data stored in the last vector registers of the vector register set, as many registers as the width overlap of two adjacent second partitions, to the same number of vector registers at the front of the vector register set;
sequentially loading the sub-matrix data corresponding to each second partition into the last m vector registers in the vector register set by using the Load component of the vector processor, and then storing the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position by using the Store component of the vector processor to form the converted matrix; wherein m is the difference between the width of the second partition and the width overlap of two adjacent second partitions.
In implementation, the specific process for updating the data of the vector register group in step S3517 is shown in Fig. 6. Fig. 6 is a flowchart of a method for updating the data of a vector register group according to an embodiment of the present application. The method includes:
S35171: copying, in sequence, the data of the last vector registers of the vector register group to the first vector registers, the number of registers copied being the width overlap of two adjacent second partitions.
As an example, the result of copying the data in the last vector registers to the first vector registers is shown in Table 4.
As seen from Table 4, the data in the vector registers VR0, VR1, VR2, VR3, VR4, and VR5 are a14, a15, a12, a13, a14, and a15, respectively.
S35172: loading, in sequence, the sub-matrix data at the corresponding position of the input image into the last m vector registers of the vector register group by using the Load component of the vector processor.
Continuing the example, the result of loading the sub-matrix data of the input image, in sequence, into the last m vector registers of the vector register group is shown in Table 5.
Table 5: Data of the updated vector register group
As shown in Table 5, the data in the vector registers VR0, VR1, VR2, VR3, VR4, and VR5 are a14, a15, a16, a17, a18, and a19, respectively; that is, the data in the vector register group has been updated.
The data updating method provided by the embodiment enables the data of each second partition to be processed, and improves the access performance of the vector processor.
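The two update steps (copy the overlapping registers to the front, then load only the new columns) can be sketched in Python. This is a minimal model, not the patent's implementation: the register group is a plain list, the function name is hypothetical, and the starting contents a10 to a15 are an assumption chosen so that the intermediate and final states match the values quoted for Table 4 and Table 5.

```python
def update_register_group(registers, overlap, new_columns):
    """Model of the register update; returns (state after copy, final state).

    registers   -- current contents of the vector register group (list)
    overlap     -- width overlap of two adjacent second partitions
    new_columns -- the m = len(registers) - overlap newly loaded columns
    """
    width = len(registers)
    m = width - overlap
    if len(new_columns) != m:
        raise ValueError("expected exactly width - overlap new columns")
    regs = list(registers)
    # Step 1: copy the last `overlap` registers to the first `overlap` ones.
    for i in range(overlap):
        regs[i] = regs[width - overlap + i]
    after_copy = list(regs)
    # Step 2: load the m new columns into the remaining registers.
    for j in range(m):
        regs[overlap + j] = new_columns[j]
    return after_copy, regs


if __name__ == "__main__":
    start = ["a10", "a11", "a12", "a13", "a14", "a15"]  # assumed start state
    copied, final = update_register_group(
        start, 2, ["a16", "a17", "a18", "a19"])
    print(copied)
    print(final)
```

Because the overlapping columns are moved between registers rather than reloaded, only m columns are read from memory per update, which is where the reduction in memory accesses comes from.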
In the above embodiments, the method for converting an image into a matrix row is described in detail, and the present application also provides a corresponding embodiment of the vector processor-oriented image-to-matrix row conversion apparatus. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 7 is a block diagram of a vector processor-oriented image-to-matrix row conversion apparatus according to an embodiment of the present application. Described from the perspective of functional modules, the apparatus includes:
the acquisition module 10 is used for acquiring a target image and storing the target image into the DDR;
the loading module 11 is configured to invoke a DMA operation to load a target image from a DDR to an AM space;
a conversion processing module 12, configured to perform vectorization conversion processing on the target image from an image to a matrix row by using a Load component and a Store component of the vector processor, and obtain a converted matrix; wherein, the target image exists in the form of multi-dimensional array in the vector processor;
and the storage module 13 is used for calling DMA operation to store the converted matrix from the AM space back to the DDR.
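The flow through modules 11 to 13 can be sketched as a plain-Python reference model. It is only an illustration under stated assumptions: DDR and the AM space are modelled as nested lists, the DMA transfer is a deep copy, a single 2-D channel without padding is used, and the conversion is a scalar im2row rather than the vectorized Load/Store version. All function names are hypothetical.

```python
import copy


def dma_copy(src):
    """Stand-in for a DMA transfer: returns an independent copy of the data."""
    return copy.deepcopy(src)


def im2row(image, kernel_h, kernel_w, stride=1):
    """Scalar reference im2row for one 2-D channel.

    Each output position becomes one matrix row holding the kernel_h * kernel_w
    input values of its window, in row-major order.
    """
    height, width = len(image), len(image[0])
    out_h = (height - kernel_h) // stride + 1
    out_w = (width - kernel_w) // stride + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = []
            for ky in range(kernel_h):
                y = oy * stride + ky
                x0 = ox * stride
                row.extend(image[y][x0:x0 + kernel_w])
            rows.append(row)
    return rows


def convert(ddr_image, kernel_h, kernel_w, stride=1):
    am_image = dma_copy(ddr_image)                            # loading module 11
    am_matrix = im2row(am_image, kernel_h, kernel_w, stride)  # conversion module 12
    return dma_copy(am_matrix)                                # storage module 13


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    for row in convert(image, 2, 2):
        print(row)
```

A reference model like this is mainly useful for checking the output of a blocked, vectorized implementation against a known-good result.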
Since the embodiment of the apparatus portion and the embodiment of the method portion correspond to each other, please refer to the description of the embodiment of the method portion for the embodiment of the apparatus portion, and details are not repeated here.
Because the vector processor-oriented image-to-matrix row conversion apparatus of this embodiment has technical features corresponding to those of the vector processor-oriented image-to-matrix row conversion method described above, it provides the same beneficial effects as that method.
Fig. 8 is a block diagram of a vector processor-oriented image-to-matrix row conversion apparatus according to another embodiment of the present application. This embodiment is described from a hardware perspective. As shown in Fig. 8, the apparatus includes:
a memory 20 for storing a computer program;
a processor 21 for implementing, when executing the computer program, the steps of the vector processor-oriented image-to-matrix row conversion method described in the above embodiments.
The image-to-matrix row conversion apparatus for vector processors provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 21 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor: the main processor, also called a Central Processing Unit (CPU), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for handling computing operations related to machine learning.
The memory 20 may include one or more computer-readable storage media, which may be non-transitory. The memory 20 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 20 is used at least for storing a computer program 201 which, when loaded and executed by the processor 21, implements the relevant steps of the image-to-matrix row conversion method disclosed in any one of the foregoing embodiments. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and may be stored transiently or permanently. The operating system 202 may include Windows, Unix, Linux, and the like. The data 203 may include, but is not limited to, the data involved in the vector processor-oriented image-to-matrix row conversion method mentioned above.
In some embodiments, the vector processor-oriented image-to-matrix row conversion device may further include a display screen 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the architecture shown in Fig. 8 does not limit the vector processor-oriented image-to-matrix row conversion apparatus, which may include more or fewer components than those shown.
The vector processor-oriented image-to-matrix row conversion apparatus provided by this embodiment of the application comprises a memory and a processor. When the processor executes the program stored in the memory, the vector processor-oriented image-to-matrix row conversion method described above can be implemented, with the same effects as described above.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present application, in whole or in part, may be embodied in the form of a software product that is stored in a storage medium and performs all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The computer-readable storage medium provided by the present application stores a program implementing the vector processor-oriented image-to-matrix row conversion method described above, and its effects are the same as described above.
The present application provides a vector processor-oriented image-to-matrix row conversion method, apparatus, and medium. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is brief, and the relevant points can be found in the description of the method. It should be noted that those skilled in the art may make several improvements and modifications to the present application without departing from its principle, and those improvements and modifications also fall within the protection scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
Claims (7)
1. A vector processor-oriented image-to-matrix row conversion method, comprising:
obtaining a target image and storing the target image in a double rate synchronous dynamic random access memory (DDR SDRAM), wherein the stored data layout is defined by the batch size, the height and width of the input feature map, the width of data processed in parallel by the vector processing unit, and the number of channel blocks of the input feature map;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space; wherein, after being transferred into the array memory space, the data is expanded through a padding operation, and the padded data is characterized by the padded height and the padded width of the target image in the array memory space;
performing vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
invoking the direct memory access operation to store the converted matrix from the array memory space back into the double rate synchronous dynamic random access memory;
the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first block from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space;
the performing image-to-matrix row vectorization conversion processing on the target image by using a vector reading part and a vector writing part of a vector processor and acquiring a converted matrix comprises:
partitioning the first partition in the array memory space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
obtaining the converted matrix;
before the vector writing unit of the vector processor stores sub-matrix data corresponding to each of the second blocks in the vector register set in a corresponding position to form the converted matrix, the method further includes:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, copying, in sequence, the sub-matrix data stored in the last vector registers of the vector register set to the first vector registers of the vector register set, wherein the number of vector registers copied is the width overlap of two adjacent second partitions;
sequentially loading the sub-matrix data corresponding to each second partition into m vector registers of the vector register set by using the vector reading part of the vector processor, and then proceeding to the step of storing, by the vector writing part of the vector processor, the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position to form the converted matrix; wherein m is the difference between the width of the second partition and the width overlap.
2. The vector processor-oriented image-to-matrix row conversion method of claim 1, wherein the blocking the target image to obtain first blocks comprises:
blocking the target image according to the array memory space parameters so as to obtain each first block; wherein the array memory space parameters at least comprise index parameters corresponding to each dimension.
3. The vector processor-oriented image-to-matrix row conversion method of claim 2, wherein the blocking the first partition in the array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
4. The vector processor-oriented image-to-matrix row conversion method of claim 3, wherein the width of the second partition is less than or equal to the total number of vector registers.
5. An image-to-matrix row conversion apparatus for a vector processor, comprising:
an acquisition module for acquiring a target image and storing the target image in a double-rate synchronous dynamic random access memory (DDR SDRAM), wherein the stored data layout is defined by the batch size, the height and width of the input feature map, the width of data processed in parallel by the vector processing unit, and the number of channel blocks of the input feature map;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space; wherein, after being transferred into the array memory space, the data is expanded through a padding operation, and the padded data is characterized by the padded height and the padded width of the target image in the array memory space;
a conversion processing module for performing image-to-matrix row vectorization conversion processing on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
a storage module for invoking the direct memory access operation to store the converted matrix from the array memory space back into the double-rate synchronous dynamic random access memory;
the invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to a corresponding position of the array memory space;
the vectorization processing of the target image to a matrix row by using the vector reading part and the vector writing part of the vector processor and obtaining the converted matrix comprises:
partitioning the first partition in the array memory space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
obtaining the converted matrix;
before the vector writing unit of the vector processor stores sub-matrix data corresponding to each of the second blocks in the vector register set in a corresponding position to form the converted matrix, the method further includes:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, copying, in sequence, the sub-matrix data stored in the last vector registers of the vector register set to the first vector registers of the vector register set, wherein the number of vector registers copied is the width overlap of two adjacent second partitions;
sequentially loading the sub-matrix data corresponding to each second partition into m vector registers of the vector register set by using the vector reading part of the vector processor, and then proceeding to the step of storing, by the vector writing part of the vector processor, the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position to form the converted matrix; wherein m is the difference between the width of the second partition and the width overlap.
6. An image-to-matrix row conversion apparatus for a vector processor, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the vector processor-oriented image-to-matrix row conversion method as claimed in any one of claims 1 to 4 when executing said computer program.
7. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the vector processor oriented image-to-matrix row conversion method of any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211043942.9A CN115114575B (en) | 2022-08-30 | 2022-08-30 | Vector processor-oriented image-to-matrix row conversion method, device and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115114575A (en) | 2022-09-27 |
CN115114575B (en) | 2023-01-31 |
Family
ID=83335344
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211043942.9A Active CN115114575B (en) | 2022-08-30 | 2022-08-30 | Vector processor-oriented image-to-matrix row conversion method, device and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115114575B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110796235A (en) * | 2019-10-21 | 2020-02-14 | 中国人民解放军国防科技大学 | Vectorization implementation method for Valid convolution of convolutional neural network |
WO2020059156A1 (en) * | 2018-09-18 | 2020-03-26 | Nec Corporation | Data processing system, method, and program |
CN113806261A (en) * | 2021-10-09 | 2021-12-17 | 中国人民解放军国防科技大学 | Pooling vectorization implementation method for vector processor |
CN114330669A (en) * | 2021-12-30 | 2022-04-12 | 中国人民解放军国防科技大学 | Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system |
CN114329326A (en) * | 2021-12-10 | 2022-04-12 | 中国人民解放军国防科技大学 | Low-bit-width data matrix vectorization column expansion method and system of vector processor |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11417373B2 (en) * | 2020-12-09 | 2022-08-16 | Micron Technology, Inc. | Neuromorphic computing devices and methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||