CN115114575B - Vector processor-oriented image-to-matrix row conversion method, device and medium - Google Patents

Info

Publication number: CN115114575B
Authority: CN (China)
Prior art keywords: vector, matrix, target image, image, vector processor
Legal status: Active
Application number: CN202211043942.9A
Other languages: Chinese (zh)
Other versions: CN115114575A
Inventor
王庆林
廖林玉
尹尚飞
梅松竹
许金伟
李东升
姜晶菲
苏华友
李荣春
符永铨
刘杰
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Application filed by National University of Defense Technology
Priority application: CN202211043942.9A
Publication of application: CN115114575A
Grant publication: CN115114575B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 - Handling requests for interconnection or transfer
    • G06F 13/20 - Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/60 - Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a vector processor-oriented image-to-matrix row conversion method, device and medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion (im2row) on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. With this method of performing the im2row operation inside the vector processor, data are transmitted in the form of multi-dimensional arrays by the DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation improved.

Description

Vector processor-oriented image-to-matrix row conversion method, device and medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a medium for converting an image into a matrix row for a vector processor.
Background
The image-to-matrix row conversion operation (im2row) is one of the important operations in the general convolution implementations used in deep learning. Efficiently realizing the conversion of an image of arbitrary size into matrix rows on a vector processor can effectively improve general convolution performance on that processor, and greatly expands the convolution operations and neural network types the vector processor supports.
The image-to-matrix row conversion is typically a memory-intensive operation, so optimizing memory-access performance in an im2row implementation is critical to speeding it up. On a vector processor there are generally two implementation methods: the first performs the conversion directly through scalar operations on the vector processor; the second performs the conversion directly on the Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, hereafter DDR) external to the vector processor, via Direct Memory Access (DMA) operations. The performance of a DMA operation is directly related to the size of the block it transfers: generally speaking, the larger the block, the more effectively the DMA component can exploit the memory bandwidth of the vector processor. For the first method, each scalar operation is equivalent to a DMA transfer of 1 element. For the second method, the block size is typically 1 element or L elements, where L is the data width that the vector processing unit processes in parallel. In both common implementations the DMA block size is therefore small, and the memory-access performance of the vector processor cannot be exploited effectively.
Therefore, the problem to be solved by those skilled in the art is how to implement im2row efficiently, so as to improve the performance of the image-to-matrix row conversion operation on the vector processor.
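For reference, the conversion itself can be written down in a few lines of NumPy. This is only an illustrative scalar-style definition of im2row with an NHWC layout (function and parameter names are ours), not the vectorized method of the application:

```python
import numpy as np

def im2row(x, kh, kw, stride=1, pad=0):
    """Naive im2row: one output matrix row per output pixel (NHWC layout)."""
    n, h, w, c = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (0, 0)))  # zero padding
    ho = (h + 2 * pad - kh) // stride + 1
    wo = (w + 2 * pad - kw) // stride + 1
    out = np.empty((n, ho * wo, kh * kw * c), dtype=x.dtype)
    for b in range(n):
        for i in range(ho):
            for j in range(wo):
                # Flatten the kh x kw x c patch under the convolution window.
                patch = xp[b, i * stride:i * stride + kh,
                              j * stride:j * stride + kw, :]
                out[b, i * wo + j] = patch.ravel()
    return out
```

Each output row is the flattened K_h × K_w × C patch under one position of the convolution window, which is exactly the matrix layout general convolution multiplies against.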
Disclosure of Invention
The aim of the application is to provide a vector processor-oriented image-to-matrix row conversion method, device and medium for improving the performance of the image-to-matrix row conversion operation on a vector processor.
In order to solve the above technical problem, the present application provides a method for converting an image into a matrix row for a vector processor, including:
acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space;
performing vectorization conversion processing of an image to a matrix row on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and invoking the direct memory access operation to store the converted matrix from the array memory space back into the double-rate synchronous dynamic random access memory.
Preferably, the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
and calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space.
Preferably, the blocking the target image so as to obtain each first block includes:
blocking the target image according to the array memory space parameters so as to obtain each first block; the array memory space parameters at least comprise index parameters corresponding to all dimensions.
Preferably, performing the vectorized image-to-matrix-row conversion on the target image using the vector reading component and the vector writing component of the vector processor and acquiring the converted matrix comprises the following steps:
blocking the first tile in the array memory space to obtain a second tile;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a set of vector registers using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
and acquiring the converted matrix.
Preferably, said blocking said first partition in said array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
Preferably, the width of the second partition is less than or equal to the total number of vector registers.
Preferably, before the vector writing component of the vector processor stores the sub-matrix data corresponding to each of the second blocks in the vector register set at the corresponding position to form the converted matrix, the method further includes:
in the case that the load index corresponding to the current dimension is smaller than the size of the current dimension, copying the sub-matrix data stored in the last o vector registers of the vector register set, in order, into the first o vector registers of the vector register set, where o is the width of the overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second block into the last m vector registers of the vector register set using the vector reading component of the vector processor, and entering the step of storing the sub-matrix data corresponding to each second block in the vector register set at the corresponding position by the vector writing component of the vector processor to form the converted matrix; where m is the difference between the width of the second block and o.
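The register-reuse step just described can be modeled in plain Python (a sketch; `regs` stands for the vector register set and `overlap` for the width overlap o of two adjacent second blocks):

```python
def slide_window_registers(regs, next_data, overlap):
    """Model one slide of the line input window across a vector register set.

    The last `overlap` registers of the previous window are copied, in
    order, to the front of the set (reusing data already on chip), and the
    remaining m = len(regs) - overlap registers are refilled with newly
    loaded data instead of being reloaded element by element.
    """
    m = len(regs) - overlap
    assert len(next_data) == m, "expected exactly m newly loaded vectors"
    return regs[-overlap:] + list(next_data)
```

With the a0..a9 example used later in the description, sliding from the window a0..a5 to a4..a9 copies two registers and loads only four new ones.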
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
the device comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space;
the conversion processing module is used for carrying out vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and the storage module is used for calling the direct memory access operation to store the converted matrix from the array memory space back to the double-rate synchronous dynamic random access memory.
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the image-to-matrix row conversion method facing the vector processor when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above image-to-matrix row conversion method for a vector processor.
The vector processor-oriented method for converting an image into matrix rows comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. Compared with the prior methods, which perform the im2row operation directly on the DDR through scalar operations or DMA operations whose transfers are one element or L elements each, the method provided by the application performs the im2row operation inside the vector processor: data are transmitted between the DDR and the AM in the form of multi-dimensional arrays by DMA operations, which significantly increases the transfer block size of each DMA operation and greatly reduces the number of DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation is significantly improved.
In addition, the application also provides an image-to-matrix row conversion device facing the vector processor and a computer readable storage medium, which correspond to the above mentioned method for converting the image facing the vector processor into the matrix row, and the effects are the same.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a block diagram of a vector processor according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application;
FIG. 3 is a block diagram provided in an embodiment of the present application;
fig. 4 is a flowchart of a vectorization processing method for implementing vectorization conversion processing from an input image to a matrix row in an AM space by a Load/Store component of a vector processor according to an embodiment of the present application;
FIG. 5 is a flow chart of a method for completing a transformation provided by an embodiment of the present application;
FIG. 6 is a flowchart of a method for updating data of a vector register set according to an embodiment of the present application;
FIG. 7 is a block diagram of an image-to-matrix row conversion apparatus for a vector processor according to an embodiment of the present application;
fig. 8 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to another embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a vector processor-oriented method, a vector processor-oriented device and a vector processor-oriented medium for converting an image into a matrix row, and the vector processor-oriented medium is used for improving the performance of the image-to-matrix row conversion on the vector processor.
In order that those skilled in the art will better understand the disclosure, a detailed description is given below with reference to the accompanying drawings. Fig. 1 is a block diagram of a vector processor according to an embodiment of the present application. As shown in fig. 1, the vector processor includes a Scalar Processing Unit (SPU) that performs scalar operations, a Vector Processing Unit (VPU) that performs vector operations, a Direct Memory Access (DMA) component that is responsible for data transfer, and so on. The SPU is composed of a Scalar Processing Element (SPE) and a Scalar Memory (SM). The VPU is composed of J Vector Processing Elements (VPEs) and an Array Memory (AM); the J VPEs operate cooperatively in a Single Instruction Multiple Data (SIMD) manner and support turning specific VPEs off and on, but do not support data interaction between VPEs. A single VPE can process one 8-byte element (e.g., FP64, Int64) or two 4-byte elements (e.g., FP32, Int32) at a time. The DMA component is responsible for data transfer between SM and DDR and between AM and DDR, with a minimum operation granularity of 8 bytes. Fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application; as shown in fig. 2, the method includes:
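The relation between the number of VPEs and the parallel data width L used throughout the description (L = J for 64-bit types, L = 2J for 32-bit types) follows directly from the 8-byte lane width of a VPE. A small sketch (the value J = 16 in the test is an arbitrary example, not taken from the application):

```python
def parallel_width(num_vpes, elem_bytes):
    """Parallel data width L of the VPU: each VPE handles 8 bytes at a time,
    i.e. one 64-bit element or two 32-bit elements."""
    if elem_bytes not in (4, 8):
        raise ValueError("expected a 4-byte or 8-byte element type")
    return num_vpes * (8 // elem_bytes)
```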
s1: and acquiring a target image and storing the target image into the DDR.
The target image is an image that requires convolution calculation. The input feature map (input image) data matrix of the convolution calculation is stored in the DDR and labeled I, with data layout I[N][H][W][C_b][L], where N denotes the batch size, H and W denote the height and width of the input feature map, L denotes the data width processed in parallel by the vector processing unit (L = J when the data type is FP64 or Int64; L = 2J when the data type is FP32 or Int32), and C_b denotes the number of channel blocks of the input feature map, the number of channels of the input feature map being C_b × L. The height and width of the convolution kernel in the convolution calculation are K_h and K_w respectively, the step size is S, and the fill (padding) size is P. Thus, the final result matrix of the im2row operation is stored in the DDR, labeled O, with data layout O[N][H_o × W_o][K_h × K_w × C_b × L], where H_o and W_o denote the height and width of the output feature map of the convolution calculation; their calculation formulas are shown as formulas (1) and (2), where ⌊·⌋ denotes rounding down.

H_o = ⌊(H + 2P - K_h) / S⌋ + 1    (1)

W_o = ⌊(W + 2P - K_w) / S⌋ + 1    (2)
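Formulas (1) and (2) can be checked with a small helper (names are ours):

```python
import math

def conv_out_size(extent, kernel, stride, pad):
    """Output extent of one convolution dimension, per formulas (1)/(2):
    floor((extent + 2*pad - kernel) / stride) + 1."""
    return math.floor((extent + 2 * pad - kernel) / stride) + 1
```

For example, H = 224 with K_h = 3, S = 1, P = 1 gives H_o = 224 (the "same size" case).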
S2: and calling DMA operation to load the target image from the DDR to the AM space.
In order to improve the memory-access performance of the vector processor, the im2row operation, which would otherwise be performed off-chip, is converted into an operation performed on-chip: a DMA operation is called to load the target image from the DDR into the AM space.
S3: performing vectorization conversion processing of an image to a matrix row on a target image by using a Load component and a Store component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor.
Since the Load component (i.e., vector read component) and the Store component (i.e., vector write component) of the vector processor can implement the vectorization conversion process of the image into matrix rows, im2row operation is performed on the AM space by using the Load component and the Store component, resulting in a converted matrix.
S4: the DMA operation is invoked to store the translated matrix from the AM space back into the DDR.
In order to improve the access performance of the vector processor, the target image is loaded from the DDR to the AM space through the DMA operation, and after im2row operation is performed on the AM space, the converted matrix needs to be stored from the AM space back to the DDR. In storing from AM space back to DDR, this is still accomplished by invoking DMA operations.
The vector processor-oriented image-to-matrix row conversion method provided by this embodiment comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. Compared with the prior methods, which perform the im2row operation directly on the DDR through scalar operations or DMA operations whose transfers are one element or L elements each, the method provided by this embodiment performs the im2row operation inside the vector processor: data are transmitted between the DDR and the AM in the form of multi-dimensional arrays by DMA operations, which significantly increases the transfer block size of each DMA operation and greatly reduces the number of DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation is significantly improved.
In implementation, in order to load the target image from the DDR to the AM space, preferably, the method includes the following steps:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first partition;
and calling DMA operation to load the sub-matrix data corresponding to each first partition from the DDR to the corresponding position of the AM space.
Since the AM space is limited, the target image needs to be loaded from the DDR to the AM space as a plurality of first blocks. Mark the AM space size as M_AM; mark the original input feature-map block transmitted into the AM space each time as I_b, with size h_b × w_b; after being transmitted into the AM space, it is expanded (zero-padded) into a block marked I_b'; and mark the im2row result it produces as O_b. Due to the limitation of the AM space size, the total space occupied by I_b' and O_b must not exceed M_AM, namely:

(|I_b'| + |O_b|) × t ≤ M_AM,

where |·| denotes the number of elements and t denotes the number of bytes of memory occupied by each element in the convolution calculation (t = 4 if the convolution calculation uses single-precision floating point FP32). The block sizes h_b and w_b of the im2row operation transmitted into the AM space are thereby determined.
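One way to read the capacity constraint is as a search for the largest block whose padded copy plus im2row output still fit in AM. The sizing expressions below are our assumption (the precise expressions appear as figures in the original); the sketch only illustrates the inequality (|I_b'| + |O_b|) × t ≤ M_AM:

```python
def fits_in_am(hb, wb, cb, L, kh, kw, pad, stride, am_bytes, t=4):
    """Check whether one candidate block plus its im2row output fit in AM.

    Assumed sizing: the padded block holds (hb+2*pad)*(wb+2*pad)*cb*L
    elements, and its im2row result holds one kh*kw*cb*L row per output
    pixel of the block.
    """
    padded = (hb + 2 * pad) * (wb + 2 * pad) * cb * L
    ho = (hb + 2 * pad - kh) // stride + 1
    wo = (wb + 2 * pad - kw) // stride + 1
    out = ho * wo * kh * kw * cb * L
    return (padded + out) * t <= am_bytes

def largest_square_block(cb, L, kh, kw, pad, stride, am_bytes, t=4, hmax=1024):
    """Largest square block edge h_b = w_b satisfying the AM constraint."""
    best = 0
    for s in range(kh, hmax + 1):
        if fits_in_am(s, s, cb, L, kh, kw, pad, stride, am_bytes, t):
            best = s
    return best
```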
the embodiment of the application provides a method for loading one block in a target image from a DDR to an AM space by calling DMA operation. The method comprises the following steps:
s201: initialization
Figure 304014DEST_PATH_IMAGE038
Where N represents an index in the N dimension;
s202: initialization
Figure 142657DEST_PATH_IMAGE039
Wherein
Figure 252827DEST_PATH_IMAGE039
To represent
Figure 518723DEST_PATH_IMAGE040
An index in a dimension;
s203: initialization
Figure 596400DEST_PATH_IMAGE041
Wherein
Figure 238734DEST_PATH_IMAGE042
To represent
Figure 967525DEST_PATH_IMAGE043
An index in a dimension;
s204: computing
Figure 404322DEST_PATH_IMAGE043
Dimension actual block size
Figure 703717DEST_PATH_IMAGE044
Wherein min represents the minimum of the two numbers;
s205: initialization
Figure 884162DEST_PATH_IMAGE045
Wherein
Figure 965994DEST_PATH_IMAGE046
To represent
Figure 573693DEST_PATH_IMAGE047
A start index in a dimension;
s206: judgment of
Figure 625963DEST_PATH_IMAGE046
If yes, go to step S207; if not, go to step S208;
s207: initialization
Figure 344520DEST_PATH_IMAGE048
And make an order
Figure 47903DEST_PATH_IMAGE049
The flow advances to step S209;
wherein
Figure 623241DEST_PATH_IMAGE050
Represent
Figure 366069DEST_PATH_IMAGE051
An index of (a);
s208: initialization
Figure 685055DEST_PATH_IMAGE052
S209: initialization
Figure 993676DEST_PATH_IMAGE053
To represent
Figure 693910DEST_PATH_IMAGE054
End index in dimension, min represents the minimum value of two numbers;
s210: calculating out
Figure 455193DEST_PATH_IMAGE055
Representing an incoming AM space from a DDR
Figure 781132DEST_PATH_IMAGE056
A block size in dimension;
s211: initialization
Figure 944260DEST_PATH_IMAGE057
Wherein
Figure 313930DEST_PATH_IMAGE058
To represent
Figure 828088DEST_PATH_IMAGE059
An index in a dimension;
s212: calculating out
Figure 692139DEST_PATH_IMAGE059
Dimension actual block size
Figure 709773DEST_PATH_IMAGE060
Wherein min represents the minimum of the two numbers;
s213: initialization
Figure 14459DEST_PATH_IMAGE061
Wherein
Figure 750334DEST_PATH_IMAGE062
Represent
Figure 418076DEST_PATH_IMAGE063
A start index in a dimension;
s214: judgment of
Figure 290217DEST_PATH_IMAGE064
If not, the process goes to step S215; if not, go to step S216;
s215: initialization
Figure 267269DEST_PATH_IMAGE065
And make an order
Figure 490440DEST_PATH_IMAGE066
The flow advances to step S217;
wherein
Figure 696294DEST_PATH_IMAGE067
Represent
Figure 688520DEST_PATH_IMAGE068
An index in a dimension;
s216: initialization
Figure 72359DEST_PATH_IMAGE069
S217: initialization
Figure 517247DEST_PATH_IMAGE070
To represent
Figure 526792DEST_PATH_IMAGE071
A dimensional end index;
s218: computing
Figure 639104DEST_PATH_IMAGE072
Representing an incoming AM space from a DDR
Figure 692380DEST_PATH_IMAGE071
A block size in dimension;
s219: in AM is
Figure 890143DEST_PATH_IMAGE073
All elements of the space are all initialized to 0.
S220: invoking a Direct Memory Access (DMA) operation will
Figure 437799DEST_PATH_IMAGE074
The size of the position is
Figure 404618DEST_PATH_IMAGE075
Into on-chip AM space
Figure 127330DEST_PATH_IMAGE076
At the location.
To be provided with
Figure 812389DEST_PATH_IMAGE077
Figure 163736DEST_PATH_IMAGE078
Figure 985061DEST_PATH_IMAGE079
Figure 645719DEST_PATH_IMAGE080
For example, table 1 shows the AM space
Figure 552495DEST_PATH_IMAGE081
Assuming that it is in AM space at this time
Figure 707533DEST_PATH_IMAGE081
The partial data of (a) are as follows:
TABLE 1 in AM space
Figure 648944DEST_PATH_IMAGE081
Part of data of
Figure 450809DEST_PATH_IMAGE082
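Steps S201 to S220 amount to clamping the padded block window to the image bounds, zeroing the destination block, and copying the in-bounds region in one transfer. A NumPy sketch of that per-block load for a single channel plane follows; the index arithmetic is reconstructed from the step descriptions, so treat it as an assumption:

```python
import numpy as np

def load_block_padded(img, ih, iw, hb, wb, pad):
    """Copy one (hb x wb) block of `img` into a zero-initialized padded
    buffer, clamping the window to the image bounds (cf. steps S204-S220)."""
    H, W = img.shape
    hb2 = min(hb, H - ih)                                      # S204
    wb2 = min(wb, W - iw)                                      # S212
    hs, ph = (0, pad - ih) if ih - pad < 0 else (ih - pad, 0)  # S205-S208
    ws, pw = (0, pad - iw) if iw - pad < 0 else (iw - pad, 0)  # S213-S216
    he = min(ih + hb2 + pad, H)                                # S209
    we = min(iw + wb2 + pad, W)                                # S217
    buf = np.zeros((hb2 + 2 * pad, wb2 + 2 * pad), dtype=img.dtype)  # S219
    buf[ph:ph + (he - hs), pw:pw + (we - ws)] = img[hs:he, ws:we]    # S220
    return buf
```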
When the target image is loaded from the DDR into the AM space by calling the DMA operation as provided by this embodiment, the target image is divided into a plurality of first blocks, so that the whole target image can be loaded into the AM space block by block; through this blocking, the vector processor can load each block from the DDR into the AM space in order.
In implementation, when loading the target image from the DDR to the AM space, in order to reasonably block the target image, it is preferable that the blocking the target image so as to obtain each first block includes:
blocking the target image according to the AM space parameters so as to obtain first blocks; the AM space parameters at least comprise index parameters corresponding to all dimensions.
When the target image is blocked, it can be blocked reasonably according to the parameters of the AM space, which at least include the index parameters corresponding to each dimension. The procedure shown in the above embodiment for loading one block of the target image from the DDR into the AM space by calling the DMA operation blocks the image according to the index parameters corresponding to the H and W dimensions, and finally obtains the block sizes in the H and W dimensions.
The target image is loaded from DDR to the AM space according to the parameters on the AM space, so that the target image can be reasonably partitioned.
In implementation, in order to facilitate the im2row operation on the first blocks, preferably, performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix includes:
partitioning the first partition in an AM space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each second partition into a vector register set by using a Load component of the vector processor;
storing, by a Store component of the vector processor, sub-matrix data corresponding to each second partition in the vector register set at a corresponding location to form a converted matrix;
and acquiring the converted matrix.
The padded input I' is blocked along the W' dimension. Each block is called a line input window WinInput, and the width of each block is denoted W_win. Two adjacent line input windows overlap by o columns of width, where o is determined by the convolution parameters. The block size, in the output-width dimension, of the convolution output corresponding to one line input window is denoted W_out_win; its value is determined by W_win, o and the convolution stride. In view of the limitations that W_win must be less than or equal to the total number of available vector registers in the vector processor and that W_out_win must be an integer (a condition expressed with the remainder operation, where mod denotes the remainder), the value of W_win is thereby determined.
Taking the running example with the line input window width W_win taken as 6, the blocks are as shown in fig. 3. Fig. 3 is a block diagram provided in an embodiment of the present application. Columns a0 to a5 form the first line input window, columns a4 to a9 form the second line input window, and a4 and a5 are the overlapping portion of the first and second line input windows. Fig. 4 is a flowchart of a processing method for implementing vectorization conversion from an input image to matrix rows in the AM space by the Load/Store components of a vector processor according to an embodiment of the present application. As shown in fig. 4, the method includes:
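The window layout of fig. 3 can be reproduced numerically. A short Python sketch, assuming (as inferred from the example, where width-6 windows share 2 columns) that the start-to-start step of adjacent windows is the window width minus the overlap; the helper name `window_starts` is hypothetical:

```python
def window_starts(total_width, w_win, overlap):
    """Start column of each line input window: adjacent windows share
    `overlap` columns, so starts advance by w_win - overlap."""
    step = w_win - overlap
    starts = []
    s = 0
    while s + w_win <= total_width:
        starts.append(s)
        s += step
    return starts

cols = [f"a{i}" for i in range(10)]          # a0 .. a9
windows = [cols[s:s + 6] for s in window_starts(10, w_win=6, overlap=2)]
# first window a0..a5, second window a4..a9; a4 and a5 are shared
```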
S31: block I' along the W' dimension and determine the value of the line input window width W_win;
S32: initialize the first blocking index to 0;
S33: initialize the second blocking index to 0;
S34: use the Load component of the vector processor to sequentially load the sub-matrix data of the corresponding size at the current position of the blocked input feature map in the AM space into the vector register set.
Taking the running example with W_win = 6: when both blocking indices are at their initial value 0, sub-matrix data of the corresponding size is loaded into the vector registers VR0, VR1, VR2, VR3, VR4, VR5 at one time; the result is shown in Table 2 below.

TABLE 2 Data loaded into the vector register set at one time
Register: VR0  VR1  VR2  VR3  VR4  VR5
Data:     a10  a11  a12  a13  a14  a15

As can be seen from Table 2, the data in the vector registers VR0, VR1, VR2, VR3, VR4, VR5 are a10, a11, a12, a13, a14, a15, respectively.
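The one-shot load of Table 2 can be modeled with a Python list standing in for VR0 to VR5. This is a toy model of the data movement, not the processor's actual Load semantics:

```python
W_WIN = 6                                   # line input window width
feature_row = [f"a1{j}" for j in range(8)]  # a10, a11, ..., a17 in AM space

# The Load component fills one register per window column in sequence.
vr = [None] * W_WIN
for k in range(W_WIN):
    vr[k] = feature_row[k]
# vr now holds a10 .. a15, matching Table 2
```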
S35: storing data in a vector register set by a Store component of a vector processor
Figure 661340DEST_PATH_IMAGE106
The corresponding position of (a).
The embodiment of the application provides a method for storing data in a vector register set
Figure 687196DEST_PATH_IMAGE106
The method of (a), the method comprising:
S3501: initialize to 0 the index within the line input window, i.e. the index in the W_win dimension;
S3502: initialize to 0 the index in the next blocking dimension;
S3503: compute the corresponding index in the output matrix;
S3504: judge whether the storing condition holds; if yes, execute step S3505, otherwise jump to step S3511;
S3505: initialize to 0 a further dimension index;
S3506: compute the index in the W_out_win dimension;
S3507: judge whether that index is within range; if yes, go to step S3508; if not, jump to step S3509;
S3508: store, by the Store component of the vector processor, the data of the current vector register to the corresponding position of the output matrix;
S3509: increment the index of step S3505;
S3510: judge whether the index of step S3505 is less than the size of its dimension; if yes, return to step S3506, and if no, execute step S3511.
S3511: increment the index within the line input window;
S3512: judge whether the index within the line input window is less than W_win; if yes, return to step S3503, otherwise execute step S3513;
S3513: decrement the index of step S3502;
S3514: judge whether the index of step S3502 is greater than or equal to 0; if yes, jump to step S3502, otherwise execute step S3515;
S3515: update the load index in the corresponding dimension;
S3516: judge whether the load index is less than the size of that dimension; if yes, go to step S3517; if not, jump to step S3518;
S3517: update the data of the vector register set and return to step S3501;
S3518: increment the second blocking index;
S3519: judge whether the second blocking index is less than the size of its dimension; if yes, go to step S33 of fig. 4, i.e. re-initialize the second blocking index; if not, execute step S41.
S41: invoke the direct memory access (DMA) operation to store the converted sub-matrix of the corresponding size at the corresponding position of the AM space back to the corresponding location of the double rate synchronous dynamic random access memory.
Taking the running example with W_win = 6: when the loop indices take their initial values, the data loaded to VR0, VR1, VR2, VR3, VR4 and VR5 are stored to the corresponding positions of the output matrix. Part of the resulting data of the output matrix is shown in Table 3; it should be noted that "-" in Table 3 indicates data irrelevant to this storing operation.

TABLE 3 Partial data of the output matrix after storing the vector register set (entries marked "-" are irrelevant to this storing operation)
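The storing steps S3501 to S3519 essentially scatter the registers of each line input window into rows of the converted matrix. A minimal Python sketch of that scatter for one window, assuming kernel width 3 and stride 1 (hypothetical values consistent with the running example; `store_window` is an invented name):

```python
def store_window(vr, kw, sw):
    """Scatter one line input window held in registers vr into rows of the
    converted matrix: output row r receives registers r*sw .. r*sw+kw-1."""
    n_out = (len(vr) - kw) // sw + 1
    return [vr[r * sw : r * sw + kw] for r in range(n_out)]

rows = store_window(["a10", "a11", "a12", "a13", "a14", "a15"], kw=3, sw=1)
# 4 output rows: [a10,a11,a12], [a11,a12,a13], [a12,a13,a14], [a13,a14,a15]
```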
Since im2row operation is performed on each partition in sequence in the present application, after the converted current matrix is stored from the AM space back to the DDR, it is necessary to determine whether processing is completed on each partition. Fig. 5 is a flowchart of a method for completing a conversion according to an embodiment of the present application. As shown in fig. 5, the method includes:
S42: complete the conversion of the current block;
S43: increment the window index;
S44: judge whether the window index is less than the number of line input windows; if yes, go back to step S212; if not, execute step S45;
S45: increment the next blocking index;
S46: judge whether that index is less than the size of its dimension; if yes, return to step S204, otherwise execute step S47;
S47: increment the following blocking index;
S48: judge whether that index is less than the size of its dimension; if yes, return to step S203, otherwise execute step S49;
S49: increment the batch index;
S50: judge whether the batch index is less than N; if yes, return to step S202, otherwise execute step S51;
S51: the conversion is completed.
In the method provided by this embodiment for performing image-to-matrix-row vectorization conversion on the target image using the Load component and Store component of the vector processor and acquiring the converted matrix, each first block is blocked again and the im2row operation is performed on each resulting block, which greatly improves the memory access performance of the vector processor.
In implementation, in order to block the first partition again more accurately, in a preferred embodiment, blocking the first partition in the AM space to obtain the second partition includes:
and partitioning the first partition in the AM space according to the AM space parameters and the convolution calculation parameters so as to obtain a second partition.
In the above embodiment, the second partitions have been described and the line input window width W_win corresponding to the second partitions has been obtained; therefore, the method of partitioning the first partition again in the AM space according to the AM space parameters and the convolution calculation parameters will not be described again here.
Performing the im2row operation on the second partitions obtained by blocking each first partition again, as provided by this embodiment, improves the memory access performance of the vector processor.
In implementation, since the number of vector registers that can be used in vector processing is limited, in order to enable the second partition to be processed by the available vector registers, it is preferred that the width of the second partition is less than or equal to the total number of vector registers.
The width of the second partition provided by the present embodiment is less than or equal to the total number of vector registers, so that the data of the second partition can be processed by the vector register bank.
When updating the data of the vector registers, in a preferred embodiment, before storing, by the Store component of the vector processor, the sub-matrix data corresponding to each second partition in the vector register set at the corresponding position to form the converted matrix, the method further includes:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data stored in the last o vector registers of the vector register set to the first o vector registers; wherein the size of o is the size of the width overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second partition into the m vector registers in the vector register set by using the Load component of the vector processor, and storing the sub-matrix data corresponding to each second partition in the vector register set at the corresponding position by using the Store component of the vector processor to form the converted matrix; wherein m is the difference between the width W_win of the second partition and o.
In implementation, the specific process for updating the data of the vector register set in step S3517 is shown in fig. 6. Fig. 6 is a flowchart of a method for updating the data of a vector register set according to an embodiment of the present application, where the method includes:
S35171: copy the data of the last o vector registers to the first o registers in sequence.
Taking the running example with W_win = 6 and o = 2: when the loop indices take the values reached at the end of the first window, the result of copying the data in the last 2 vector registers to the first 2 registers in sequence is shown in Table 4.

TABLE 4 Data of the registers after copying the data in the last o vector registers to the first o registers
Register: VR0  VR1  VR2  VR3  VR4  VR5
Data:     a14  a15  a12  a13  a14  a15

As seen from Table 4, the data in the vector registers VR0, VR1, VR2, VR3, VR4, VR5 are a14, a15, a12, a13, a14, a15, respectively.
S35172: inputting the image through a Load part of a vector processor
Figure 671538DEST_PATH_IMAGE164
The size of the position is
Figure 296555DEST_PATH_IMAGE165
After the sub-matrix data are loaded to the vector register group in sequence
Figure 562451DEST_PATH_IMAGE166
In a vector register.
To be provided with
Figure 591193DEST_PATH_IMAGE077
Figure 967948DEST_PATH_IMAGE078
Figure 447471DEST_PATH_IMAGE079
Figure 133536DEST_PATH_IMAGE080
Figure 698510DEST_PATH_IMAGE094
Figure 878955DEST_PATH_IMAGE167
For example, when
Figure 212985DEST_PATH_IMAGE147
Figure 164891DEST_PATH_IMAGE162
Figure 217161DEST_PATH_IMAGE168
Input an image
Figure 201297DEST_PATH_IMAGE164
The size of the position is
Figure 426653DEST_PATH_IMAGE165
After the sub-matrix data are loaded to the input window vector register group in sequence
Figure 205253DEST_PATH_IMAGE166
The results of the vector registers are shown in table 5, and table 5 shows the data of the updated vector processor group.
TABLE 5 data of the updated vector processor set
Figure 744819DEST_PATH_IMAGE169
As shown in table 5, the data in the vector registers VR0, VR1, VR2, VR3, VR4, and VR5 are a14, a15, a16, a17, a18, and a19, respectively, that is, the update of the data in the vector register group is realized.
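The two-step update of Tables 4 and 5 can be modeled on a Python list standing in for VR0 to VR5; the values of o = 2 and W_win = 6 are those of the running example:

```python
OVERLAP = 2     # o: width overlap of adjacent second partitions
W_WIN = 6       # line input window width
vr = ["a10", "a11", "a12", "a13", "a14", "a15"]  # state after the first window

# Step S35171: copy the last o registers to the first o registers (Table 4)
vr[:OVERLAP] = vr[-OVERLAP:]
# vr is now a14, a15, a12, a13, a14, a15

# Step S35172: load the next m = W_win - o new columns into the tail registers
new_cols = ["a16", "a17", "a18", "a19"]
vr[OVERLAP:] = new_cols
# vr is now a14 .. a19, matching Table 5
```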
The data updating method provided by the embodiment enables the data of each second partition to be processed, and improves the access performance of the vector processor.
In the above embodiments, the method for converting an image into a matrix row is described in detail, and the present application also provides a corresponding embodiment of the vector processor-oriented image-to-matrix row conversion apparatus. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 7 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to an embodiment of the present application. This embodiment is described from the perspective of function modules and includes:
the acquisition module 10 is used for acquiring a target image and storing the target image into the DDR;
the loading module 11 is configured to invoke a DMA operation to load a target image from a DDR to an AM space;
a conversion processing module 12, configured to perform vectorization conversion processing on the target image from an image to a matrix row by using a Load component and a Store component of the vector processor, and obtain a converted matrix; wherein, the target image exists in the form of multi-dimensional array in the vector processor;
and the storage module 13 is used for calling DMA operation to store the converted matrix from the AM space back to the DDR.
Since the embodiment of the apparatus portion and the embodiment of the method portion correspond to each other, please refer to the description of the embodiment of the method portion for the embodiment of the apparatus portion, and details are not repeated here.
Since the above-mentioned vector processor-oriented image-to-matrix row conversion method and the vector processor-oriented image-to-matrix row conversion apparatus of the present embodiment share the same technical features, the apparatus provided by the present embodiment has the same beneficial effects as the method.
Fig. 8 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to another embodiment of the present application. This embodiment is based on a hardware perspective, and as shown in fig. 8, the apparatus includes:
a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the vector processor oriented method of image to matrix row conversion as mentioned in the above embodiments when executing the computer program.
The image-to-matrix row conversion apparatus for vector processors provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The Processor 21 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU) which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for processing computational operations related to machine learning.
Memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201, wherein the computer program is loaded and executed by the processor 21, and then the relevant steps of the method for converting an image into a matrix row disclosed in any one of the foregoing embodiments can be implemented. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, windows, unix, linux, and the like. Data 203 may include, but is not limited to, data involved in the vector processor-oriented image-to-matrix row conversion methods mentioned above, and the like.
In some embodiments, the vector processor-oriented image-to-matrix row conversion device may further include a display screen 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the architecture shown in fig. 8 does not constitute a limitation of the image-to-matrix row conversion means of the vector-oriented processor and may include more or fewer components than those shown.
The image-to-matrix row conversion device facing the vector processor provided by the embodiment of the application comprises a memory and a processor; when the processor executes the program stored in the memory, it can implement the vector processor-oriented image-to-matrix row conversion method described above, with the same beneficial effects as that method.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods described in the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The computer-readable storage medium provided by the present application includes the above-mentioned image-to-matrix row conversion method for vector processors, and the effects are the same as above.
The present application provides a method, an apparatus, and a medium for converting an image into a matrix row for a vector processor. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part. It should be noted that, for those skilled in the art, without departing from the principle of the present application, the present application can also make several improvements and modifications, and those improvements and modifications also fall into the protection scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A method for converting an image into matrix rows facing a vector processor, comprising:
obtaining a target image, storing the target image in a double rate synchronous dynamic random access memory (DDR SDRAM), and marking it as I; the stored data layout is given in terms of N, H, W, w and C_b, wherein N indicates the batch size, H and W represent the height and width of the input feature map, w represents the width of data processed in parallel by the vector processing unit, and C_b represents the number of blocks of a channel on the input feature map;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space; after being transmitted into the array memory space, the data is expanded by a filling operation into a padded form marked as I'; wherein I' represents the filled data of the target image transferred to the array memory space, H' represents the filled height of the target image transferred into the array memory space, and W' represents the width of the target image after transfer to the array memory space;
performing vectorization conversion processing from an image to matrix rows on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
invoking the direct memory access operation to store the converted matrix from the array memory space back into the double rate synchronous dynamic random access memory;
the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first block from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space;
the performing image-to-matrix-row vectorization conversion processing on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix comprises:
partitioning the first partition in the array memory space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector reading component of the vector processor;
storing, by the vector writing component of the vector processor, the sub-matrix data corresponding to each of the second partitions in the vector register set at a corresponding position to form the converted matrix;
obtaining the converted matrix;
before the vector writing component of the vector processor stores the sub-matrix data corresponding to each of the second partitions in the vector register set at a corresponding position to form the converted matrix, the method further comprises:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data stored in the last o vector registers of the vector register set to the first o vector registers; wherein o is the width overlap of two adjacent second partitions;
sequentially loading the sub-matrix data corresponding to each second partition into m vector registers of the vector register set by using the vector reading component of the vector processor, and entering the step of storing the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position by using the vector writing component of the vector processor to form the converted matrix; wherein the size of m is the difference between the width of the second partition and o.
2. The vector processor-oriented image-to-matrix row conversion method of claim 1, wherein the blocking the target image to obtain first blocks comprises:
blocking the target image according to the array memory space parameters so as to obtain each first block; wherein the array memory space parameters at least comprise index parameters corresponding to each dimension.
3. The vector processor-oriented image-to-matrix row conversion method of claim 2, wherein the blocking the first partition in the array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
4. The vector processor-oriented image-to-matrix row conversion method of claim 3, wherein the width of the second partition is less than or equal to the total number of vector registers.
5. An image-to-matrix row conversion apparatus for a vector processor, comprising:
an acquisition module for acquiring a target image and storing the target image in a double-rate synchronous dynamic random access memory (DDR SDRAM) with a tag of
Figure 301857DEST_PATH_IMAGE002
The stored data is laid out as
Figure DEST_PATH_IMAGE004A
Wherein
Figure DEST_PATH_IMAGE006A
Which is indicative of the size of the batch process,
Figure DEST_PATH_IMAGE031
and
Figure DEST_PATH_IMAGE010A
representing the height and width of the input feature map,
Figure DEST_PATH_IMAGE012A
representing the width of data processed in parallel by the vector processing unit,
Figure DEST_PATH_IMAGE014A
representing the number of blocks of a channel on the input feature map;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space; after being transmitted into the array memory space, the data is expanded into the array memory space through a filling operation
Figure DEST_PATH_IMAGE016A
Is marked by
Figure DEST_PATH_IMAGE018A
(ii) a Wherein
Figure DEST_PATH_IMAGE020A
Representing the filled data transferred to the array memory space of the target image,
Figure DEST_PATH_IMAGE022A
representing the filled height of the target image transferred into the array memory space,
Figure DEST_PATH_IMAGE024A
representing the width of the target image after being transmitted to the array memory space;
the conversion processing module is used for carrying out vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
a storage module, configured to invoke the dma operation to store the converted matrix from the array memory space back to the ddr sdram;
the invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to a corresponding position of the array memory space;
the vectorization processing of the target image to a matrix row by using the vector reading part and the vector writing part of the vector processor and obtaining the converted matrix comprises:
blocking the first tile in the array memory space to obtain a second tile;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
obtaining the converted matrix;
before the vector write component of the vector processor stores the sub-matrix data corresponding to each second block from the vector register set at the corresponding location to form the converted matrix, the method further comprises:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data held in o vector registers of the vector register set to o other vector registers of the set; wherein o is the width overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second block into m vector registers of the vector register set using the vector read component of the vector processor, and proceeding to the step of storing, by the vector write component of the vector processor, the sub-matrix data corresponding to each second block from the vector register set at the corresponding location to form the converted matrix; wherein m is the difference between the second-block width w and the overlap o, i.e. m = w − o.
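The register-reuse rule, with o the width overlap of adjacent second blocks and m = w − o new elements loaded per block, can be sketched with plain lists; `load_with_overlap` is a hypothetical helper and the register set is modeled as a Python list.

```python
def load_with_overlap(row, block_w, overlap):
    """Sketch of the claimed reuse rule: adjacent second blocks of
    width block_w overlap by `overlap` elements, so only
    m = block_w - overlap new elements are read per block; the
    overlapping elements are copied from the previous block's registers."""
    m = block_w - overlap
    regs = list(row[:block_w])          # initial full load of one block
    blocks = [list(regs)]
    pos = block_w
    while pos + m <= len(row):
        # copy the last `overlap` registers, then read only m new elements
        regs = regs[-overlap:] + list(row[pos:pos + m])
        blocks.append(list(regs))
        pos += m
    return blocks

blks = load_with_overlap(list(range(8)), 4, 2)   # m = 2 new loads per block
```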
6. An image-to-matrix row conversion apparatus for a vector processor, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the vector-processor-oriented image-to-matrix-row conversion method as claimed in any one of claims 1 to 4 when executing said computer program.
7. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the vector-processor-oriented image-to-matrix-row conversion method of any one of claims 1 to 4.
CN202211043942.9A 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium Active CN115114575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211043942.9A CN115114575B (en) 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium


Publications (2)

Publication Number Publication Date
CN115114575A CN115114575A (en) 2022-09-27
CN115114575B true CN115114575B (en) 2023-01-31

Family

ID=83335344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043942.9A Active CN115114575B (en) 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium

Country Status (1)

Country Link
CN (1) CN115114575B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235A (en) * 2019-10-21 2020-02-14 中国人民解放军国防科技大学 Vectorization implementation method for Valid convolution of convolutional neural network
WO2020059156A1 (en) * 2018-09-18 2020-03-26 Nec Corporation Data processing system, method, and program
CN113806261A (en) * 2021-10-09 2021-12-17 中国人民解放军国防科技大学 Pooling vectorization implementation method for vector processor
CN114330669A (en) * 2021-12-30 2022-04-12 中国人民解放军国防科技大学 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
CN114329326A (en) * 2021-12-10 2022-04-12 中国人民解放军国防科技大学 Low-bit-width data matrix vectorization column expansion method and system of vector processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11417373B2 (en) * 2020-12-09 2022-08-16 Micron Technology, Inc. Neuromorphic computing devices and methods



Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
CN109213962B (en) Operation accelerator
EP3698313B1 (en) Image preprocessing for generalized image processing
CN108108811B (en) Convolution calculation method in neural network and electronic device
CN107844828B (en) Convolution calculation method in neural network and electronic device
CN111758107B (en) System and method for hardware-based pooling
US10769749B2 (en) Processor, information processing apparatus, and operation method of processor
CN110415157B (en) Matrix multiplication calculation method and device
WO2021080873A1 (en) Structured pruning for machine learning model
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN113673701A (en) Method for operating neural network model, readable medium and electronic device
KR20210014561A (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN111125617A (en) Data processing method, data processing device, computer equipment and storage medium
CN113313247B (en) Operation method of sparse neural network based on data flow architecture
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
US20090064120A1 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
CN112966729B (en) Data processing method and device, computer equipment and storage medium
CN115114575B (en) Vector processor-oriented image-to-matrix row conversion method, device and medium
CN111931937B (en) Gradient updating method, device and system of image processing model
TW202234266A (en) Performing tensor operations using a programmable control engine
US11748100B2 (en) Processing in memory methods for convolutional operations
EP4300369A1 (en) Methods and systems for executing a neural network on a neural network accelerator
CN117851742A (en) Data storage method, data processing method, data memory and data processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant