CN115114575B - Vector processor-oriented image-to-matrix row conversion method, device and medium - Google Patents

Info

Publication number: CN115114575B
Authority: CN (China)
Prior art keywords: vector, matrix, target image, image, vector processor
Legal status: Active
Application number: CN202211043942.9A
Other languages: Chinese (zh)
Other versions: CN115114575A
Inventor
王庆林
廖林玉
尹尚飞
梅松竹
许金伟
李东升
姜晶菲
苏华友
李荣春
符永铨
刘杰
Current Assignee: National University of Defense Technology
Original Assignee: National University of Defense Technology
Application filed by National University of Defense Technology
Priority application: CN202211043942.9A
Publication of application: CN115114575A
Grant publication: CN115114575B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F 13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F 13/14 - Handling requests for interconnection or transfer
    • G06F 13/20 - Handling requests for interconnection or transfer for access to input/output bus
    • G06F 13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 - General purpose image data processing
    • G06T 1/60 - Memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a vector processor-oriented image-to-matrix row conversion method, device and medium, and relates to the technical field of image processing. The method comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion (im2row) on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. With this method of performing the im2row operation inside the vector processor, data are transmitted in the form of multi-dimensional arrays by the DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation improved.

Description

Vector processor-oriented image-to-matrix row conversion method, device and medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method, an apparatus, and a medium for converting an image into a matrix row for a vector processor.
Background
The image-to-matrix row conversion operation (im2row) is one of the important operations in the general convolution implementations used in deep learning. Efficiently realizing the conversion of an image of arbitrary size into matrix rows on a vector processor can effectively improve general convolution performance on that processor, and greatly expands the convolution operations and neural network types the vector processor supports.
The image-to-matrix row conversion is typically a memory-intensive operation, so optimizing memory-access performance in an im2row implementation is critical to speeding it up. On a vector processor there are generally two implementation methods: the first performs the conversion directly through scalar operations on the vector processor; the second performs the conversion directly on the Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM, hereafter DDR) external to the vector processor, via Direct Memory Access (DMA) operations. The performance of a DMA operation is directly related to the size of the block it transfers: generally speaking, the larger the block, the more effectively the DMA component can exploit the memory bandwidth of the vector processor. For the first method, each scalar operation is equivalent to a DMA transfer of 1 element. For the second method, the block size is typically 1 element or L elements, where L is the data width that the vector processing unit processes in parallel. In both common implementations the DMA block size is therefore small, and the memory-access performance of the vector processor cannot be exploited effectively.
Therefore, the problem to be solved by those skilled in the art is how to implement im2row efficiently, so as to improve the performance of the image-to-matrix row conversion operation on the vector processor.
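For reference, the conversion itself can be written down in a few lines of NumPy. This is only an illustrative scalar-style definition of im2row with an NHWC layout (function and parameter names are ours), not the vectorized method of the application:

```python
import numpy as np

def im2row(x, kh, kw, stride=1, pad=0):
    """Naive im2row: one output matrix row per output pixel (NHWC layout)."""
    n, h, w, c = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad), (0, 0)))  # zero padding
    ho = (h + 2 * pad - kh) // stride + 1
    wo = (w + 2 * pad - kw) // stride + 1
    out = np.empty((n, ho * wo, kh * kw * c), dtype=x.dtype)
    for b in range(n):
        for i in range(ho):
            for j in range(wo):
                # Flatten the kh x kw x c patch under the convolution window.
                patch = xp[b, i * stride:i * stride + kh,
                              j * stride:j * stride + kw, :]
                out[b, i * wo + j] = patch.ravel()
    return out
```

Each output row is the flattened K_h × K_w × C patch under one position of the convolution window, which is exactly the matrix layout general convolution multiplies against.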
Disclosure of Invention
The aim of the application is to provide a vector processor-oriented image-to-matrix row conversion method, device and medium for improving the performance of the image-to-matrix row conversion operation on a vector processor.
In order to solve the above technical problem, the present application provides a method for converting an image into a matrix row for a vector processor, including:
acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space;
performing vectorization conversion processing of an image to a matrix row on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and invoking the direct memory access operation to store the converted matrix from the array memory space back into the double-rate synchronous dynamic random access memory.
Preferably, the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
and calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space.
Preferably, the blocking the target image so as to obtain each first block includes:
blocking the target image according to the array memory space parameters so as to obtain each first block; the array memory space parameters at least comprise index parameters corresponding to all dimensions.
Preferably, performing the vectorized image-to-matrix-row conversion on the target image using the vector reading component and the vector writing component of the vector processor and acquiring the converted matrix comprises the following steps:
blocking the first tile in the array memory space to obtain a second tile;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a set of vector registers using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
and acquiring the converted matrix.
Preferably, said blocking said first partition in said array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
Preferably, the width of the second partition is less than or equal to the total number of vector registers.
Preferably, before the vector writing component of the vector processor stores the sub-matrix data corresponding to each of the second blocks in the vector register set at the corresponding position to form the converted matrix, the method further includes:
in the case that the load index corresponding to the current dimension is smaller than the size of the current dimension, copying the sub-matrix data stored in the last o vector registers of the vector register set, in order, into the first o vector registers of the vector register set, where o is the width of the overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second block into the last m vector registers of the vector register set using the vector reading component of the vector processor, and entering the step of storing the sub-matrix data corresponding to each second block in the vector register set at the corresponding position by the vector writing component of the vector processor to form the converted matrix; where m is the difference between the width of the second block and o.
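The register-reuse step just described can be modeled in plain Python (a sketch; `regs` stands for the vector register set and `overlap` for the width overlap o of two adjacent second blocks):

```python
def slide_window_registers(regs, next_data, overlap):
    """Model one slide of the line input window across a vector register set.

    The last `overlap` registers of the previous window are copied, in
    order, to the front of the set (reusing data already on chip), and the
    remaining m = len(regs) - overlap registers are refilled with newly
    loaded data instead of being reloaded element by element.
    """
    m = len(regs) - overlap
    assert len(next_data) == m, "expected exactly m newly loaded vectors"
    return regs[-overlap:] + list(next_data)
```

With the a0..a9 example used later in the description, sliding from the window a0..a5 to a4..a9 copies two registers and loads only four new ones.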
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
the device comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring a target image and storing the target image into a double-rate synchronous dynamic random access memory;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space;
the conversion processing module is used for carrying out vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor;
and the storage module is used for calling the direct memory access operation to store the converted matrix from the array memory space back to the double-rate synchronous dynamic random access memory.
In order to solve the above technical problem, the present application further provides an image-to-matrix row conversion apparatus facing a vector processor, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the image-to-matrix row conversion method facing the vector processor when executing the computer program.
In order to solve the above technical problem, the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above image-to-matrix row conversion method for a vector processor.
The vector processor-oriented method for converting an image into matrix rows comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. Compared with the prior methods, which perform the im2row operation directly on the DDR through scalar operations or DMA operations whose transfers are one element or L elements each, the method provided by the application performs the im2row operation inside the vector processor: data are transmitted between the DDR and the AM in the form of multi-dimensional arrays by DMA operations, which significantly increases the transfer block size of each DMA operation and greatly reduces the number of DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation is significantly improved.
In addition, the application also provides an image-to-matrix row conversion device facing the vector processor and a computer readable storage medium, which correspond to the above mentioned method for converting the image facing the vector processor into the matrix row, and the effects are the same.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings required for the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained by those skilled in the art without inventive effort.
Fig. 1 is a block diagram of a vector processor according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application;
FIG. 3 is a block diagram provided in an embodiment of the present application;
fig. 4 is a flowchart of a vectorization processing method for implementing vectorization conversion processing from an input image to a matrix row in an AM space by a Load/Store component of a vector processor according to an embodiment of the present application;
FIG. 5 is a flow chart of a method for completing a transformation provided by an embodiment of the present application;
FIG. 6 is a flowchart of a method for updating data of a vector register set according to an embodiment of the present application;
FIG. 7 is a block diagram of an image-to-matrix row conversion apparatus for a vector processor according to an embodiment of the present application;
fig. 8 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to another embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the present application.
The core of the application is to provide a vector processor-oriented method, a vector processor-oriented device and a vector processor-oriented medium for converting an image into a matrix row, and the vector processor-oriented medium is used for improving the performance of the image-to-matrix row conversion on the vector processor.
In order that those skilled in the art will better understand the disclosure, a detailed description is given below with reference to the accompanying drawings. Fig. 1 is a block diagram of a vector processor according to an embodiment of the present application. As shown in fig. 1, the vector processor includes a Scalar Processing Unit (SPU) that performs scalar operations, a Vector Processing Unit (VPU) that performs vector operations, a Direct Memory Access (DMA) component that is responsible for data transfer, and so on. The SPU is composed of a Scalar Processing Element (SPE) and a Scalar Memory (SM). The VPU is composed of J Vector Processing Elements (VPEs) and an Array Memory (AM); the J VPEs operate cooperatively in a Single Instruction Multiple Data (SIMD) manner and support turning specific VPEs off and on, but do not support data interaction between VPEs. A single VPE can process one 8-byte element (e.g., FP64, Int64) or two 4-byte elements (e.g., FP32, Int32) at a time. The DMA component is responsible for data transfer between SM and DDR and between AM and DDR, with a minimum operation granularity of 8 bytes. Fig. 2 is a flowchart of an image-to-matrix row conversion method for a vector processor according to an embodiment of the present application; as shown in fig. 2, the method includes:
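The relation between the number of VPEs and the parallel data width L used throughout the description (L = J for 64-bit types, L = 2J for 32-bit types) follows directly from the 8-byte lane width of a VPE. A small sketch (the value J = 16 in the test is an arbitrary example, not taken from the application):

```python
def parallel_width(num_vpes, elem_bytes):
    """Parallel data width L of the VPU: each VPE handles 8 bytes at a time,
    i.e. one 64-bit element or two 32-bit elements."""
    if elem_bytes not in (4, 8):
        raise ValueError("expected a 4-byte or 8-byte element type")
    return num_vpes * (8 // elem_bytes)
```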
s1: and acquiring a target image and storing the target image into the DDR.
The target image is an image that requires convolution calculation. The input feature map (input image) data matrix of the convolution calculation is stored in the DDR and labeled I, with data layout I[N][H][W][C_b][L], where N denotes the batch size, H and W denote the height and width of the input feature map, L denotes the data width processed in parallel by the vector processing unit (L = J when the data type is FP64 or Int64; L = 2J when the data type is FP32 or Int32), and C_b denotes the number of channel blocks of the input feature map, the number of channels of the input feature map being C_b × L. The height and width of the convolution kernel in the convolution calculation are K_h and K_w respectively, the step size is S, and the fill (padding) size is P. Thus, the final result matrix of the im2row operation is stored in the DDR, labeled O, with data layout O[N][H_o × W_o][K_h × K_w × C_b × L], where H_o and W_o denote the height and width of the output feature map of the convolution calculation; their calculation formulas are shown as formulas (1) and (2), where ⌊·⌋ denotes rounding down.

H_o = ⌊(H + 2P - K_h) / S⌋ + 1    (1)

W_o = ⌊(W + 2P - K_w) / S⌋ + 1    (2)
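Formulas (1) and (2) can be checked with a small helper (names are ours):

```python
import math

def conv_out_size(extent, kernel, stride, pad):
    """Output extent of one convolution dimension, per formulas (1)/(2):
    floor((extent + 2*pad - kernel) / stride) + 1."""
    return math.floor((extent + 2 * pad - kernel) / stride) + 1
```

For example, H = 224 with K_h = 3, S = 1, P = 1 gives H_o = 224 (the "same size" case).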
S2: and calling DMA operation to load the target image from the DDR to the AM space.
In order to improve the memory-access performance of the vector processor, the im2row operation, which would otherwise be performed off-chip, is converted into an operation performed on-chip: a DMA operation is called to load the target image from the DDR into the AM space.
S3: performing vectorization conversion processing of an image to a matrix row on a target image by using a Load component and a Store component of a vector processor and acquiring a converted matrix; wherein the target image is in the form of a multi-dimensional array in the vector processor.
Since the Load component (i.e., vector read component) and the Store component (i.e., vector write component) of the vector processor can implement the vectorization conversion process of the image into matrix rows, im2row operation is performed on the AM space by using the Load component and the Store component, resulting in a converted matrix.
S4: the DMA operation is invoked to store the translated matrix from the AM space back into the DDR.
In order to improve the access performance of the vector processor, the target image is loaded from the DDR to the AM space through the DMA operation, and after im2row operation is performed on the AM space, the converted matrix needs to be stored from the AM space back to the DDR. In storing from AM space back to DDR, this is still accomplished by invoking DMA operations.
The vector processor-oriented image-to-matrix row conversion method provided by this embodiment comprises the following steps: acquiring a target image and storing it into the DDR; calling a DMA operation to load the target image from the DDR into the AM space; performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix; and calling the DMA operation to store the converted matrix from the AM space back into the DDR. Compared with the prior methods, which perform the im2row operation directly on the DDR through scalar operations or DMA operations whose transfers are one element or L elements each, the method provided by this embodiment performs the im2row operation inside the vector processor: data are transmitted between the DDR and the AM in the form of multi-dimensional arrays by DMA operations, which significantly increases the transfer block size of each DMA operation and greatly reduces the number of DMA operations, so that the memory bandwidth of the vector processor can be exploited effectively and the performance of the im2row operation is significantly improved.
In implementation, in order to load the target image from the DDR to the AM space, preferably, the method includes the following steps:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first partition;
and calling DMA operation to load the sub-matrix data corresponding to each first partition from the DDR to the corresponding position of the AM space.
Since the AM space is limited, the target image needs to be loaded from the DDR to the AM space as a plurality of first blocks. Mark the AM space size as M_AM; mark the original input feature-map block transmitted into the AM space each time as I_b, with size h_b × w_b; after being transmitted into the AM space, it is expanded (zero-padded) into a block marked I_b'; and mark the im2row result it produces as O_b. Due to the limitation of the AM space size, the total space occupied by I_b' and O_b must not exceed M_AM, namely:

(|I_b'| + |O_b|) × t ≤ M_AM,

where |·| denotes the number of elements and t denotes the number of bytes of memory occupied by each element in the convolution calculation (t = 4 if the convolution calculation uses single-precision floating point FP32). The block sizes h_b and w_b of the im2row operation transmitted into the AM space are thereby determined.
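One way to read the capacity constraint is as a search for the largest block whose padded copy plus im2row output still fit in AM. The sizing expressions below are our assumption (the precise expressions appear as figures in the original); the sketch only illustrates the inequality (|I_b'| + |O_b|) × t ≤ M_AM:

```python
def fits_in_am(hb, wb, cb, L, kh, kw, pad, stride, am_bytes, t=4):
    """Check whether one candidate block plus its im2row output fit in AM.

    Assumed sizing: the padded block holds (hb+2*pad)*(wb+2*pad)*cb*L
    elements, and its im2row result holds one kh*kw*cb*L row per output
    pixel of the block.
    """
    padded = (hb + 2 * pad) * (wb + 2 * pad) * cb * L
    ho = (hb + 2 * pad - kh) // stride + 1
    wo = (wb + 2 * pad - kw) // stride + 1
    out = ho * wo * kh * kw * cb * L
    return (padded + out) * t <= am_bytes

def largest_square_block(cb, L, kh, kw, pad, stride, am_bytes, t=4, hmax=1024):
    """Largest square block edge h_b = w_b satisfying the AM constraint."""
    best = 0
    for s in range(kh, hmax + 1):
        if fits_in_am(s, s, cb, L, kh, kw, pad, stride, am_bytes, t):
            best = s
    return best
```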
the embodiment of the application provides a method for loading one block in a target image from a DDR to an AM space by calling DMA operation. The method comprises the following steps:
s201: initialization
Figure 304014DEST_PATH_IMAGE038
Where N represents an index in the N dimension;
s202: initialization
Figure 142657DEST_PATH_IMAGE039
Wherein
Figure 252827DEST_PATH_IMAGE039
To represent
Figure 518723DEST_PATH_IMAGE040
An index in a dimension;
s203: initialization
Figure 596400DEST_PATH_IMAGE041
Wherein
Figure 238734DEST_PATH_IMAGE042
To represent
Figure 967525DEST_PATH_IMAGE043
An index in a dimension;
s204: computing
Figure 404322DEST_PATH_IMAGE043
Dimension actual block size
Figure 703717DEST_PATH_IMAGE044
Wherein min represents the minimum of the two numbers;
s205: initialization
Figure 884162DEST_PATH_IMAGE045
Wherein
Figure 965994DEST_PATH_IMAGE046
To represent
Figure 573693DEST_PATH_IMAGE047
A start index in a dimension;
s206: judgment of
Figure 625963DEST_PATH_IMAGE046
If yes, go to step S207; if not, go to step S208;
s207: initialization
Figure 344520DEST_PATH_IMAGE048
And make an order
Figure 47903DEST_PATH_IMAGE049
The flow advances to step S209;
wherein
Figure 623241DEST_PATH_IMAGE050
Represent
Figure 366069DEST_PATH_IMAGE051
An index of (a);
s208: initialization
Figure 685055DEST_PATH_IMAGE052
S209: initialization
Figure 993676DEST_PATH_IMAGE053
To represent
Figure 693910DEST_PATH_IMAGE054
End index in dimension, min represents the minimum value of two numbers;
s210: calculating out
Figure 455193DEST_PATH_IMAGE055
Representing an incoming AM space from a DDR
Figure 781132DEST_PATH_IMAGE056
A block size in dimension;
s211: initialization
Figure 944260DEST_PATH_IMAGE057
Wherein
Figure 313930DEST_PATH_IMAGE058
To represent
Figure 828088DEST_PATH_IMAGE059
An index in a dimension;
s212: calculating out
Figure 692139DEST_PATH_IMAGE059
Dimension actual block size
Figure 709773DEST_PATH_IMAGE060
Wherein min represents the minimum of the two numbers;
s213: initialization
Figure 14459DEST_PATH_IMAGE061
Wherein
Figure 750334DEST_PATH_IMAGE062
Represent
Figure 418076DEST_PATH_IMAGE063
A start index in a dimension;
s214: judgment of
Figure 290217DEST_PATH_IMAGE064
If not, the process goes to step S215; if not, go to step S216;
s215: initialization
Figure 267269DEST_PATH_IMAGE065
And make an order
Figure 490440DEST_PATH_IMAGE066
The flow advances to step S217;
wherein
Figure 696294DEST_PATH_IMAGE067
Represent
Figure 688520DEST_PATH_IMAGE068
An index in a dimension;
s216: initialization
Figure 72359DEST_PATH_IMAGE069
S217: initialization
Figure 517247DEST_PATH_IMAGE070
To represent
Figure 526792DEST_PATH_IMAGE071
A dimensional end index;
s218: computing
Figure 639104DEST_PATH_IMAGE072
Representing an incoming AM space from a DDR
Figure 692380DEST_PATH_IMAGE071
A block size in dimension;
s219: in AM is
Figure 890143DEST_PATH_IMAGE073
All elements of the space are all initialized to 0.
S220: invoking a Direct Memory Access (DMA) operation will
Figure 437799DEST_PATH_IMAGE074
The size of the position is
Figure 404618DEST_PATH_IMAGE075
Into on-chip AM space
Figure 127330DEST_PATH_IMAGE076
At the location.
To be provided with
Figure 812389DEST_PATH_IMAGE077
Figure 163736DEST_PATH_IMAGE078
Figure 985061DEST_PATH_IMAGE079
Figure 645719DEST_PATH_IMAGE080
For example, table 1 shows the AM space
Figure 552495DEST_PATH_IMAGE081
Assuming that it is in AM space at this time
Figure 707533DEST_PATH_IMAGE081
The partial data of (a) are as follows:
TABLE 1 in AM space
Figure 648944DEST_PATH_IMAGE081
Part of data of
Figure 450809DEST_PATH_IMAGE082
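Steps S201 to S220 amount to clamping the padded block window to the image bounds, zeroing the destination block, and copying the in-bounds region in one transfer. A NumPy sketch of that per-block load for a single channel plane follows; the index arithmetic is reconstructed from the step descriptions, so treat it as an assumption:

```python
import numpy as np

def load_block_padded(img, ih, iw, hb, wb, pad):
    """Copy one (hb x wb) block of `img` into a zero-initialized padded
    buffer, clamping the window to the image bounds (cf. steps S204-S220)."""
    H, W = img.shape
    hb2 = min(hb, H - ih)                                      # S204
    wb2 = min(wb, W - iw)                                      # S212
    hs, ph = (0, pad - ih) if ih - pad < 0 else (ih - pad, 0)  # S205-S208
    ws, pw = (0, pad - iw) if iw - pad < 0 else (iw - pad, 0)  # S213-S216
    he = min(ih + hb2 + pad, H)                                # S209
    we = min(iw + wb2 + pad, W)                                # S217
    buf = np.zeros((hb2 + 2 * pad, wb2 + 2 * pad), dtype=img.dtype)  # S219
    buf[ph:ph + (he - hs), pw:pw + (we - ws)] = img[hs:he, ws:we]    # S220
    return buf
```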
When the target image is loaded from the DDR into the AM space by calling the DMA operation as provided by this embodiment, the target image is divided into a plurality of first blocks, so that the whole target image can be loaded into the AM space block by block; through this blocking, the vector processor can load each block from the DDR into the AM space in order.
In implementation, when loading the target image from the DDR to the AM space, in order to reasonably block the target image, it is preferable that the blocking the target image so as to obtain each first block includes:
blocking the target image according to the AM space parameters so as to obtain first blocks; the AM space parameters at least comprise index parameters corresponding to all dimensions.
When the target image is blocked, it can be blocked reasonably according to the parameters of the AM space, which at least include the index parameters corresponding to each dimension. The procedure shown in the above embodiment for loading one block of the target image from the DDR into the AM space by calling the DMA operation blocks the image according to the index parameters corresponding to the H and W dimensions, and finally obtains the block sizes in the H and W dimensions.
The target image is loaded from DDR to the AM space according to the parameters on the AM space, so that the target image can be reasonably partitioned.
In implementation, in order to facilitate the im2row operation on the first blocks, preferably, performing the vectorized image-to-matrix-row conversion on the target image using the Load and Store components of the vector processor and acquiring the converted matrix includes:
partitioning the first partition in an AM space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each second partition into a vector register set by using a Load component of the vector processor;
storing, by a Store component of the vector processor, sub-matrix data corresponding to each second partition in the vector register set at a corresponding location to form a converted matrix;
and acquiring the converted matrix.
The padded input I' is blocked along the W' dimension. Each block is called a line input window WinInput, and the width of each block is denoted W_win. Two adjacent line input windows overlap by o columns of width, where o is determined by the convolution parameters. The block size, in the output-width dimension, of the convolution output corresponding to one line input window is denoted W_out_win; its value is determined by W_win, o and the convolution stride. In view of the limitations that W_win must be less than or equal to the total number of available vector registers in the vector processor and that W_out_win must be an integer (a condition expressed with the remainder operation, where mod denotes the remainder), the value of W_win is thereby determined.
Taking the running example with the line input window width W_win taken as 6, the blocks are as shown in fig. 3. Fig. 3 is a block diagram provided in an embodiment of the present application. Columns a0 to a5 form the first line input window, columns a4 to a9 form the second line input window, and a4 and a5 are the overlapping portion of the first and second line input windows. Fig. 4 is a flowchart of a processing method for implementing vectorization conversion from an input image to matrix rows in the AM space by the Load/Store components of a vector processor according to an embodiment of the present application. As shown in fig. 4, the method includes:
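The window layout of fig. 3 can be reproduced numerically. A short Python sketch, assuming (as inferred from the example, where width-6 windows share 2 columns) that the start-to-start step of adjacent windows is the window width minus the overlap; the helper name `window_starts` is hypothetical:

```python
def window_starts(total_width, w_win, overlap):
    """Start column of each line input window: adjacent windows share
    `overlap` columns, so starts advance by w_win - overlap."""
    step = w_win - overlap
    starts = []
    s = 0
    while s + w_win <= total_width:
        starts.append(s)
        s += step
    return starts

cols = [f"a{i}" for i in range(10)]          # a0 .. a9
windows = [cols[s:s + 6] for s in window_starts(10, w_win=6, overlap=2)]
# first window a0..a5, second window a4..a9; a4 and a5 are shared
```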
S31: block I' along the W' dimension and determine the value of the line input window width W_win;
S32: initialize the first blocking index to 0;
S33: initialize the second blocking index to 0;
S34: use the Load component of the vector processor to sequentially load the sub-matrix data of the corresponding size at the current position of the blocked input feature map in the AM space into the vector register set.
Taking the running example with W_win = 6: when both blocking indices are at their initial value 0, sub-matrix data of the corresponding size is loaded into the vector registers VR0, VR1, VR2, VR3, VR4, VR5 at one time; the result is shown in Table 2 below.

TABLE 2 Data loaded into the vector register set at one time
Register: VR0  VR1  VR2  VR3  VR4  VR5
Data:     a10  a11  a12  a13  a14  a15

As can be seen from Table 2, the data in the vector registers VR0, VR1, VR2, VR3, VR4, VR5 are a10, a11, a12, a13, a14, a15, respectively.
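The one-shot load of Table 2 can be modeled with a Python list standing in for VR0 to VR5. This is a toy model of the data movement, not the processor's actual Load semantics:

```python
W_WIN = 6                                   # line input window width
feature_row = [f"a1{j}" for j in range(8)]  # a10, a11, ..., a17 in AM space

# The Load component fills one register per window column in sequence.
vr = [None] * W_WIN
for k in range(W_WIN):
    vr[k] = feature_row[k]
# vr now holds a10 .. a15, matching Table 2
```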
S35: storing data in a vector register set by a Store component of a vector processor
Figure 661340DEST_PATH_IMAGE106
The corresponding position of (a).
The embodiment of the application provides a method for storing data in a vector register set
Figure 687196DEST_PATH_IMAGE106
The method of (a), the method comprising:
S3501: initialize to 0 the index within the line input window, i.e. the index in the W_win dimension;
S3502: initialize to 0 the index in the next blocking dimension;
S3503: compute the corresponding index in the output matrix;
S3504: judge whether the storing condition holds; if yes, execute step S3505, otherwise jump to step S3511;
S3505: initialize to 0 a further dimension index;
S3506: compute the index in the W_out_win dimension;
S3507: judge whether that index is within range; if yes, go to step S3508; if not, jump to step S3509;
S3508: store, by the Store component of the vector processor, the data of the current vector register to the corresponding position of the output matrix;
S3509: increment the index of step S3505;
S3510: judge whether the index of step S3505 is less than the size of its dimension; if yes, return to step S3506, and if no, execute step S3511.
S3511: increment the index within the line input window;
S3512: judge whether the index within the line input window is less than W_win; if yes, return to step S3503, otherwise execute step S3513;
S3513: decrement the index of step S3502;
S3514: judge whether the index of step S3502 is greater than or equal to 0; if yes, jump to step S3502, otherwise execute step S3515;
S3515: update the load index in the corresponding dimension;
S3516: judge whether the load index is less than the size of that dimension; if yes, go to step S3517; if not, jump to step S3518;
S3517: update the data of the vector register set and return to step S3501;
S3518: increment the second blocking index;
S3519: judge whether the second blocking index is less than the size of its dimension; if yes, go to step S33 of fig. 4, i.e. re-initialize the second blocking index; if not, execute step S41.
S41: invoke the direct memory access (DMA) operation to store the converted sub-matrix of the corresponding size at the corresponding position of the AM space back to the corresponding location of the double rate synchronous dynamic random access memory.
Taking the running example with W_win = 6: when the loop indices take their initial values, the data loaded to VR0, VR1, VR2, VR3, VR4 and VR5 are stored to the corresponding positions of the output matrix. Part of the resulting data of the output matrix is shown in Table 3; it should be noted that "-" in Table 3 indicates data irrelevant to this storing operation.

TABLE 3 Partial data of the output matrix after storing the vector register set (entries marked "-" are irrelevant to this storing operation)
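The storing steps S3501 to S3519 essentially scatter the registers of each line input window into rows of the converted matrix. A minimal Python sketch of that scatter for one window, assuming kernel width 3 and stride 1 (hypothetical values consistent with the running example; `store_window` is an invented name):

```python
def store_window(vr, kw, sw):
    """Scatter one line input window held in registers vr into rows of the
    converted matrix: output row r receives registers r*sw .. r*sw+kw-1."""
    n_out = (len(vr) - kw) // sw + 1
    return [vr[r * sw : r * sw + kw] for r in range(n_out)]

rows = store_window(["a10", "a11", "a12", "a13", "a14", "a15"], kw=3, sw=1)
# 4 output rows: [a10,a11,a12], [a11,a12,a13], [a12,a13,a14], [a13,a14,a15]
```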
Since im2row operation is performed on each partition in sequence in the present application, after the converted current matrix is stored from the AM space back to the DDR, it is necessary to determine whether processing is completed on each partition. Fig. 5 is a flowchart of a method for completing a conversion according to an embodiment of the present application. As shown in fig. 5, the method includes:
S42: complete the conversion of the current block;
S43: increment the window index;
S44: judge whether the window index is less than the number of line input windows; if yes, go back to step S212; if not, execute step S45;
S45: increment the next blocking index;
S46: judge whether that index is less than the size of its dimension; if yes, return to step S204, otherwise execute step S47;
S47: increment the following blocking index;
S48: judge whether that index is less than the size of its dimension; if yes, return to step S203, otherwise execute step S49;
S49: increment the batch index;
S50: judge whether the batch index is less than N; if yes, return to step S202, otherwise execute step S51;
S51: the conversion is completed.
In the method provided by this embodiment for performing image-to-matrix-row vectorization conversion on the target image using the Load component and Store component of the vector processor and acquiring the converted matrix, each first block is blocked again and the im2row operation is performed on each resulting block, which greatly improves the memory access performance of the vector processor.
In implementation, in order to block the first partition again more accurately, in a preferred embodiment, blocking the first partition in the AM space to obtain the second partition includes:
and partitioning the first partition in the AM space according to the AM space parameters and the convolution calculation parameters so as to obtain a second partition.
In the above embodiment, the second partitions have been described and the line input window width W_win corresponding to the second partitions has been obtained; therefore, the method of partitioning the first partition again in the AM space according to the AM space parameters and the convolution calculation parameters will not be described again here.
Performing the im2row operation on the second partitions obtained by blocking each first partition again, as provided by this embodiment, improves the memory access performance of the vector processor.
In implementation, since the number of vector registers that can be used in vector processing is limited, in order to enable the second partition to be processed by the available vector registers, it is preferred that the width of the second partition is less than or equal to the total number of vector registers.
The width of the second partition provided by the present embodiment is less than or equal to the total number of vector registers, so that the data of the second partition can be processed by the vector register bank.
When updating the data of the vector registers, in a preferred embodiment, before storing, by the Store component of the vector processor, the sub-matrix data corresponding to each second partition in the vector register set at the corresponding position to form the converted matrix, the method further includes:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data stored in the last o vector registers of the vector register set to the first o vector registers; wherein the size of o is the size of the width overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second partition into the m vector registers in the vector register set by using the Load component of the vector processor, and storing the sub-matrix data corresponding to each second partition in the vector register set at the corresponding position by using the Store component of the vector processor to form the converted matrix; wherein m is the difference between the width W_win of the second partition and o.
In implementation, the specific process for updating the data of the vector register set in step S3517 is shown in fig. 6. Fig. 6 is a flowchart of a method for updating the data of a vector register set according to an embodiment of the present application, where the method includes:
S35171: copy the data of the last o vector registers to the first o registers in sequence.
Taking the running example with W_win = 6 and o = 2: when the loop indices take the values reached at the end of the first window, the result of copying the data in the last 2 vector registers to the first 2 registers in sequence is shown in Table 4.

TABLE 4 Data of the registers after copying the data in the last o vector registers to the first o registers
Register: VR0  VR1  VR2  VR3  VR4  VR5
Data:     a14  a15  a12  a13  a14  a15

As seen from Table 4, the data in the vector registers VR0, VR1, VR2, VR3, VR4, VR5 are a14, a15, a12, a13, a14, a15, respectively.
S35172: inputting the image through a Load part of a vector processor
Figure 671538DEST_PATH_IMAGE164
The size of the position is
Figure 296555DEST_PATH_IMAGE165
After the sub-matrix data are loaded to the vector register group in sequence
Figure 562451DEST_PATH_IMAGE166
In a vector register.
To be provided with
Figure 591193DEST_PATH_IMAGE077
Figure 967948DEST_PATH_IMAGE078
Figure 447471DEST_PATH_IMAGE079
Figure 133536DEST_PATH_IMAGE080
Figure 698510DEST_PATH_IMAGE094
Figure 878955DEST_PATH_IMAGE167
For example, when
Figure 212985DEST_PATH_IMAGE147
Figure 164891DEST_PATH_IMAGE162
Figure 217161DEST_PATH_IMAGE168
Input an image
Figure 201297DEST_PATH_IMAGE164
The size of the position is
Figure 426653DEST_PATH_IMAGE165
After the sub-matrix data are loaded to the input window vector register group in sequence
Figure 205253DEST_PATH_IMAGE166
The results of the vector registers are shown in table 5, and table 5 shows the data of the updated vector processor group.
TABLE 5 data of the updated vector processor set
Figure 744819DEST_PATH_IMAGE169
As shown in table 5, the data in the vector registers VR0, VR1, VR2, VR3, VR4, and VR5 are a14, a15, a16, a17, a18, and a19, respectively, that is, the update of the data in the vector register group is realized.
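The two-step update of Tables 4 and 5 can be modeled on a Python list standing in for VR0 to VR5; the values of o = 2 and W_win = 6 are those of the running example:

```python
OVERLAP = 2     # o: width overlap of adjacent second partitions
W_WIN = 6       # line input window width
vr = ["a10", "a11", "a12", "a13", "a14", "a15"]  # state after the first window

# Step S35171: copy the last o registers to the first o registers (Table 4)
vr[:OVERLAP] = vr[-OVERLAP:]
# vr is now a14, a15, a12, a13, a14, a15

# Step S35172: load the next m = W_win - o new columns into the tail registers
new_cols = ["a16", "a17", "a18", "a19"]
vr[OVERLAP:] = new_cols
# vr is now a14 .. a19, matching Table 5
```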
The data updating method provided by the embodiment enables the data of each second partition to be processed, and improves the access performance of the vector processor.
In the above embodiments, the method for converting an image into a matrix row is described in detail, and the present application also provides a corresponding embodiment of the vector processor-oriented image-to-matrix row conversion apparatus. It should be noted that the present application describes the embodiments of the apparatus portion from two perspectives, one from the perspective of the function module and the other from the perspective of the hardware.
Fig. 7 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to an embodiment of the present application. This embodiment is described from the perspective of function modules and includes:
the acquisition module 10 is used for acquiring a target image and storing the target image into the DDR;
the loading module 11 is configured to invoke a DMA operation to load a target image from a DDR to an AM space;
a conversion processing module 12, configured to perform vectorization conversion processing on the target image from an image to a matrix row by using a Load component and a Store component of the vector processor, and obtain a converted matrix; wherein, the target image exists in the form of multi-dimensional array in the vector processor;
and the storage module 13 is used for calling DMA operation to store the converted matrix from the AM space back to the DDR.
Since the embodiment of the apparatus portion and the embodiment of the method portion correspond to each other, please refer to the description of the embodiment of the method portion for the embodiment of the apparatus portion, and details are not repeated here.
Since the above-mentioned vector processor-oriented image-to-matrix row conversion method and the vector processor-oriented image-to-matrix row conversion apparatus of the present embodiment share the same technical features, the apparatus provided by the present embodiment has the same beneficial effects as the method.
Fig. 8 is a block diagram of an image-to-matrix row conversion apparatus facing a vector processor according to another embodiment of the present application. This embodiment is based on a hardware perspective, and as shown in fig. 8, the apparatus includes:
a memory 20 for storing a computer program;
a processor 21 for implementing the steps of the vector processor oriented method of image to matrix row conversion as mentioned in the above embodiments when executing the computer program.
The image-to-matrix row conversion apparatus for vector processors provided in this embodiment may include, but is not limited to, a smart phone, a tablet computer, a notebook computer, or a desktop computer.
The processor 21 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The Processor 21 may be implemented in hardware using at least one of a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), and a Programmable Logic Array (PLA). The processor 21 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 21 may be integrated with a Graphics Processing Unit (GPU) which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 21 may further include an Artificial Intelligence (AI) processor for processing computational operations related to machine learning.
Memory 20 may include one or more computer-readable storage media, which may be non-transitory. Memory 20 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 20 is at least used for storing a computer program 201, wherein the computer program is loaded and executed by the processor 21, and then the relevant steps of the method for converting an image into a matrix row disclosed in any one of the foregoing embodiments can be implemented. In addition, the resources stored in the memory 20 may also include an operating system 202, data 203, and the like, and the storage manner may be a transient storage manner or a permanent storage manner. Operating system 202 may include, among others, windows, unix, linux, and the like. Data 203 may include, but is not limited to, data involved in the vector processor-oriented image-to-matrix row conversion methods mentioned above, and the like.
In some embodiments, the vector processor-oriented image-to-matrix row conversion device may further include a display screen 22, an input-output interface 23, a communication interface 24, a power supply 25, and a communication bus 26.
Those skilled in the art will appreciate that the architecture shown in fig. 8 does not constitute a limitation of the image-to-matrix row conversion means of the vector-oriented processor and may include more or fewer components than those shown.
The image-to-matrix row conversion device facing the vector processor provided by the embodiment of the application comprises a memory and a processor; when the processor executes the program stored in the memory, it can implement the vector processor-oriented image-to-matrix row conversion method described above, with the same beneficial effects as that method.
Finally, the application also provides a corresponding embodiment of the computer readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps as set forth in the above-mentioned method embodiments.
It is to be understood that if the method in the above embodiments is implemented in the form of software functional units and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and executes all or part of the steps of the methods described in the embodiments of the present application, or all or part of the technical solutions. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The computer-readable storage medium provided by the present application includes the above-mentioned image-to-matrix row conversion method for vector processors, and the effects are the same as above.
The present application provides a method, an apparatus, and a medium for converting an image into a matrix row for a vector processor. The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part. It should be noted that, for those skilled in the art, without departing from the principle of the present application, the present application can also make several improvements and modifications, and those improvements and modifications also fall into the protection scope of the claims of the present application.
It should also be noted that, in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.

Claims (7)

1. A method for converting an image into matrix rows facing a vector processor, comprising:
obtaining a target image, storing the target image in a double rate synchronous dynamic random access memory (DDR SDRAM), and marking it as I; the stored data layout is given in terms of N, H, W, w and C_b, wherein N indicates the batch size, H and W represent the height and width of the input feature map, w represents the width of data processed in parallel by the vector processing unit, and C_b represents the number of blocks of a channel on the input feature map;
invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory into an array memory space; after being transmitted into the array memory space, the data is expanded by a filling operation into a padded form marked as I'; wherein I' represents the filled data of the target image transferred to the array memory space, H' represents the filled height of the target image transferred into the array memory space, and W' represents the width of the target image after transfer to the array memory space;
performing vectorization conversion processing from an image to matrix rows on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
invoking the direct memory access operation to store the converted matrix from the array memory space back into the double rate synchronous dynamic random access memory;
the invoking of the direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first block from the double-rate synchronous dynamic random access memory to the corresponding position of the array memory space;
the performing image-to-matrix-row vectorization conversion processing on the target image by using a vector reading component and a vector writing component of a vector processor and acquiring a converted matrix comprises:
partitioning the first partition in the array memory space to obtain a second partition;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector reading component of the vector processor;
storing, by the vector writing component of the vector processor, the sub-matrix data corresponding to each of the second partitions in the vector register set at a corresponding position to form the converted matrix;
obtaining the converted matrix;
before the vector writing component of the vector processor stores the sub-matrix data corresponding to each of the second partitions in the vector register set at a corresponding position to form the converted matrix, the method further comprises:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data stored in the last o vector registers of the vector register set to the first o vector registers; wherein o is the width overlap of two adjacent second partitions;
sequentially loading the sub-matrix data corresponding to each second partition into m vector registers of the vector register set by using the vector reading component of the vector processor, and entering the step of storing the sub-matrix data corresponding to each second partition in the vector register set at a corresponding position by using the vector writing component of the vector processor to form the converted matrix; wherein the size of m is the difference between the width of the second partition and o.
2. The vector processor-oriented image-to-matrix row conversion method of claim 1, wherein the blocking the target image to obtain first blocks comprises:
blocking the target image according to the array memory space parameters so as to obtain each first block; wherein the array memory space parameters at least comprise index parameters corresponding to each dimension.
3. The vector processor-oriented image-to-matrix row conversion method of claim 2, wherein the blocking the first partition in the array memory space to obtain a second partition comprises:
and partitioning the first partition in the array memory space according to the array memory space parameters and the parameters of convolution calculation so as to obtain the second partition.
4. The vector processor-oriented image-to-matrix row conversion method of claim 3, wherein the width of the second partition is less than or equal to the total number of vector registers.
5. An image-to-matrix row conversion apparatus for a vector processor, comprising:
an acquisition module for acquiring a target image and storing the target image in a double-rate synchronous dynamic random access memory (DDR SDRAM) with a tag of
Figure 301857DEST_PATH_IMAGE002
The stored data is laid out as
Figure DEST_PATH_IMAGE004A
Wherein
Figure DEST_PATH_IMAGE006A
Which is indicative of the size of the batch process,
Figure DEST_PATH_IMAGE031
and
Figure DEST_PATH_IMAGE010A
representing the height and width of the input feature map,
Figure DEST_PATH_IMAGE012A
representing the width of data processed in parallel by the vector processing unit,
Figure DEST_PATH_IMAGE014A
representing the number of blocks of a channel on the input feature map;
a loading module for invoking a direct memory access operation to load the target image from the double-rate synchronous dynamic random access memory to an array memory space; after being transmitted into the array memory space, the data is expanded into the array memory space through a filling operation
Figure DEST_PATH_IMAGE016A
Is marked by
Figure DEST_PATH_IMAGE018A
(ii) a Wherein
Figure DEST_PATH_IMAGE020A
Representing the filled data transferred to the array memory space of the target image,
Figure DEST_PATH_IMAGE022A
representing the filled height of the target image transferred into the array memory space,
Figure DEST_PATH_IMAGE024A
representing the width of the target image after being transmitted to the array memory space;
the conversion processing module is used for carrying out vectorization conversion processing from an image to a matrix row on the target image by using a vector reading component and a vector writing component of the vector processor and acquiring a converted matrix; wherein the target image is present in the vector processor in the form of a multi-dimensional array;
a storage module, configured to invoke the dma operation to store the converted matrix from the array memory space back to the ddr sdram;
the invoking a direct memory access operation to load the target image from the double rate synchronous dynamic random access memory to an array memory space comprises:
blocking the target image to obtain first blocks;
acquiring sub-matrix data corresponding to each first block;
calling the direct memory access operation to respectively load the sub-matrix data corresponding to each first partition from the double-rate synchronous dynamic random access memory to a corresponding position of the array memory space;
the vectorization processing of the target image to a matrix row by using the vector reading part and the vector writing part of the vector processor and obtaining the converted matrix comprises:
blocking the first tile in the array memory space to obtain a second tile;
acquiring sub-matrix data corresponding to each second block;
loading the sub-matrix data corresponding to each of the second partitions into a vector register set using the vector read component of the vector processor;
storing, by the vector write component of the vector processor, sub-matrix data corresponding to each of the second partitions in the vector register set in a corresponding location to form the converted matrix;
obtaining the converted matrix;
before the vector write component of the vector processor stores the sub-matrix data corresponding to each second block from the vector register set at the corresponding location to form the converted matrix, the method further comprises:
when the load index corresponding to the current dimension is smaller than the size of the current dimension, sequentially copying the sub-matrix data held in o vector registers of the vector register set to o other vector registers of the set; wherein o is the width overlap of two adjacent second blocks;
sequentially loading the sub-matrix data corresponding to each second block into m vector registers of the vector register set using the vector read component of the vector processor, and proceeding to the step of storing, by the vector write component of the vector processor, the sub-matrix data corresponding to each second block from the vector register set at the corresponding location to form the converted matrix; wherein m is the difference between the second-block width w and the overlap o, i.e. m = w − o.
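The register-reuse rule, with o the width overlap of adjacent second blocks and m = w − o new elements loaded per block, can be sketched with plain lists; `load_with_overlap` is a hypothetical helper and the register set is modeled as a Python list.

```python
def load_with_overlap(row, block_w, overlap):
    """Sketch of the claimed reuse rule: adjacent second blocks of
    width block_w overlap by `overlap` elements, so only
    m = block_w - overlap new elements are read per block; the
    overlapping elements are copied from the previous block's registers."""
    m = block_w - overlap
    regs = list(row[:block_w])          # initial full load of one block
    blocks = [list(regs)]
    pos = block_w
    while pos + m <= len(row):
        # copy the last `overlap` registers, then read only m new elements
        regs = regs[-overlap:] + list(row[pos:pos + m])
        blocks.append(list(regs))
        pos += m
    return blocks

blks = load_with_overlap(list(range(8)), 4, 2)   # m = 2 new loads per block
```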
6. An image-to-matrix row conversion apparatus for a vector processor, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the vector-processor-oriented image-to-matrix-row conversion method as claimed in any one of claims 1 to 4 when executing said computer program.
7. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the vector-processor-oriented image-to-matrix-row conversion method of any one of claims 1 to 4.
CN202211043942.9A 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium Active CN115114575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211043942.9A CN115114575B (en) 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium


Publications (2)

Publication Number Publication Date
CN115114575A CN115114575A (en) 2022-09-27
CN115114575B true CN115114575B (en) 2023-01-31

Family

ID=83335344

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211043942.9A Active CN115114575B (en) 2022-08-30 2022-08-30 Vector processor-oriented image-to-matrix row conversion method, device and medium

Country Status (1)

Country Link
CN (1) CN115114575B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110796235A (en) * 2019-10-21 2020-02-14 中国人民解放军国防科技大学 Vectorization implementation method for Valid convolution of convolutional neural network
WO2020059156A1 (en) * 2018-09-18 2020-03-26 Nec Corporation Data processing system, method, and program
CN113806261A (en) * 2021-10-09 2021-12-17 中国人民解放军国防科技大学 Pooling vectorization implementation method for vector processor
CN114330669A (en) * 2021-12-30 2022-04-12 中国人民解放军国防科技大学 Vector processor-oriented semi-precision vectorization conv1 multiplied by 1 convolution method and system
CN114329326A (en) * 2021-12-10 2022-04-12 中国人民解放军国防科技大学 Low-bit-width data matrix vectorization column expansion method and system of vector processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11417373B2 (en) * 2020-12-09 2022-08-16 Micron Technology, Inc. Neuromorphic computing devices and methods



Similar Documents

Publication Publication Date Title
JP6977239B2 (en) Matrix multiplier
CN109919311B (en) Method for generating instruction sequence, method and device for executing neural network operation
JP7329533B2 (en) Method and accelerator apparatus for accelerating operations
CN109213962B (en) Operation accelerator
EP3698313B1 (en) Image preprocessing for generalized image processing
CN108108811B (en) Convolution calculation method in neural network and electronic device
CN107844828B (en) Convolution calculation method in neural network and electronic device
CN111758107B (en) System and method for hardware-based pooling
US10769749B2 (en) Processor, information processing apparatus, and operation method of processor
CN110415157B (en) Matrix multiplication calculation method and device
WO2021080873A1 (en) Structured pruning for machine learning model
EP3093757B1 (en) Multi-dimensional sliding window operation for a vector processor
CN113673701A (en) Method for operating neural network model, readable medium and electronic device
KR20210014561A (en) Method and apparatus for extracting image data in parallel from multiple convolution windows, device, and computer-readable storage medium
CN111125617A (en) Data processing method, data processing device, computer equipment and storage medium
CN113313247B (en) Operation method of sparse neural network based on data flow architecture
US20230259743A1 (en) Neural network accelerator with configurable pooling processing unit
US20090064120A1 (en) Method and apparatus to achieve maximum outer level parallelism of a loop
CN112966729B (en) Data processing method and device, computer equipment and storage medium
CN115114575B (en) Vector processor-oriented image-to-matrix row conversion method, device and medium
CN111931937B (en) Gradient updating method, device and system of image processing model
TW202234266A (en) Performing tensor operations using a programmable control engine
US11748100B2 (en) Processing in memory methods for convolutional operations
EP4300369A1 (en) Methods and systems for executing a neural network on a neural network accelerator
CN117851742A (en) Data storage method, data processing method, data memory and data processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant