CN117437114A

CN117437114A - Image data processing method, device, electronic equipment and medium

Info

Publication number: CN117437114A
Application number: CN202311598934.5A
Authority: CN
Inventors: 李钦; 吉祥虎; 杨超; 冯新宇; 党衍斌; 罗正航; 刘凌志
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Lingchuan Technology Co ltd
Priority date: 2023-11-27
Filing date: 2023-11-27
Publication date: 2024-01-23

Abstract

The embodiments of the present disclosure provide an image data processing method, an image data processing apparatus, an electronic device, and a medium. The method is applied to a systolic array architecture comprising a systolic array, and comprises: segmenting an image to be processed to obtain an image block to be processed; obtaining original image data of the image block to be processed, and performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format, wherein the preset data format is a format in which blocks are stored contiguously in the order of height, width, and channel; and inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed. Because the method represents the target image data in the preset data format, the target image data can be used immediately for matrix block operations without address conversion or judgment, which reduces memory bandwidth waste and allows the systolic array to reach maximum calculation efficiency when performing the convolution calculation.

Description

Image data processing method, device, electronic equipment and medium

Technical Field

The present disclosure relates to the field of image processing technology, and in particular, to an image data processing method, an image data processing apparatus, an electronic device, and a computer-readable storage medium.

Background

Systolic arrays are a type of hardware accelerator specifically designed for artificial intelligence applications. They use custom computing units to perform the computing tasks common in deep learning, such as matrix multiplication and convolution. In image processing, systolic arrays may be applied to various tasks such as image filtering, morphological operations, and feature extraction. Because systolic arrays process data in parallel, they can efficiently handle large-scale image data and therefore have broad application prospects in image processing.

In the systolic array-based image processing, image data is segmented into a plurality of small blocks, and then the segmented small blocks are distributed to a calculation unit in a systolic array for parallel calculation. However, the existing image data format has problems of memory bandwidth waste and low convolution calculation efficiency on the systolic array during the processing process.

It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.

Disclosure of Invention

The embodiments of the present disclosure provide an image data processing method, an image data processing apparatus, an electronic device, and a computer-readable storage medium, which solve, at least to a certain extent, the problems of memory bandwidth waste and low convolution calculation efficiency on a systolic array that arise when existing image data formats are processed.

An embodiment of the present disclosure provides an image data processing method applied to a systolic array architecture, the systolic array architecture including a systolic array, the method including: segmenting an image to be processed to obtain one or more image blocks to be processed; obtaining original image data of the image block to be processed, and performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format, the preset data format being a format in which blocks are stored contiguously in the order of height, width, and channel; and inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises a dynamic random access memory and a static random access memory; the obtaining the original image data of the image block to be processed, performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format, including: determining position information of the image block to be processed in the dynamic random access memory; acquiring the original image data stored in the dynamic random access memory according to the determined position information; and reading the original image data from the dynamic random access memory to the static random access memory, and performing format conversion processing on the original image data in the data reading process to obtain the target image data.

In some exemplary embodiments of the present disclosure, the performing format conversion processing on the original image data to obtain the target image data includes: acquiring the data precision type of the image to be processed and the number of bytes in a single read of the static random access memory; determining a target matrix row number and a target matrix column number according to the data precision type of the image to be processed and the number of bytes in a single read of the static random access memory; and performing data padding processing on the original image data according to the target matrix row number and the target matrix column number to obtain the target image data.

In some exemplary embodiments of the present disclosure, the determining the target matrix row number and the target matrix column number according to the data precision type of the image to be processed and the number of bytes in a single read of the static random access memory includes: calculating, from the number of bytes in a single read of the static random access memory, the number of data elements of the data precision type of the image to be processed that a single read contains; and determining the target matrix row number and the target matrix column number according to that number of data elements.
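The arithmetic in this step can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the table of byte widths, and the assumption that the block is square are all hypothetical. Only the 256-byte, int8, 16x16 case is given later in the description.

```python
import math
from typing import Tuple

# Assumed byte widths per precision type; only int8 is named in the description.
BYTES_PER_ELEMENT = {"int8": 1, "int16": 2, "fp16": 2, "fp32": 4}


def elements_per_read(read_bytes: int, precision: str) -> int:
    """Number of elements of the given precision that fit in one SRAM read."""
    size = BYTES_PER_ELEMENT[precision]
    if read_bytes % size != 0:
        raise ValueError("read size is not a whole number of elements")
    return read_bytes // size


def target_block_shape(read_bytes: int, precision: str) -> Tuple[int, int]:
    """Return (rows, cols) of the matrix block covered by a single read.

    Assumption: the block is square, as in the 256-byte / int8 / 16x16 example.
    """
    count = elements_per_read(read_bytes, precision)
    side = math.isqrt(count)
    if side * side != count:
        raise ValueError(f"{count} elements do not form a square block")
    return side, side
```

Under these assumptions, `target_block_shape(256, "int8")` yields a 16x16 block. Precision types whose element count is not a perfect square would need a different rule, which the description does not specify.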

In some exemplary embodiments of the present disclosure, the original image data includes pixel data of each pixel point in the image block to be processed, the pixel data including the coordinate position of each pixel point and the data of each pixel point in the channel direction; the performing data padding processing on the original image data according to the target matrix row number and the target matrix column number to obtain the target image data includes: arranging, according to the coordinate positions of the pixel points, the data of each pixel point in the channel direction row by row to obtain ordered image data, wherein the matrix column number of the ordered image data is the number of channels of the original image data, and the matrix row number of the ordered image data is the number of pixel points contained in the original image data; and performing data padding processing on the ordered image data in the column direction according to the target matrix column number, and performing data padding processing on the ordered image data in the row direction according to the target matrix row number, to obtain the target image data, wherein the matrix column number of the target image data is the target matrix column number, and the matrix row number of the target image data is the target matrix row number.
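As a rough sketch of the ordering and padding described here, the following code arranges one pixel per row and one channel per column, then pads both directions. The names are hypothetical, and two points are assumptions: padding uses zeros (as the detailed description states for the HWC_BLK format), and the row and column counts are rounded up to multiples of the block dimensions.

```python
from typing import Dict, List, Tuple


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def order_and_pad(
    pixels: Dict[Tuple[int, int], List[int]],
    height: int,
    width: int,
    block_rows: int,
    block_cols: int,
) -> List[List[int]]:
    """Build a (pixels x channels) matrix and zero-pad it to block multiples.

    `pixels` maps a (y, x) coordinate to that pixel's per-channel values.
    """
    # One row per pixel, visited row by row through the image.
    rows = [list(pixels[(y, x)]) for y in range(height) for x in range(width)]
    channels = len(rows[0])
    padded_cols = round_up(channels, block_cols)
    padded_rows = round_up(len(rows), block_rows)
    # Pad in the column (channel) direction.
    for row in rows:
        row.extend([0] * (padded_cols - channels))
    # Pad in the row (pixel) direction.
    missing = padded_rows - len(rows)
    rows.extend([[0] * padded_cols for _ in range(missing)])
    return rows
```

For a 2x2 RGB block and 4x4 target blocks, the result is a 4x4 matrix in which each row holds one pixel's three channel values followed by a single zero.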

In some exemplary embodiments of the present disclosure, the method further comprises: acquiring parameter data, and segmenting the parameter data to acquire one or more segmented parameter data; and determining target parameter data corresponding to the target image data from the one or more sliced parameter data.

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises an input buffer and a parameter buffer; the inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed, includes: reading the target image data into the input buffer, and reading the target parameter data into the parameter buffer; inputting the target image data from the input buffer to the systolic array, and inputting the target parameter data from the parameter buffer to the systolic array; and performing convolution calculation on the target image data and the target parameter data through the systolic array to obtain the convolution calculation result.

According to another aspect of the embodiments of the present disclosure, there is provided an image data processing apparatus applied to a systolic array architecture including a systolic array, the apparatus comprising: a segmentation unit configured to segment an image to be processed to obtain one or more image blocks to be processed; a format conversion unit configured to obtain original image data of the image block to be processed, and perform format conversion processing on the original image data to obtain target image data arranged according to a preset data format, the preset data format being a format in which blocks are stored contiguously in the order of height, width, and channel; and a calculation unit configured to input the target image data into the systolic array, and perform convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises a dynamic random access memory and a static random access memory; wherein the format conversion unit is further configured to: determining position information of the image block to be processed in the dynamic random access memory; acquiring the original image data stored in the dynamic random access memory according to the determined position information; and reading the original image data from the dynamic random access memory to the static random access memory, and performing format conversion processing on the original image data in the data reading process to obtain the target image data.

In some exemplary embodiments of the present disclosure, the format conversion unit is further configured to: acquire the data precision type of the image to be processed and the number of bytes in a single read of the static random access memory; determine a target matrix row number and a target matrix column number according to the data precision type of the image to be processed and the number of bytes in a single read of the static random access memory; and perform data padding processing on the original image data according to the target matrix row number and the target matrix column number to obtain the target image data.

In some exemplary embodiments of the present disclosure, the format conversion unit is further configured to: calculating the data quantity of the single read data meeting the data precision type of the image to be processed according to the byte number of the single read data of the static random access memory; and determining the number of rows of the target matrix and the number of columns of the target matrix according to the data quantity of the single read data.

In some exemplary embodiments of the present disclosure, the original image data includes pixel data of each pixel point in the image block to be processed, the pixel data including the coordinate position of each pixel point and the data of each pixel point in the channel direction; wherein the format conversion unit is further configured to: arrange, according to the coordinate positions of the pixel points, the data of each pixel point in the channel direction row by row to obtain ordered image data, wherein the matrix column number of the ordered image data is the number of channels of the original image data, and the matrix row number of the ordered image data is the number of pixel points contained in the original image data; and perform data padding processing on the ordered image data in the column direction according to the target matrix column number, and perform data padding processing on the ordered image data in the row direction according to the target matrix row number, to obtain the target image data, wherein the matrix column number of the target image data is the target matrix column number, and the matrix row number of the target image data is the target matrix row number.

In some exemplary embodiments of the present disclosure, the segmentation unit is further configured to: acquiring parameter data, and segmenting the parameter data to acquire one or more segmented parameter data; and determining target parameter data corresponding to the target image data from the one or more sliced parameter data.

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises an input buffer and a parameter buffer; wherein the calculation unit is further configured to: read the target image data into the input buffer, and read the target parameter data into the parameter buffer; input the target image data from the input buffer to the systolic array, and input the target parameter data from the parameter buffer to the systolic array; and perform convolution calculation on the target image data and the target parameter data through the systolic array to obtain the convolution calculation result.

An embodiment of the present disclosure provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute executable instructions to implement the image data processing method as in any of the above.

An embodiment of the present disclosure provides a computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, enable the electronic device to perform the image data processing method of any one of the above.

The disclosed embodiments provide a computer program product comprising a computer program which, when executed by a processor, implements the image data processing method of any one of the above.

According to the image data processing method provided by the embodiments of the present disclosure, an image to be processed is segmented into one or more image blocks to be processed, and convolution calculation is performed on the image data of each image block through a systolic array to obtain a convolution calculation result. In this process, the original image data of the image block to be processed is first obtained and then converted into target image data arranged according to a preset data format, that is, a format in which blocks are stored contiguously in the order of height, width, and channel; the systolic array then performs the convolution calculation on the target image data. Because the target image data is represented in the preset data format, it has a blocked data structure and is stored in the order of height, width, and channel, so that it contains the data of every channel. The target image data that is read can therefore be used immediately for matrix block operations without address conversion or judgment. This reduces memory bandwidth waste, solves the problems of memory bandwidth waste and low convolution calculation efficiency on a systolic array that arise when existing image data formats are processed, and allows the systolic array to reach maximum calculation efficiency when performing the convolution calculation.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.

FIG. 1 shows a schematic diagram of a systolic array architecture to which an image data processing method of an embodiment of the present disclosure may be applied;

FIG. 2 is a flowchart illustrating a method of image data processing according to an exemplary embodiment;

FIG. 3 is a schematic diagram showing image data in the CHW_MAT format;

FIG. 4 is a schematic diagram of representing image data in CHW_BLK format;

FIG. 5 and FIG. 6 are schematic diagrams showing image data in the HWC_MAT format;

FIG. 7 is a schematic diagram showing representing image data in HWC_BLK format, according to an exemplary embodiment;

FIG. 8 is a schematic diagram showing image data represented in the HWC_BLK format, according to still another exemplary embodiment;

FIG. 9 is a schematic diagram showing the process of obtaining target image data by performing format conversion processing on original image data, according to an exemplary embodiment;

FIG. 10 is a schematic diagram illustrating a matrix multiplication of image data and parameter data in HWC_BLK format, according to an exemplary embodiment;

FIG. 11 is a schematic diagram illustrating a matrix multiplication of image data and parameter data in HWC_BLK format, according to still another exemplary embodiment;

FIG. 12 is a schematic diagram illustrating an image data processing method applied to a systolic array architecture, according to an example embodiment;

FIG. 13 is a schematic diagram illustrating the application of an image data processing method to a pooling operation in accordance with an exemplary embodiment;

FIG. 14 is a schematic diagram illustrating an image data processing method applied to a systolic array architecture, according to yet another example embodiment;

FIG. 15 is a block diagram of an image data processing apparatus according to an exemplary embodiment;

FIG. 16 is a schematic diagram illustrating the structure of an electronic device suitable for implementing an exemplary embodiment of the present disclosure.

Detailed Description

Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.

The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.

The drawings are merely schematic illustrations of the present disclosure, in which like reference numerals denote like or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in at least one hardware module or integrated circuit or in different networks and/or processor devices and/or microcontroller devices.

The flow diagrams depicted in the figures are exemplary only, and not necessarily all of the elements or steps are included or performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.

In the present specification, the terms "a," "an," "the," "said" and "at least one" are used to indicate the presence of at least one element/component/etc.; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc., in addition to the listed elements/components/etc.; the terms "first," "second," and "third," etc. are used merely as labels, and do not limit the number of their objects.

FIG. 1 shows a schematic diagram of a systolic array architecture to which an image data processing method of an embodiment of the present disclosure may be applied. The systolic array architecture shown in FIG. 1 may also be referred to as a smart chip architecture, in which storage and processing are integrated by combining distributed storage with central processing units, and a systolic array is used for acceleration to meet various data processing and computing requirements.

As shown in fig. 1, the systolic array architecture may include a dynamic random access memory 110, a static random access memory (SRAM, static Random Access Memory) 120, an input buffer 130, a parameter buffer 140, and a systolic array 150.

The dynamic random access memory 110 may be a DDR (Double Data Rate Synchronous Dynamic Random Access Memory) used for storing and rapidly retrieving large amounts of data, such as the data required by machine learning algorithms and image processing. The static random access memory 120 may be used to temporarily store data read from the dynamic random access memory 110 for subsequent processing by the systolic array. Because the static random access memory 120 has a faster read/write speed than the dynamic random access memory 110, data that has been read into the static random access memory 120 can be obtained and processed quickly, thereby improving system performance.

The parameter buffer 140 may be used to store and retrieve the parameters of tasks such as machine learning algorithms or image processing, for example weight matrices or algorithm parameters. These parameters need to be read and updated frequently while training or performing tasks, so buffering them improves access speed and efficiency. The input buffer 130 may be referred to as a Tensor buffer and is used for storing and retrieving the data on which tasks such as machine learning algorithms or image processing are performed; the input buffer 130 may also perform preprocessing such as normalization and standardization on the data to improve the execution efficiency of the tasks.

The systolic array 150 is a hardware accelerator designed specifically for artificial intelligence applications and is adapted to perform computing tasks common in deep learning, such as matrix multiplication and convolution. As shown in FIG. 1, the systolic array 150 includes a plurality of computing units 151 for efficiently performing computing tasks in deep learning. Each computing unit 151 may include an adder and a multiplier for performing matrix multiplication, convolution, and similar operations. The computing units 151 may be organized into a two-dimensional array, in which each computing unit 151 has a unique address and may perform computations independently. Furthermore, the connections between the computing units 151 in the systolic array 150 are multi-directional, and data may flow along them at different speeds, so that data can flow between different computing units and computing tasks can be performed in parallel. Each computing unit 151 may also contain some local memory and execution units, which allows the systolic array 150 to perform more complex computing tasks; for example, a computing unit may use its local memory to store intermediate results, thereby avoiding repeated computation.
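To make the flow of data through the computing units concrete, the following is a small software simulation of a matrix multiplication on a two-dimensional grid of multiply-accumulate units. It is only a sketch under assumptions that do not come from this disclosure: it uses an output-stationary dataflow in which each unit keeps its own accumulator in local memory, and operands are skewed so that the matching pair reaches unit (i, j) on cycle s + i + j. The name `systolic_matmul` is hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def systolic_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Simulate an output-stationary systolic array computing a @ b.

    Unit (i, j) holds one accumulator. On each cycle it multiplies the
    operand pair that has reached it and adds the product to its accumulator.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    acc = [[0] * cols for _ in range(rows)]
    # The last pair (s = inner - 1) reaches the far corner unit
    # (rows - 1, cols - 1) on cycle rows + cols + inner - 3.
    total_cycles = rows + cols + inner - 2
    for cycle in range(total_cycles):
        for i in range(rows):
            for j in range(cols):
                s = cycle - i - j  # index of the pair arriving this cycle
                if 0 <= s < inner:
                    acc[i][j] += a[i][s] * b[s][j]
    return acc
```

In hardware all units in the two inner loops operate simultaneously within a cycle; the nested loops here only serialize that parallel step for illustration.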

In the embodiments of the present disclosure, the systolic array architecture may: segment an image to be processed to obtain one or more image blocks to be processed; obtain original image data of the image block to be processed, and perform format conversion processing on the original image data to obtain target image data arranged according to a preset data format, wherein the preset data format is a format in which blocks are stored contiguously in the order of height, width, and channel; and input the target image data into the systolic array, and perform convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

It should be understood that the systolic array architecture shown in fig. 1 is only illustrative, and the systolic array architecture may include other dynamic random access memories, or other static random access memories, and the number of computing units in the systolic array architecture is illustrative, and a corresponding number of computing units may be set according to actual needs.

In order for those of ordinary skill in the art to better understand the technical solutions of the present disclosure, the steps of the image data processing method in the exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings and embodiments.

Fig. 2 is a flowchart illustrating an image data processing method according to an exemplary embodiment, and the method provided in the embodiment of fig. 2 may be implemented by any electronic device, for example, a systolic array architecture shown in fig. 1, where the systolic array architecture may include a systolic array, a dynamic random access memory, a static random access memory, an input buffer, and a parameter buffer, but the disclosure is not limited thereto. As shown in fig. 2, the image data processing method provided by the embodiment of the present disclosure may include the following steps.

Step S210: segment the image to be processed to obtain one or more image blocks to be processed.

The image to be processed refers to an image that needs further processing by the systolic array architecture, such as enhancement, restoration, or filtering, in order to improve the quality of the image or extract useful information.

In the systolic array architecture, the capacity of the static random access memory is limited and cannot cache large image data, and the number of computing units in the systolic array is also limited, so the input data of the complete image to be processed cannot be read at one time. Therefore, the image to be processed is segmented into one or more image blocks to be processed, and each image block is processed separately: the image data of the image blocks is transmitted to the systolic array in sequence, the computing units of the systolic array process the received image blocks, the processing results are output in the order of the image blocks, and the output results are finally combined to obtain the final processing result.
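The segmentation step can be sketched as a simple tiling of the image plane. This is an illustrative assumption rather than the disclosed procedure: the tile size, the function name, and the choice of non-overlapping tiles are hypothetical, and any overlap between neighbouring blocks that a convolution kernel might require is not modelled.

```python
from typing import List, Tuple

# (top, left, tile_height, tile_width)
Tile = Tuple[int, int, int, int]


def split_into_tiles(height: int, width: int, tile_h: int, tile_w: int) -> List[Tile]:
    """Cover a height x width image with non-overlapping tiles.

    Tiles are listed row by row; edge tiles are smaller when the image
    size is not a multiple of the tile size.
    """
    tiles = []
    for top in range(0, height, tile_h):
        for left in range(0, width, tile_w):
            tiles.append(
                (top, left, min(tile_h, height - top), min(tile_w, width - left))
            )
    return tiles
```

Processing the tiles in the returned order and writing each result back at its (top, left) position corresponds to outputting results in block order and combining them at the end.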

Step S220: obtain original image data of the image block to be processed, and perform format conversion processing on the original image data to obtain target image data arranged according to a preset data format.

The preset data format is a format in which blocks are stored contiguously in the order of height, width, and channel, namely the HWC_BLK format (height-width-channel contiguous block). When image data in this format is processed, the channel information of each pixel (e.g., the RGB channels) is traversed first, and the pixels are then traversed along the height and width directions.

For ease of understanding, the data structures of three existing image data formats are described first: the CHW_MAT format (channel-height-width contiguous matrix, in which a matrix is stored contiguously in the order of channel, height, width), the CHW_BLK format (channel-height-width contiguous block, in which blocks are stored contiguously in the order of channel, height, width), and the HWC_MAT format (height-width-channel contiguous matrix, in which a matrix is stored contiguously in the order of height, width, channel).

FIG. 3 is a schematic diagram showing image data in the CHW_MAT format, which is a format in which a matrix is stored contiguously in the order of channel, height, width. As shown in FIG. 3, the pixel values of the pixel points in the feature map of the same channel are arranged together and ordered by coordinate position; within the feature map of one channel, pixel values of pixel points belonging to the same row are arranged in that row along the width direction, and pixel values of pixel points belonging to the same column are arranged in that column along the height direction. In FIG. 3, R1 to R900 represent the pixel values of the 1st to 900th pixel points in the feature map of the R channel, G1 to G900 represent the pixel values of the 1st to 900th pixel points in the feature map of the G channel, and B1 to B900 represent the pixel values of the 1st to 900th pixel points in the feature map of the B channel.

FIG. 4 is a schematic diagram showing image data in the CHW_BLK format, which is a format in which blocks are stored contiguously in the order of channel, height, and width. As shown in FIG. 4, the pixel values of the pixel points in the feature map of the same channel are arranged together, and the pixel values of all the pixel points in the feature map of one channel are divided into a plurality of blocks for storage. In FIG. 4, a 256-byte block size is taken as an example: the data must be divided into 256-byte data blocks, each of which corresponds to a 16x16 matrix block under the int8 data precision type. When the height and width of the input picture are not integer multiples of 16, they are padded with zeros, so the total size is 3x32x32.
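The padded size quoted for this example can be reproduced with a short calculation. The sketch assumes that the 900-pixel feature map of each channel is 30x30, which the description does not state explicitly, and the function names are hypothetical.

```python
from typing import Tuple


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple


def chw_blk_padded_shape(
    channels: int, height: int, width: int, side: int = 16
) -> Tuple[int, int, int]:
    """Shape after zero-padding height and width to multiples of the block side."""
    return channels, round_up(height, side), round_up(width, side)


def chw_blk_block_count(channels: int, height: int, width: int, side: int = 16) -> int:
    """Number of side x side blocks needed to store the padded data."""
    _, padded_h, padded_w = chw_blk_padded_shape(channels, height, width, side)
    return channels * (padded_h // side) * (padded_w // side)
```

For a 3-channel 30x30 input, `chw_blk_padded_shape(3, 30, 30)` gives (3, 32, 32), which matches the 3x32x32 total size above and corresponds to twelve 16x16 blocks.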

Fig. 5 and 6 are diagrams showing image data in the hwc_mat format, which is a format of continuously storing a matrix in order of height, width, channel. In the hwc_mat format shown in fig. 5, the pixel values of the pixel points at the same coordinate position are arranged in the same row, and are arranged in the order of R, G, B (i.e., in the channel direction), the pixel values of the pixel points belonging to the same row (i.e., at the same height) are sequentially arranged in the same row in the width direction, and the pixel values of the pixel points belonging to the same column are sequentially arranged in the same column in the height direction. In the hwc_mat format shown in fig. 6, pixel values of pixels at the same coordinate position are arranged in the same row, and the pixel values of one pixel are arranged in the same row in the order of R, G, B (i.e., the channel direction).
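The difference between the channel-first and channel-last orderings can be made concrete with a short sketch of the linear offsets. The function names are hypothetical, and the 30x30 image size is an assumption chosen to be consistent with the 900 pixel points per channel in fig. 3.

```python
def chw_offset(c, h, w, C, H, W):
    """Linear index of element (c, h, w) when stored in channel, height, width order."""
    return (c * H + h) * W + w


def hwc_offset(c, h, w, C, H, W):
    """Linear index of element (c, h, w) when stored in height, width, channel order."""
    return (h * W + w) * C + c


# Assumed size: 3 channels, 30x30 pixel points (900 per channel)
C, H, W = 3, 30, 30

# Offsets of the R, G and B values of the pixel point at row 0, column 1
chw = [chw_offset(c, 0, 1, C, H, W) for c in range(C)]
hwc = [hwc_offset(c, 0, 1, C, H, W) for c in range(C)]

print(chw)  # values of one pixel point are H * W elements apart
print(hwc)  # values of one pixel point are adjacent
```

In the channel-first ordering the three values of one pixel point lie 900 elements apart, whereas in the channel-last ordering they are adjacent.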

In a systolic computing scenario, the matrix block is the basic computing unit, so the continuously stored CHW_MAT and HWC_MAT formats cause the data within a block to be discontinuous in memory, which significantly increases the waste of memory bandwidth. Moreover, for convolution operations based on systolic arrays, the number of channels of the intermediate convolution results tends to increase layer by layer; the data structure of the chw_blk format may make the storage addresses of the convolution calculation data discontinuous, so that each convolution needs to access as many discontinuous data regions as there are channels, which may significantly reduce the calculation efficiency of the convolution operation on the systolic array.

To solve the above-described problem, the embodiment of the present disclosure proposes the hwc_blk format, which is a format in which blocks are stored consecutively in order of height, width, channel. Fig. 7 is a schematic diagram showing image data represented in the hwc_blk format according to an exemplary embodiment. As shown in fig. 7, the hwc_blk format applies block processing on top of the hwc_mat format. Taking a 256-byte block size as an example, the data needs to be divided into 256-byte data blocks, each corresponding to a 16x16 matrix block under the int8 data precision type. The data inside each matrix block is traversed in a Z-shaped manner, and the matrix blocks themselves are also arranged in a Z-shaped manner; where a size is not an integer multiple of 16, it needs to be zero-padded to an integer multiple of 16. In fig. 7, the data is padded to an integer multiple of 16 in the channel direction, that is, the number of columns of the matrix is padded to an integer multiple of 16; and it is padded to an integer multiple of 16 in the height x width direction, that is, the number of rows of the matrix is padded to an integer multiple of 16.

In the image data shown in fig. 7, one matrix block (Block) is arranged in each row; alternatively, a plurality of matrix blocks (Blocks) may be arranged in each row. Fig. 8 is a schematic diagram showing image data represented in the hwc_blk format according to still another exemplary embodiment. As shown in fig. 8, the image data is composed of MxN matrix blocks, each row contains N matrix blocks, each matrix block has a size of 16x16, and each matrix row of a matrix block holds the pixel values of one pixel point on each channel; since the image has only 3 channels, the remaining 13 values in each matrix row need to be zero-padded.
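The block arrangement can be sketched in NumPy as follows. This is only an illustration under one assumption: the "Z-shaped" order is read here as row-major order, both for the values inside a 16x16 block and for the sequence of blocks. The function name `to_blocks` is hypothetical.

```python
import numpy as np

BLOCK = 16  # 16x16 int8 values occupy 256 bytes


def to_blocks(mat, block=BLOCK):
    """Reorder an already padded 2-D matrix so that every block x block tile is contiguous.

    Assumption: tiles are visited row by row, and values inside a tile
    are also visited row by row.
    """
    rows, cols = mat.shape
    if rows % block or cols % block:
        raise ValueError("pad the matrix to a multiple of the block size first")
    tiles = mat.reshape(rows // block, block, cols // block, block)
    # Axes become (tile_row, tile_col, row_in_tile, col_in_tile)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, block, block)


mat = np.arange(32 * 32, dtype=np.int32).reshape(32, 32)
blocks = to_blocks(mat)
flat = blocks.reshape(-1)  # linear order in which the values would be stored
```

With this layout, one 256-byte read at int8 precision returns exactly one complete 16x16 tile.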

In step S220, the original image data of the image block to be processed is acquired, and the data structure of the original image data may be chw_mat format, chw_blk format, or hwc_mat format, or may be hwc_blk format. If the data structure of the original image data is CHW_MAT format, CHW_BLK format or HWC_MAT format, performing format conversion processing on the original image data to obtain target image data in HWC_BLK format. Of course, if the data structure of the original image data is in hwc_blk format, no processing may be performed.

In an exemplary embodiment, obtaining original image data of an image block to be processed, performing format conversion processing on the original image data, and obtaining target image data arranged according to a preset data format may include: determining position information of an image block to be processed in a dynamic random access memory; acquiring original image data stored in a dynamic random access memory according to the determined position information; and reading the original image data from the dynamic random access memory to the static random access memory, and performing format conversion processing on the original image data in the data reading process to obtain target image data.

In the embodiment of the disclosure, the dynamic random access memory can be used for storing the image data of the whole image to be processed. After determining the image block to be processed which needs to be subjected to convolution calculation, the position information of the image block to be processed in the dynamic random access memory can be determined, then the original image data of the image block to be processed is obtained according to the position information, and then the original image data of the image block to be processed is read into the static random access memory.

In the process of reading original image data from a dynamic random access memory to a static random access memory, format conversion processing can be performed on the original image data to obtain target image data stored according to a preset data format, namely, the original image data is converted into a format of continuously storing blocks according to the sequence of the height, the width and the channels.

Fig. 9 is a process diagram showing a process of obtaining target image data by performing a format conversion process on original image data according to an exemplary embodiment. As shown in fig. 9, the process of performing the format conversion process on the original image data to obtain the target image data may include the following steps.

Step S910: acquiring the data precision type of the image to be processed and the number of bytes of a single read of the static random access memory.

The data precision type refers to a representation mode of data in a computer and numerical precision ensured by the representation mode, and different data precision types have different numerical ranges and precision so as to meet different requirements and scenes. Common data precision types include integer types, such as int4, int8, int16, etc., and also fractional types, such as fp8, fp16, fp32, etc., which are used to store integer data, while fractional types are used to store fractional data. In the embodiment of the disclosure, the data precision type of the image to be processed may include int4, int8, fp8, int16, fp32, and the like.

It should be noted that the data precision type of the image to be processed is a power-of-2 data precision, and the number of rows and the number of columns of the target matrix are aligned to powers of 2, so that the consistency of the calculation format can be maintained in subsequent data calculation.

The static random access memory has the characteristic of batch reading. In the embodiment of the disclosure, batch operation is performed when image data is read from the dynamic random access memory into the static random access memory, and the size of the data read in each batch may be limited by the bandwidth between the dynamic random access memory and the static random access memory or set by the memory management unit; and batch operations are also possible when image data is read from the sram into the input buffer, because when processing images, deep learning models, or other computing tasks requiring large amounts of data, it is often necessary to divide the data into smaller batches for processing, which can reduce processing time and computing resource requirements and improve overall performance.

The number of bytes of data that a sram reads at a time may refer to the maximum number of bytes that the sram can read data at a time. For example, the number of bytes of data read at a time by the sram is 256 bytes.

Step S920: determining the number of rows of the target matrix and the number of columns of the target matrix according to the data precision type of the image to be processed and the number of bytes of a single read of the static random access memory.

The number of rows of the target matrix is the number of rows corresponding to the data amount of a single read of the static random access memory, and the number of columns of the target matrix is the number of columns corresponding to the data amount of a single read of the static random access memory. Further, determining the number of rows of the target matrix and the number of columns of the target matrix according to the data precision type of the image to be processed and the number of bytes of a single read of the static random access memory may include: calculating, according to the number of bytes of a single read of the static random access memory, the data amount of a single read that satisfies the data precision type of the image to be processed; and determining the number of rows of the target matrix and the number of columns of the target matrix according to the data amount of the single read.

After the data precision type of the image to be processed and the number of bytes of a single read of the static random access memory are obtained, the data amount of a single read of the static random access memory, that is, the size of a single-read data block, can be calculated. The number of rows and the number of columns of the target matrix, that is, the number of rows and the number of columns of the single-read data block of the static random access memory, are then determined according to that size.

For example, the data precision type of the image to be processed is int8, the byte number of the single read data of the static random access memory is 256 bytes, the size of the single read data block of the static random access memory is calculated to be 16x16, and then the number of rows of the target matrix and the number of columns of the target matrix can be integer multiples of 16.

For another example, the data precision type of the image to be processed is int16 or fp16, and the number of bytes of a single read of the static random access memory is 256 bytes. The calculation shows that a single read of the static random access memory yields two 8x8 data blocks; the data within each block, and the blocks themselves, are still arranged in Z-shaped order. That is, one read of the static random access memory yields two 8x8 data blocks that are input into the systolic array for calculation. The number of rows of the target matrix is an integer multiple of 8 (that is, the data block size), and the number of columns of the target matrix is an integer multiple of 16.

For example, the data precision type of the image to be processed is int32 or fp32, and the number of bytes of a single read of the static random access memory is 256 bytes. The calculation shows that a single read of the static random access memory yields 4x4 data blocks; the data within each block, and the blocks themselves, are still arranged in Z-shaped order. That is, one read of the static random access memory yields 4x4 data blocks that are input into the systolic array for calculation. The number of rows of the target matrix is an integer multiple of 4 (that is, the data block size), and the number of columns of the target matrix is an integer multiple of 16.

For example, the data precision type of the image to be processed is int4 or fp4, and the number of bytes of a single read of the static random access memory is 256 bytes. In this case the data format uses a block size of 32x32 in Z-shaped distribution, and the static random access memory obtains one complete matrix block after two reads; the number of rows of the target matrix is 32, and the number of columns of the target matrix is 32.

In the embodiment of the disclosure, in the process of performing data format conversion on original image data, firstly, calculating the data quantity of single-time read data according to the data precision type of an image and the byte number of the single-time read data of a static random access memory, and further determining the number of rows of a target matrix and the number of columns of the target matrix according to the data quantity, so as to perform format conversion on the original image data according to the number of rows of the target matrix and the number of columns of the target matrix. Thus, the target image data obtained after format conversion can keep the consistency of the calculation format under the data precision of the image, and no extra data conversion cost is needed.
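The examples above can be reproduced with a small helper. This is a sketch only: the function name and the rule that every block row occupies 16 bytes are assumptions inferred from the int4, int8, int16 and int32 examples, not a rule stated in the text. The count of four 4x4 blocks per read for 32-bit data follows from the arithmetic (256 bytes hold 64 four-byte values).

```python
def block_geometry(bits_per_value, read_bytes=256, row_bytes=16):
    """Return (block_side, blocks_per_read, reads_per_block).

    Assumption: each block row occupies row_bytes bytes, so the block
    side is row_bytes * 8 / bits_per_value values.
    """
    side = row_bytes * 8 // bits_per_value
    block_bytes = side * side * bits_per_value // 8
    if block_bytes <= read_bytes:
        # One read returns one or more complete blocks
        return side, read_bytes // block_bytes, 1
    # One block is larger than a single read and needs several reads
    return side, 1, block_bytes // read_bytes


for name, bits in [("int4", 4), ("int8", 8), ("int16", 16), ("int32", 32)]:
    side, per_read, reads = block_geometry(bits)
    print(f"{name}: {side}x{side} block, {per_read} block(s) per read, {reads} read(s) per block")
```

Under this assumption the block side halves each time the precision doubles, while the 256-byte read size stays fixed.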

Step S930: performing data filling processing on the original image data according to the number of rows and the number of columns of the target matrix to obtain target image data.

Further, performing data filling processing on the original image data according to the number of rows and the number of columns of the target matrix to obtain target image data may include: sequentially ordering the data of each pixel point in the channel direction according to the coordinate position of each pixel point to obtain ordered image data, wherein the number of matrix columns of the ordered image data is the number of channels corresponding to the original image data, and the number of matrix rows of the ordered image data is the number of pixel points contained in the original image data; and performing data filling processing on the ordered image data in the column direction according to the number of columns of the target matrix, and performing data filling processing on the ordered image data in the row direction according to the number of rows of the target matrix, to obtain target image data, wherein the number of matrix columns of the target image data is the number of columns of the target matrix, and the number of matrix rows of the target image data is the number of rows of the target matrix.

The original image data comprises pixel data of each pixel point in the image block to be processed, and the pixel data comprises the coordinate position of each pixel point and the data of each pixel point in the channel direction. In the process of data filling, the pixel points can be ordered according to their coordinate positions. Specifically, the coordinate position of a pixel point includes its height information and width information; the pixel points are first ordered according to the height information, and the pixel points at the same height are then ordered according to the width information.

For example, the coordinate positions of the 1st to 4th pixel points are (1, 1), (1, 2), (1, 3), (1, 4) in order, the coordinate positions of the 5th to 8th pixel points are (2, 1), (2, 2), (2, 3), (2, 4) in order, the coordinate positions of the 9th to 12th pixel points are (3, 1), (3, 2), (3, 3), (3, 4) in order, and the coordinate positions of the 13th to 16th pixel points are (4, 1), (4, 2), (4, 3), (4, 4) in order. When the pixel points are ordered according to their coordinate positions, the pixel points in the 1st row are ordered first, then those in the 2nd row, and so on up to the 4th row; within each row, the pixel points are ordered according to their column information, so that the resulting sequence runs from the 1st pixel point to the 16th pixel point. If the channels are the three channels R, G, B, the resulting ordered image data is a 3x16 matrix, that is, 3 columns (one per channel) by 16 rows (one per pixel point).

After the ordering sequence of each pixel point is obtained, sequentially ordering the data of each pixel point in the channel direction according to the ordering sequence of each pixel point, and obtaining ordered image data. Wherein, each row is arranged with data of one pixel point in the channel direction, and the data of each pixel point in the channel direction can be arranged sequentially according to the R, G, B channel direction, so as to obtain the ordered image data. Then, performing data filling processing on the ordered image data in the column direction according to the number of columns of the target matrix, so that the number of columns after the filling processing is the number of columns of the target matrix; and performing data filling processing on the ordered image data in the row direction according to the target matrix row number, so that the row number after the filling processing is the target matrix row number, and finally, the data after the filling processing is the target image data.

In the above example, the ordered image data is a matrix of size 3x16 (3 channel columns by 16 pixel rows), and the number of rows and the number of columns of the target matrix are both 16. The number of matrix columns of the ordered image data is therefore padded to 16, and the resulting target image data is a 16x16 matrix in which each matrix row holds the data of one pixel point in the channel direction.

In the embodiment of the disclosure, in the process of performing data filling processing on original image data according to the number of rows and the number of columns of the target matrix, the pixel points can first be ordered according to their coordinate positions in the original image data, and the data of each pixel point in the channel direction is then arranged according to that order to obtain ordered image data. The data of one pixel point in the channel direction is arranged in the same row, and the data of different pixel points is arranged in different rows. Data filling processing is then performed according to the number of rows and the number of columns of the target matrix, so that the original image data is converted into target image data arranged in a format in which blocks are stored consecutively in order of height, width and channel. This format is suitable for systolic array calculation and makes calculation on the segmented data more efficient.
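The ordering and filling steps can be sketched in NumPy for the 4x4, three-channel example. The function names are hypothetical, and the input is assumed to arrive as a (C, H, W) array; the sketch pads with zeros to the next multiple of 16 in both directions.

```python
import numpy as np


def round_up(n, multiple):
    """Smallest multiple of `multiple` that is greater than or equal to n."""
    return -(-n // multiple) * multiple


def order_and_pad(image_chw, block=16):
    """Convert a (C, H, W) array into a zero-padded (H*W, C) matrix."""
    C, H, W = image_chw.shape
    # Order pixel points by height, then width: one row per pixel point,
    # one column per channel
    ordered = image_chw.transpose(1, 2, 0).reshape(H * W, C)
    target = np.zeros((round_up(H * W, block), round_up(C, block)),
                      dtype=image_chw.dtype)
    target[:H * W, :C] = ordered
    return target


# R, G and B planes of a 4x4 patch, filled with the values 1..48
image = np.arange(1, 3 * 4 * 4 + 1, dtype=np.int8).reshape(3, 4, 4)
target = order_and_pad(image)
print(target.shape)
print(target[0, :4])  # R, G, B of the first pixel point, then padding
```

Each row of `target` then holds the R, G, B values of one pixel point followed by 13 zeros.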

In an exemplary embodiment, the image data processing method may further include: acquiring parameter data, and segmenting the parameter data to acquire one or more segmented parameter data; and determining target parameter data corresponding to the target image data from one or more pieces of segmented parameter data.

The parameter data may include a convolution kernel and other related parameter settings. Convolution kernels are an important component of a convolutional neural network that defines the way the convolutional layers are calculated. In convolution computation, the convolution kernel performs a dot product with the input data and then sums up to obtain new pixel values, thereby forming a new image. Each weight parameter in the convolution kernel corresponds to a pixel location in the input data for weighted summation calculations. In addition to the convolution kernel, the parameter data includes other relevant parameter settings, such as stride (stride), padding (padding), etc. The setting of these parameters affects the result of the convolution calculation. For example, the step size determines the step size by which the convolution kernel moves over the input data, while the padding adds extra pixel values at the edges of the input data to keep the size before and after convolution unchanged. By adjusting the convolution kernel and the parameter setting, different feature extraction and mapping operations can be performed on the input data.

In the embodiment of the disclosure, the parameter data is stored in a dynamic random access memory. When the convolution calculation is performed, the parameter data is read from the dynamic random access memory into a static random access memory and then input into the systolic array, and the calculation unit of the systolic array performs the convolution calculation on the parameter data and the target image data. Because the capacity of the static random access memory is limited, in the embodiment of the disclosure the parameter data can be segmented to obtain one or more pieces of segmented parameter data; each piece of segmented parameter data contains the data required by the convolution, and the pieces are independent of one another.

The target parameter data is the parameter data required by the current convolution calculation; it is determined from the one or more pieces of segmented parameter data so that the convolution calculation can be performed on the target image data and the target parameter data. It should be noted that, in the embodiment of the disclosure, the parameter data acts on the target image data, and after the parameter data is segmented, convolution calculation is performed on each piece of segmented parameter data and the target image data.

In the embodiment of the disclosure, the parameter data is segmented to obtain one or more segmented parameter data, so that the static random access memory can read the segmented parameter data, and the problem of capacity limitation of the static random access memory can be solved.
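A minimal sketch of the segmentation idea follows. The text does not say along which axis the parameter data is split; the sketch assumes a split along the output columns, which keeps the pieces independent. The function name and the 16x48 parameter matrix are hypothetical.

```python
import numpy as np


def split_parameters(weights, batch_cols=16):
    """Split a (K, N) parameter matrix into column batches of width batch_cols."""
    return [weights[:, i:i + batch_cols]
            for i in range(0, weights.shape[1], batch_cols)]


rng = np.random.default_rng(0)
image_block = rng.integers(-4, 4, size=(16, 16)).astype(np.int32)  # one target image block
weights = rng.integers(-4, 4, size=(16, 48)).astype(np.int32)      # hypothetical parameter matrix

batches = split_parameters(weights)
# Each batch is multiplied with the image block independently of the others
partial = [image_block @ batch for batch in batches]
result = np.concatenate(partial, axis=1)
```

Because each batch produces its own output columns, the batches can be loaded and processed one at a time without changing the result.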

Step S230: inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

In an exemplary embodiment, inputting the target image data into the systolic array and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed may include: reading the target image data into an input buffer, and reading the target parameter data into a parameter buffer; inputting the target image data from the input buffer to the systolic array, and inputting the target parameter data from the parameter buffer to the systolic array; and performing convolution calculation on the target image data and the target parameter data through the systolic array to obtain the convolution calculation result.

Fig. 10 is a schematic diagram illustrating a matrix multiplication operation of image data and parameter data in the hwc_blk format according to an exemplary embodiment. The static random access memory reads 256 bytes of data at a time and the data precision type is int8, so the data block size of the image data read at a time by the static random access memory is 16x16, and the data block size of the parameter data is 16x16. Fig. 10 shows a matrix multiplication operation of 6 data blocks of image data and 6 data blocks of parameter data. In terms of the number of reads and writes of the static random access memory: the image data needs to be read 6 times; the parameter data needs to be read only 6 times if there is an intermediate buffer, and 12 times if there is no intermediate buffer. The image data is arranged in a Z-shaped manner, as is the data inside each image data block; the parameter data is arranged in an N-shaped manner, as is the data inside each parameter data block.
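A block-wise matrix multiplication of this kind can be sketched as follows. The grid shapes are assumptions, since the figure is not reproduced here: the 6 image blocks are taken as a 2x3 grid and the 6 parameter blocks as a 3x2 grid. Under that assumption there are 12 block products, which would be one way to account for 12 parameter reads when no intermediate buffer is available.

```python
import numpy as np

B = 16  # block side for int8 data with 256-byte reads


def blocked_matmul(a, w, block=B):
    """Multiply a by w one block at a time and count the block products."""
    m, k = a.shape
    n = w.shape[1]
    out = np.zeros((m, n), dtype=np.int64)
    products = 0
    for i in range(0, m, block):          # block row of the image data
        for j in range(0, n, block):      # block column of the parameter data
            for p in range(0, k, block):  # shared dimension
                out[i:i + block, j:j + block] += (
                    a[i:i + block, p:p + block] @ w[p:p + block, j:j + block]
                )
                products += 1
    return out, products


rng = np.random.default_rng(1)
a = rng.integers(-8, 8, size=(2 * B, 3 * B))  # assumed 2x3 grid of image blocks
w = rng.integers(-8, 8, size=(3 * B, 2 * B))  # assumed 3x2 grid of parameter blocks
out, products = blocked_matmul(a, w)
print(products)
```

Each parameter block takes part in two of the block products, so holding it in a buffer would halve the number of times it has to be fetched.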

Fig. 11 is a schematic diagram illustrating a matrix multiplication operation of image data and parameter data in the hwc_blk format according to still another exemplary embodiment. The static random access memory reads 256 bytes of data at a time and the data precision type is fp16, so the static random access memory reads two 8x8 data blocks of image data at a time, and the data block size of the parameter data is 8x8. Fig. 11 shows the convolution calculation of image data read twice by the static random access memory. The image data comprises 4 data blocks of 8x8 and the parameter data comprises 4 data blocks of 8x8; the image data is arranged in a Z-shaped manner, as is the data inside each image data block, and the parameter data is arranged in an N-shaped manner, as is the data inside each parameter data block. In terms of the number of reads and writes of the static random access memory: the image data is read 2 times, the parameter data is read only 2 times when a buffer is available, and the output data can be output directly without buffering. The image data input into the systolic array has an 8x8 block size in Z-shaped distribution, and the data output by the systolic array after calculation also has an 8x8 block size in Z-shaped distribution. It can be seen that the data format stays consistent throughout the calculation, so no additional data conversion cost is incurred.

In the embodiment of the disclosure, the input buffer can serve as a bridge between the systolic array that performs the convolution calculation and distributed storage, so that image data can be transmitted and processed more efficiently between the two. The input buffer can also perform preprocessing of the image data, such as normalization and standardization, so as to support efficient deep learning model training. The parameter buffer may be used to store and retrieve parameter data that needs to be read and updated frequently during training or task execution; caching this data improves access speed and efficiency.

Fig. 12 is a schematic diagram illustrating an image data processing method applied to a systolic array architecture, according to an example embodiment. The image data processing in the embodiment of the disclosure may be convolution calculation, in which in a process of performing convolution calculation through a systolic array architecture, image data and corresponding parameter data participating in the convolution calculation are read from DDR and stored in SRAM, then the image data is read into a Tensor buffer by SRAM, and the parameter data is read into a parameter buffer by SRAM, and then the image data and the parameter data are transmitted to a calculation unit of the systolic array for calculation.

As shown in fig. 12, the image data transferred into the Tensor buffer is image data in the hwc_blk format. In the example shown in fig. 12, the image to be processed on which convolution is to be performed needs to be split into multiple patches (image blocks) that are processed separately. For example, when the data precision type of the image to be processed is int8 and the size of a single SRAM read is 256 bytes, the size of the data block transferred to the Tensor buffer is 16x16, and the image to be processed can be split according to this data block size. The data corresponding to Patch1 in the figure is read as a whole from the DDR into the SRAM region corresponding to Patch1, and filling processing can be performed on the data during the read. The filling processing may include data filling according to steps S910 to S930, so that the data read into the SRAM is image data in the hwc_blk format; it may also include the padding that some CNN models require around the input data, for example, adding one row above and below and one column to the left and right of a 3x3 matrix to obtain a 5x5 matrix. It should be noted that, because the parameters of different CNN models differ, the amount of padding required for each call also differs.
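The CNN-style padding in this example can be sketched with `numpy.pad`. Zero padding with a width of one is assumed here; the actual padding width and value depend on the model.

```python
import numpy as np

x = np.arange(1, 10).reshape(3, 3)  # 3x3 input matrix

# Add one row above and below and one column left and right, filled with zeros
padded = np.pad(x, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)
```

The original values sit in the centre of the padded matrix, surrounded by a border of zeros.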

In addition, since the image data in the hwc_blk format contains all channel information, all the data required by the single convolution corresponding to Patch1 in the MI2C result matrix is complete and can be used for the matrix multiplication operation of the systolic array. In the convolution operation, each convolution kernel needs to slide a window over all channels of the input data. To facilitate parallel calculation, an img2col algorithm is adopted to convert the target image data input into the Tensor buffer into a new matrix, and the convolution is then realized through matrix multiplication. The MI2C result matrix is the converted matrix, in which each row has k×k×C data, where k is the convolution kernel size and C is the number of channels of the image data; therefore, as long as all channel information is included, a complete row of k×k×C data can be obtained.
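The img2col conversion can be sketched for channel-last input as follows. This is a generic illustration with stride 1 and no padding, not the specific MI2C implementation; the function name and the test sizes are hypothetical.

```python
import numpy as np


def im2col_hwc(x, k):
    """Unfold an (H, W, C) array into rows of k*k*C values (stride 1, no padding)."""
    H, W, C = x.shape
    out_h, out_w = H - k + 1, W - k + 1
    rows = np.empty((out_h * out_w, k * k * C), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # One sliding window, flattened in (height, width, channel) order
            rows[i * out_w + j] = x[i:i + k, j:j + k, :].reshape(-1)
    return rows


rng = np.random.default_rng(2)
x = rng.integers(-3, 4, size=(5, 5, 3))           # 5x5 image with 3 channels
kernels = rng.integers(-3, 4, size=(3, 3, 3, 2))  # (k, k, C, output channels)

cols = im2col_hwc(x, 3)
# The convolution becomes a single matrix multiplication
y = (cols @ kernels.reshape(-1, 2)).reshape(3, 3, 2)
```

Because the channel values of each pixel point are adjacent in the channel-last layout, every window row of k×k×C values can be filled from complete pixel points.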

As shown in fig. 12, the parameter data matrix is also faced with the problem of SRAM capacity limitation, so that the parameter data matrix is divided into N batches, each batch contains parameter data required by convolution, the batch size is constrained by the existing capacity of the SRAM, for example, when the data precision type of the image to be processed is int8, the size of the SRAM read data is 256 bytes at a time, the size of each batch of the parameter data matrix is 16x16, and the batch data are independent.

Considering that most images to be processed are in the CHW_MAT format, all data of the region corresponding to Patch1 is acquired through multiple instruction calls and spliced into the SRAM region corresponding to Patch1; the data is then converted into the HWC_BLK format through data format conversion, read into the Tensor buffer, and transferred from the Tensor buffer into the systolic array for convolution calculation. Because the channel information of one pixel point lies farther apart in the CHW_MAT format, the cost of splicing data into the SRAM region corresponding to Patch1 is higher for CHW_MAT than for an HWC format. The HWC_BLK format provided by the embodiment of the disclosure is therefore not only suitable for an efficient multi-precision systolic array, but also improves the calculation efficiency of image data processing.
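One way to illustrate the splicing cost is to count how many separate contiguous address ranges a patch occupies in each layout. This is a simplified cost model, not a measurement, and the 64x64 image size and function names are hypothetical.

```python
def patch_offsets(layout, C, H, W, h0, w0, ph, pw):
    """Linear offsets of every value of a ph x pw patch (all C channels)."""
    offsets = []
    for c in range(C):
        for h in range(h0, h0 + ph):
            for w in range(w0, w0 + pw):
                if layout == "chw":
                    offsets.append((c * H + h) * W + w)
                else:  # "hwc"
                    offsets.append((h * W + w) * C + c)
    return offsets


def count_runs(offsets):
    """Number of maximal runs of consecutive offsets."""
    s = sorted(offsets)
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)


# 16x16 patch in the top-left corner of an assumed 64x64 RGB image
chw_runs = count_runs(patch_offsets("chw", 3, 64, 64, 0, 0, 16, 16))
hwc_runs = count_runs(patch_offsets("hwc", 3, 64, 64, 0, 0, 16, 16))
print(chw_runs, hwc_runs)
```

In this model the channel-first layout needs one range per patch row per channel, while the channel-last layout needs one range per patch row.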

Fig. 13 is a schematic diagram showing an application of the image data processing method to a pooling operation according to an exemplary embodiment. The image data processing method provided by the embodiment of the disclosure can be applied to calculation on a systolic array architecture, and can also be applied to other scenarios, such as a pooling operation. As shown in fig. 13, the input matrix may be divided into patches in both the horizontal and vertical directions, so it is necessary to ensure that the size of each Patch is an integer multiple of the single read amount of the SRAM, which facilitates data arrangement in the memory; the operation may also wait until multiple patches are complete and then be performed uniformly at the input position.

Fig. 14 is a schematic diagram showing an image data processing method applied to a systolic array architecture according to still another exemplary embodiment. The number of bits allocated to the channel parameter in a computer instruction is limited. For example, if only 10 bits in the instruction represent the channel, the maximum corresponding value is 2^10 = 1024, and if the channel count of certain image data is 2048, the image data needs to be split across two instructions. As shown in fig. 14, if the number of channels exceeds the limit of the instruction, the input data is divided into multiple patches along the channel direction, that is, the image data input into the Tensor buffer is divided into multiple channel blocks, and the above-described image processing strategy is executed block by block. The output result of each block needs to be temporarily stored in the SRAM; after the result of the subsequent block has been output to the SRAM, the matrix addition instruction of the systolic array is called to add it to the calculation result of the previous block, and the final output is obtained after unified post-processing.
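The channel splitting and accumulation can be sketched as a matrix multiplication whose shared channel axis is processed in chunks. The function name and matrix sizes are hypothetical; the chunk limit of 1024 corresponds to a 10-bit channel field.

```python
import numpy as np

MAX_CHANNELS = 2 ** 10  # a 10-bit channel field allows at most 1024


def split_channel_matmul(x, w, max_channels=MAX_CHANNELS):
    """Compute x @ w by splitting the shared channel axis into chunks
    and accumulating the partial results."""
    acc = np.zeros((x.shape[0], w.shape[1]), dtype=np.int64)
    chunks = 0
    for start in range(0, x.shape[1], max_channels):
        stop = start + max_channels
        # The partial result of this channel block is added to the earlier ones
        acc += x[:, start:stop] @ w[start:stop, :]
        chunks += 1
    return acc, chunks


rng = np.random.default_rng(3)
x = rng.integers(-2, 3, size=(4, 2048))  # 2048 channels exceed the limit
w = rng.integers(-2, 3, size=(2048, 8))
acc, chunks = split_channel_matmul(x, w)
print(chunks)
```

The accumulated sum equals the result of a single multiplication over all channels, which is why the partial outputs can be combined with a matrix addition.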

According to the image data processing method provided by the embodiment of the disclosure, the target image data, that is, the image data input into the Tensor buffer, is represented in the preset data format. The target image data therefore adopts a block-based data structure and is stored in order of height, width and channel, so that it contains the data of every channel. The target image data that is read can be used immediately for matrix block operations without address conversion or judgment, which solves the problem that the data structures of the CHW_MAT, CHW_BLK and HWC_MAT formats are discontinuous during operation, and ensures maximum calculation efficiency when the systolic array performs convolution calculation.

It should be understood that the same or similar parts of the method embodiments described above in this specification may be referred to each other. Each embodiment focuses on its differences from the other embodiments; for the remaining parts, reference may be made to the descriptions of the other method embodiments.

FIG. 15 is a block diagram illustrating an image data processing apparatus applied to a systolic array architecture including a systolic array, according to an example embodiment. As shown in fig. 15, the image data processing apparatus 1500 may include a segmentation unit 1510, a format conversion unit 1520, and a calculation unit 1530.

Wherein the segmentation unit 1510 is configured to segment the image to be processed to obtain one or more image blocks to be processed; the format conversion unit 1520 is configured to obtain original image data of an image block to be processed, perform format conversion processing on the original image data, and obtain target image data arranged according to a preset data format, where the preset data format is a format in which blocks are continuously stored in order of height, width, and channel; the calculation unit 1530 is configured to input the target image data to a systolic array, and perform convolution calculation on the target image data by the systolic array, to obtain a convolution calculation result corresponding to the image block to be processed.
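As an illustration of the segmentation step only, the following Python sketch cuts an image of a given height and width into rectangular blocks. The tile size parameters and the function name `split_into_tiles` are assumptions made for the example; the disclosure does not fix how the block size is chosen.

```python
from typing import List, Tuple

# (top, left, block_height, block_width)
Tile = Tuple[int, int, int, int]


def split_into_tiles(height: int, width: int,
                     tile_height: int, tile_width: int) -> List[Tile]:
    """Cut an image into blocks in row-major order; edge blocks may be smaller."""
    if min(height, width, tile_height, tile_width) <= 0:
        raise ValueError("all dimensions must be positive")
    tiles: List[Tile] = []
    for top in range(0, height, tile_height):
        for left in range(0, width, tile_width):
            tiles.append((top, left,
                          min(tile_height, height - top),
                          min(tile_width, width - left)))
    return tiles
```

Each returned tuple identifies one image block to be processed; the format conversion and convolution steps would then be applied to the blocks one at a time.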

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises dynamic random access memory and static random access memory; wherein the format conversion unit 1520 is further configured to: determining position information of an image block to be processed in a dynamic random access memory; acquiring original image data stored in a dynamic random access memory according to the determined position information; and reading the original image data from the dynamic random access memory to the static random access memory, and performing format conversion processing on the original image data in the data reading process to obtain target image data.

In some exemplary embodiments of the present disclosure, format conversion unit 1520 is further configured to: acquiring the data precision type of an image to be processed and the byte number of single read data of a static random access memory; determining the number of rows of a target matrix and the number of columns of the target matrix according to the data precision type of the image to be processed and the number of bytes of single read data of the static random access memory; and carrying out data filling processing on the original image data according to the number of rows and columns of the target matrix to obtain target image data.

In some exemplary embodiments of the present disclosure, format conversion unit 1520 is further configured to: calculating the data quantity of the single-time read data meeting the data precision type of the image to be processed according to the byte number of the single-time read data of the static random access memory; and determining the number of rows of the target matrix and the number of columns of the target matrix according to the data quantity of the single-read data.
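The relationship between the SRAM read width, the data precision type and the target matrix size can be sketched as follows. This is a hedged example: the precision names in `BYTES_PER_ELEMENT`, the 64-byte read width used in the usage note, and the rule of rounding both dimensions up to a multiple of the per-read element count are assumptions for illustration, not values taken from the disclosure.

```python
from typing import Tuple

# Assumed precision types and their sizes in bytes (illustrative only).
BYTES_PER_ELEMENT = {"int8": 1, "fp16": 2, "fp32": 4}


def elements_per_read(sram_read_bytes: int, precision: str) -> int:
    """Number of data elements of the given precision fetched by one SRAM read."""
    return sram_read_bytes // BYTES_PER_ELEMENT[precision]


def round_up(value: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return -(-value // multiple) * multiple


def target_matrix_shape(num_pixels: int, num_channels: int,
                        sram_read_bytes: int, precision: str) -> Tuple[int, int]:
    """Assumed rule: align rows and columns to the per-read element count."""
    n = elements_per_read(sram_read_bytes, precision)
    return round_up(num_pixels, n), round_up(num_channels, n)
```

For example, under these assumptions a 64-byte read holds 32 fp16 elements, so a block of 50 pixels with 3 channels would be padded to a 64 x 32 target matrix.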

In some exemplary embodiments of the present disclosure, the raw image data includes pixel data of each pixel point in the image block to be processed, the pixel data including coordinate positions of each pixel point and data of each pixel point in a channel direction; wherein the format conversion unit 1520 is further configured to: according to the coordinate positions of the pixel points, sequentially ordering the data of the pixel points in the channel direction by rows to obtain ordered image data; the matrix column number of the ordered image data is the channel number corresponding to the original image data, and the matrix row number of the ordered image data is the pixel number contained in the original image data; according to the number of columns of the target matrix, performing data filling processing on the ordered image data in the column direction, and according to the number of rows of the target matrix, performing data filling processing on the ordered image data in the row direction to obtain target image data; the matrix column number of the target image data is the target matrix column number, and the matrix row number of the target image data is the target matrix row number.
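The ordering and filling steps can be sketched in Python as follows. The sketch assumes that the image block is given as a nested list indexed as `block[h][w][c]`, that pixels are visited in height-then-width order, and that padding uses zeros; the function name `to_padded_pixel_matrix` and the fill value are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def to_padded_pixel_matrix(block: Sequence[Sequence[Sequence[float]]],
                           target_rows: int, target_cols: int,
                           fill: float = 0) -> Matrix:
    """Order pixels by (height, width) as rows of channel data, then pad."""
    # One row per pixel; the columns of that row are the pixel's channel values.
    ordered = [list(pixel) for row in block for pixel in row]
    if len(ordered) > target_rows or any(len(r) > target_cols for r in ordered):
        raise ValueError("target matrix is smaller than the ordered image data")
    # Fill in the column direction up to the target column number.
    padded = [r + [fill] * (target_cols - len(r)) for r in ordered]
    # Fill in the row direction up to the target row number.
    padded += [[fill] * target_cols for _ in range(target_rows - len(padded))]
    return padded
```

A 2 x 2 block with 3 channels yields 4 ordered rows of 3 values each; padding to an 8 x 4 target adds one zero column and four zero rows.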

In some exemplary embodiments of the present disclosure, the segmentation unit 1510 is further configured to: acquiring parameter data, and segmenting the parameter data to acquire one or more segmented parameter data; and determining target parameter data corresponding to the target image data from one or more pieces of segmented parameter data.

In some exemplary embodiments of the present disclosure, the systolic array architecture further comprises an input cache and a parameter cache; wherein the computing unit 1530 is further configured to: reading target image data into the input cache, and reading target parameter data into the parameter cache; inputting target image data from the input cache to the systolic array, and inputting target parameter data from the parameter cache to the systolic array; and carrying out convolution calculation on the target image data and the target parameter data through the systolic array to obtain a convolution calculation result.
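To make the data flow concrete, the following Python sketch models the simplest case, a convolution with a 1 x 1 kernel, which reduces to a matrix product between a pixel-by-channel input matrix and a channel-by-output-channel parameter matrix. The plain lists standing in for the input cache and the parameter cache, and the function names `matmul` and `conv1x1`, are assumptions for illustration; the sketch does not model the timing or dataflow of an actual systolic array.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product; a is (rows x inner), b is (inner x cols)."""
    inner = len(b)
    if inner == 0 or any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def conv1x1(input_buffer: Matrix, parameter_buffer: Matrix) -> Matrix:
    """1 x 1 convolution: each pixel row is multiplied by the parameter matrix."""
    return matmul(input_buffer, parameter_buffer)
```

Here `input_buffer` holds one row per pixel and one column per input channel, and `parameter_buffer` holds one row per input channel and one column per output channel.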

The specific manner in which the various modules of the apparatus in the above embodiments perform their operations has been described in detail in connection with the method embodiments, and will not be described in detail herein.

Fig. 16 is a schematic diagram illustrating a structure of an electronic device suitable for use in implementing an exemplary embodiment of the present disclosure, according to an exemplary embodiment. It should be noted that the electronic device 1600 shown in fig. 16 is only an example, and should not impose any limitation on the functions and usage scope of the embodiments of the present disclosure.

As shown in fig. 16, the electronic device 1600 is embodied in the form of a general purpose computing device. The components of the electronic device 1600 may include, but are not limited to: the at least one processing unit 1610, the at least one memory unit 1620, and a bus 1630 connecting the different system components (including the memory unit 1620 and the processing unit 1610).

Wherein the storage unit stores program code that is executable by the processing unit 1610 such that the processing unit 1610 performs steps according to various exemplary embodiments of the present invention described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 1610 may perform step S210 as shown in fig. 2: cutting the image to be processed to obtain one or more image blocks to be processed; step S220: obtaining original image data of an image block to be processed, and performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format, wherein the preset data format is a format for continuously storing blocks according to the sequence of the height, the width and the channel; step S230: inputting target image data into a systolic array, and carrying out convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

The memory unit 1620 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 16201 and/or cache memory 16202, and may further include Read Only Memory (ROM) 16203.

The storage unit 1620 may also include a program/utility 16204 having a set (at least one) of program modules 16205, such program modules 16205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.

Bus 1630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit bus, or a local bus using any of a variety of bus architectures.

The electronic device 1600 can also communicate with one or more external devices 1660 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 1600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1650. Also, electronic device 1600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, e.g., the Internet, through network adapter 1640. As shown, network adapter 1640 communicates with other modules of electronic device 1600 over bus 1630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.

In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.

A program product for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).

It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.

Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.

From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. An image data processing method applied to a systolic array architecture, the systolic array architecture comprising a systolic array, the method comprising:

cutting the image to be processed to obtain one or more image blocks to be processed;

obtaining original image data of the image block to be processed, and performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format; the preset data format is a format for continuously storing blocks according to the sequence of the height, the width and the channel;

and inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

2. The method of claim 1, wherein the systolic array architecture further comprises dynamic random access memory and static random access memory;

the obtaining the original image data of the image block to be processed, performing format conversion processing on the original image data to obtain target image data arranged according to a preset data format, including:

determining position information of the image block to be processed in the dynamic random access memory;

acquiring the original image data stored in the dynamic random access memory according to the determined position information;

and reading the original image data from the dynamic random access memory to the static random access memory, and performing format conversion processing on the original image data in the data reading process to obtain the target image data.

3. The method according to claim 2, wherein the performing format conversion processing on the original image data to obtain the target image data includes:

acquiring the data precision type of the image to be processed and the byte number of the single read data of the static random access memory;

determining a target matrix row number and a target matrix column number according to the data precision type of the image to be processed and the byte number of the single read data of the static random access memory;

and carrying out data filling processing on the original image data according to the target matrix row number and the target matrix column number to obtain the target image data.

4. A method according to claim 3, wherein said determining a target matrix row number and a target matrix column number according to the data precision type of the image to be processed and the byte number of the single read data of the static random access memory comprises:

calculating the data quantity of the single read data meeting the data precision type of the image to be processed according to the byte number of the single read data of the static random access memory;

and determining the number of rows of the target matrix and the number of columns of the target matrix according to the data quantity of the single read data.

5. A method according to claim 3, wherein the raw image data comprises pixel data of each pixel point in the image block to be processed, the pixel data comprising coordinate positions of each pixel point and data of each pixel point in a channel direction;

the step of performing data filling processing on the original image data according to the target matrix row number and the target matrix column number to obtain the target image data includes:

according to the coordinate positions of the pixel points, sequentially ordering the data of the pixel points in the channel direction by rows to obtain ordered image data; the matrix column number of the ordered image data is the channel number corresponding to the original image data, and the matrix row number of the ordered image data is the pixel number contained in the original image data;

performing data filling processing on the ordered image data in the column direction according to the target matrix column number, and performing data filling processing on the ordered image data in the row direction according to the target matrix row number to obtain the target image data; the matrix column number of the target image data is the target matrix column number, and the matrix row number of the target image data is the target matrix row number.

6. The method according to claim 1, wherein the method further comprises:

acquiring parameter data, and segmenting the parameter data to acquire one or more segmented parameter data;

and determining target parameter data corresponding to the target image data from the one or more sliced parameter data.

7. The method of claim 6, wherein the systolic array architecture further comprises an input cache and a parameter cache;

the step of inputting the target image data into the systolic array, and performing convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed, comprises:

reading the target image data into the input cache, and reading the target parameter data into the parameter cache;

inputting the target image data from the input cache to the systolic array, and inputting the target parameter data from the parameter cache to the systolic array;

and carrying out convolution calculation on the target image data and the target parameter data through the systolic array to obtain the convolution calculation result.

8. An image data processing apparatus for application to a systolic array architecture, the systolic array architecture comprising a systolic array, the apparatus comprising:

a segmentation unit configured to segment the image to be processed to obtain one or more image blocks to be processed;

a format conversion unit configured to obtain original image data of the image block to be processed, and perform format conversion processing on the original image data to obtain target image data arranged according to a preset data format; the preset data format is a format for continuously storing blocks according to the sequence of the height, the width and the channel;

a calculation unit configured to input the target image data into the systolic array, and perform convolution calculation on the target image data through the systolic array to obtain a convolution calculation result corresponding to the image block to be processed.

9. An electronic device, comprising:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to execute the executable instructions to implement the image data processing method of any one of claims 1 to 7.

10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, cause the electronic device to perform the image data processing method of any one of claims 1 to 7.