CN114860433A - Method for performing pooling calculation operation on simulator - Google Patents

Method for performing pooling calculation operation on simulator

Info

Publication number
CN114860433A
CN114860433A (application CN202210440779.3A)
Authority
CN
China
Prior art keywords
image data
pooling
simulator
processing unit
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210440779.3A
Other languages
Chinese (zh)
Inventor
宋舒寒 (Song Shuhan)
曹华伟 (Cao Huawei)
叶笑春 (Ye Xiaochun)
范东睿 (Fan Dongrui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202210440779.3A priority Critical patent/CN114860433A/en
Publication of CN114860433A publication Critical patent/CN114860433A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5011Pool

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Image Processing (AREA)

Abstract

An embodiment of the invention provides a method for performing a pooling calculation operation on a simulator, comprising the following steps: acquiring the scale of the image data to be pooled and resource information of the designated processing units available for the current pooling on the simulator, the resource information comprising at least the number of designated processing units; scaling the image data according to that resource information to obtain adjusted image data, such that the pooling calculation operation on the adjusted image data can be evenly distributed across the designated processing units; and evenly distributing the adjusted image data to the designated processing units on the simulator to execute the pooling calculation operation.

Description

Method for performing pooling calculation operation on simulator
Technical Field
The present invention relates to the field of computer architecture, in particular to the technical field of accelerating computation by using a simulator, and more particularly to a method for performing pooling computation operation on the simulator.
Background
Simulators play a very important role in the design and development of computer systems. In the initial design stage, a simulator can perform coarse-grained simulation of candidate design schemes, and the best scheme is selected by comparing simulation results. During product development, simulators are used to evaluate various microarchitecture designs and make trade-offs among them. In the later stages of product development, the simulator is mainly used to develop system software for the target system, so that software and hardware development can proceed in parallel and overall system development is accelerated. After the system is completed, the simulator can collect rich trace information, enabling bottleneck analysis and performance optimization of the system. Because of this important role, a large number of simulators have been developed in both academia and industry. Taking a DPU (Deep-learning Processor Unit) simulator in the deep learning field as an example, the hierarchy of the DPU simulator mainly comprises a framework and a dynamic link library of DPU modules. An application programmer mainly works with upper-layer application data and instructions, including the data to be processed, an ARM program, and a data-flow program. These data and instructions are loaded into memory when the DPU simulator is initially loaded; an ARM core executes the ARM program starting from a specific address in memory, and the DMA and the data-flow array microcontroller are configured during ARM program execution, so that data, results, and the data-flow program are copied and executed, achieving the simulation.
The pooling layer is the most common data processing layer in deep learning, and its function is clear: it reduces the size of the feature map, and thereby the amount of computation and the required memory. Reducing the spatial size through the pooling layer improves operational efficiency; at the same time, fewer spatial values mean fewer parameters, which reduces the risk of overfitting and yields invariance to spatial transformations. There are four common pooling calculation operations: average pooling, maximum pooling, random (stochastic) pooling, and global average pooling. Taking average pooling as an example: average pooling is a pooling technique used in convolutional neural networks for image recognition. It works by sliding a window over local regions of a feature map and taking the average of all values in the window as the output, compressing the input into a lower-dimensional representation.
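The sliding-window average pooling described above can be sketched in a few lines of NumPy. This is an illustrative reference implementation, not the patent's simulator kernel; the window size and stride are chosen to match the 13 × 13 example used later in this description.

```python
import numpy as np

def average_pool(feature_map, window=3, stride=2):
    """Average pooling over a single-channel feature map (no padding)."""
    h, w = feature_map.shape
    out_h = (h - window) // stride + 1
    out_w = (w - window) // stride + 1
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Take the mean of all values under the current window position.
            patch = feature_map[i * stride:i * stride + window,
                                j * stride:j * stride + window]
            out[i, j] = patch.mean()
    return out

fm = np.arange(169, dtype=np.float64).reshape(13, 13)
pooled = average_pool(fm, window=3, stride=2)
print(pooled.shape)  # (6, 6)
```

A 13 × 13 input with a 3 × 3 window and stride 2 yields a 6 × 6 output, consistent with the example in the embodiment below.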
In the course of implementing the pooling layer on a DPU simulator, the inventors found that the current implementation directly assigns tasks based on the original data format of the task to be distributed. Because the tasks are not planned according to the processing resources actually available on the DPU simulator, processing resources are under-utilized and/or unevenly allocated. For example, there is no scheme for distributing the data so that all processing units are invoked to compute together. Likewise, the amounts of data assigned to the processing units may be unequal, so some processing units do more work than others and resource allocation is unbalanced. There is therefore a need for improvement over the prior art.
Disclosure of Invention
It is therefore an object of the present invention to overcome the above-mentioned drawbacks of the prior art and to provide a method for performing pooled computational operations on a simulator.
The purpose of the invention is realized by the following technical scheme:
According to a first aspect of the present invention, there is provided a method of performing a pooling calculation operation on a simulator, comprising: acquiring the scale of the image data to be pooled and resource information of the designated processing units available for the current pooling on the simulator, the resource information comprising at least the number of designated processing units; scaling the image data according to that resource information to obtain adjusted image data, such that the pooling calculation operation on the adjusted image data can be evenly distributed across the designated processing units; and evenly distributing the adjusted image data to the designated processing units on the simulator to execute the pooling calculation operation.
In some embodiments of the invention, the scaling comprises adjusting the pixel size or the number of channels of the image data, so that the designated processing units can equally divide the image data at the pixel level and/or the channel level.
In some embodiments of the invention, the pixel size of the image data is adjusted by appending all-zero pixels at the pixel edges of the image data; alternatively, the number of channels of the image data is adjusted by appending all-zero channels to the image data.
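Both adjustments amount to zero-padding along different axes of an (H, W, C) array. A minimal NumPy sketch (the function names and padding at the trailing edge are illustrative assumptions; the patent does not fix where the zeros go):

```python
import numpy as np

def pad_pixels(img, pad_h, pad_w):
    """Append all-zero rows/columns at the pixel edges of an (H, W, C) array."""
    return np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))  # constant 0 padding

def pad_channels(img, pad_c):
    """Append all-zero channels to an (H, W, C) array."""
    return np.pad(img, ((0, 0), (0, 0), (0, pad_c)))

img = np.ones((13, 13, 48))
print(pad_pixels(img, 1, 1).shape)   # (14, 14, 48)
print(pad_channels(img, 16).shape)   # (13, 13, 64)
```

Because the padded values are all zero, they contribute nothing but make the data divide evenly among the processing units.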
In some embodiments of the invention, the method further comprises: determining, according to the resource information of the designated processing units available for the current pooling on the simulator, whether the image data can be equally divided among the designated processing units at the pixel level and/or the channel level; if not, resizing the image data at the pixel level and/or the channel level.
In some embodiments of the present invention, the step of evenly distributing the adjusted image data to the designated processing units on the simulator to perform the pooling calculation operation comprises: at the adjusted scale level, evenly distributing the adjusted image data to the designated processing units on the simulator to perform the pooling calculation operation.
In some embodiments of the present invention, the resource information further comprises the number of data elements processed in one operation of the SIMD instruction used for the current pooling in a designated processing unit, and the scaling comprises: appending all-zero channels to the image data according to that resource information, so that the number of channels of the adjusted image data is an integer multiple of the product of the number of designated processing units and the number of data elements processed in one operation of the SIMD instruction used for the current pooling.
In some embodiments of the invention, the method further comprises: determining, according to the resource information of the designated processing units available for the current pooling on the simulator, whether the number of channels of the image data equals the number of data elements processed in one operation of the SIMD instruction used for the current pooling, and if so, resizing the image data at the pixel level; and, at the pixel level, evenly distributing the adjusted image data to the designated processing units on the simulator to perform the pooling calculation operation.
In some embodiments of the invention, the method further comprises: acquiring an intermediate pooling result obtained by performing the pooling calculation operation on the adjusted image data, and extracting the pooling result of the image data from the intermediate pooling result.
According to a second aspect of the present invention, there is provided an electronic apparatus comprising: one or more processors; and a memory, wherein the memory is to store executable instructions; the one or more processors are configured to implement the steps of the method of the first aspect via execution of the executable instructions.
Compared with the prior art, the invention has the advantages that:
the method comprises the steps of obtaining the scale of image data needing pooling and resource information of a specified processing unit which can be used for pooling at this time on a simulator, and carrying out scale adjustment on the image data according to the resource information of the specified processing unit which can be used for pooling at this time on the simulator, so that the adjusted image data are uniformly distributed to the specified processing unit on the simulator to execute pooling calculation operation. Thus, the available computing resources on the simulator are fully utilized to improve the resource utilization rate.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method of performing pooled computational operations on a simulator in accordance with an embodiment of the present invention;
FIG. 2 is a schematic diagram of the scaling at the channel level according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an arrangement of image data scaled at the channel level in an on-chip memory of a simulator according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an exemplary constant transfer according to the present invention;
FIG. 5 is a schematic diagram of an exemplary pooling process according to the present invention;
fig. 6 is a schematic diagram of the principle of resizing at the pixel level, according to an example of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by embodiments with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background section, the inventors found, in implementing the pooling layer on a DPU simulator, that the current implementation directly assigns tasks based on the original data format of the task to be distributed. Because the tasks are not planned according to the processing resources actually available on the DPU simulator, processing resources are under-utilized and/or unevenly allocated. Therefore, this application acquires the scale of the image data to be pooled and the resource information of the designated processing units available for the current pooling on the simulator, and scales the image data according to that resource information, so that the adjusted image data is evenly distributed to the designated processing units on the simulator to execute the pooling calculation operation. In this way, the available computing resources on the simulator are fully utilized and resource utilization is improved.
Before embodiments of the invention are explained in detail, some terms used therein will be explained:
Image data: an intermediate output (feature map) of a neural network that needs to be pooled.
Scale of the image data: the pixel scale and the channel scale of the image data; the pixel scale comprises width and height, and the channel scale comprises the number of channels.
According to an embodiment of the present invention, referring to fig. 1, there is provided a method of performing a pooled computational operation on a simulator, including steps S1, S2, S3, S4. For a better understanding of the present invention, each step is described in detail below with reference to specific examples.
Step S1: acquiring the scale of the image data to be pooled and the resource information of the designated processing units available for the current pooling on the simulator, the resource information comprising at least the number of designated processing units.
According to one embodiment of the present invention, suppose a user is performing performance analysis on a designed neural network model. To complete the design work more quickly, the neural network model may be run on a general-purpose computer, and the general-purpose computer is connected to a simulator executing the method of the present application to accelerate the model's computation. For example, the general-purpose computer is configured to send at least the image data that needs pooling to the simulator for auxiliary processing, which involves data transfer and coordination between the DDR of the general-purpose computer and the on-chip memory SPM (Scratch Pad Memory) of the simulator. For example, the size (e.g., in bytes) and the transmission mode of data transferred between the DDR and the SPM are set by configuring the DMA channel; the transfers include the constants required for the calculation and the input data (i.e., the image data) from DDR to SPM, and the output data (i.e., the pooling result) from SPM to DDR. It should be understood that the simulator of the present application may be a DPU (Deep-learning Processor Unit) simulator, a GPDPU (General-Purpose Deep-learning Processor Unit) simulator, or any other simulator capable of implementing the method of the present application; the present invention is not limited in this regard.
In accordance with one embodiment of the present invention, the scale of the image data to be pooled comprises width, height, and number of channels. This application represents the scale of image data in the form width × height × channels. For example, 13 × 13 × 128 denotes image data with 128 channels, each channel 13 pixels wide and 13 pixels high. It should be understood that this is illustrative only, and the invention is not limited in this regard.
At present, a simulator is often used by directly distributing the image data to be processed to a plurality of processing units, without planning the distribution in advance according to the number of available processing units; as a result, computational resources in the simulator are unevenly allocated and resource utilization is low. In order to use the simulator's resources more fully, reduce the time a user spends developing a model, and improve resource utilization, according to an embodiment of the present invention, before the image data is distributed, the resource information of the processing units (PEs) available for the current pooling on the simulator is obtained, so that the image data can be better distributed according to that information. Preferably, the resource information comprises at least the number of designated processing units. For example, the number of designated processing units available for the current pooling on the simulator may be 16, 32, or 128. It should be understood that this is illustrative only, and the invention is not limited in this regard.
It should also be noted that the method of the present application is generally applicable, for example to classification models in the medical field or recognition models in the field of target recognition; the pooling process of any neural network model that requires pooling can adopt the method of the present application.
Step S2: scaling the image data according to the resource information of the designated processing units available for the current pooling on the simulator to obtain adjusted image data, so that the pooling calculation operation on the adjusted image data can be evenly distributed across the designated processing units.
According to one embodiment of the invention, the scaling comprises adjusting the pixel size or the number of channels of the image data, so that the designated processing units can equally divide the image data at the pixel level and/or the channel level. The pixel size of the image data is adjusted, for example, by appending all-zero pixels at the pixel edges of the image data; the number of channels is adjusted, for example, by appending all-zero channels to the image data. Preferably, in this application, "designated processing unit" may denote, or be replaced with, a designated number of processing units. In order to later associate the pooling results with the original image data, a mapping relationship between the image data and the adjusted image data is established during the resizing; the mapping indicates the position of each element of the pooling result of the image data within the intermediate pooling result of the adjusted image data. For example, the mapping is expressed as the offset (in width, height, and/or channel) of each element of the pooling result of the image data within the intermediate pooling result of the adjusted image data. Suppose an element (pixel point) of the pooling result of the original, unadjusted image data lies at position (width, height, channel) = (2, 2, 9), and 24 all-zero channels are inserted between the original channel 8 and channel 9, giving the element an offset of (0, 0, 24) after scaling; then the value of this element lies at position (2, 2, 33) in the intermediate pooling result of the adjusted image data, from which the element's pooled value can be extracted.
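The offset bookkeeping in this embodiment is simple coordinate arithmetic. A hypothetical sketch (the function name and triple ordering are illustrative; the patent only specifies that an offset per element is recorded):

```python
def adjusted_position(pos, offset):
    """Map a (width, height, channel) position in the original pooling result
    to its position in the intermediate pooling result of the adjusted data."""
    return tuple(p + o for p, o in zip(pos, offset))

# Example from the text: element at (2, 2, 9) with 24 all-zero channels
# inserted before it, i.e. offset (0, 0, 24).
print(adjusted_position((2, 2, 9), (0, 0, 24)))  # (2, 2, 33)
```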
In this way, the pooled values of the other elements can likewise be extracted from the intermediate pooling result, thereby recovering the pooling result of the image data.
In some cases the image data may already divide evenly, so in practice the adjustment can be applied selectively according to the actual situation to improve resource utilization. According to an embodiment of the present invention, the method further comprises: determining, according to the resource information of the designated processing units available for the current pooling on the simulator, whether the image data can be equally divided among the designated processing units at the pixel level and/or the channel level; if not, resizing the image data at the pixel level and/or the channel level; if so, performing no actual scaling, i.e., the padding amounts for width, height, and number of channels are all 0. According to one embodiment of the invention, when the image data is scaled at the channel level, the additional all-zero channels are inserted at intervals among the original channels of the image data, so that the original data is spread more evenly.
According to an embodiment of the present invention, the resource information further comprises the number of data elements processed in one operation of the SIMD instruction used for the current pooling in a designated processing unit, and the scaling comprises: appending all-zero channels to the image data according to that resource information, so that the number of channels of the adjusted image data is an integer multiple of the product of the number of designated processing units and the number of data elements of one SIMD operation. Preferably, this integer multiple is the smallest such multiple. For example, if the number of designated processing units is 2 and the SIMD instruction used for the current pooling is SIMD32 (i.e., 32 data elements per operation), the product is 2 × 32 = 64; if the original image data has 48 channels, the number of channels is adjusted by the scaling in step S2, and the adjusted image data has 64 channels.
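The "smallest integer multiple" rule reduces to rounding the channel count up to a multiple of (number of PEs × SIMD width). A one-line sketch (illustrative helper name, not from the patent):

```python
def adjusted_channel_count(channels, num_pes, simd_width):
    """Smallest multiple of num_pes * simd_width that is >= channels."""
    unit = num_pes * simd_width
    return -(-channels // unit) * unit  # ceiling division

print(adjusted_channel_count(48, 2, 32))    # 64  (2 PEs x SIMD32, as above)
print(adjusted_channel_count(128, 16, 32))  # 512 (the 13x13x128 example below)
```

Both values match the worked examples in this description: 48 channels become 64 with 2 PEs, and 128 channels become 512 with 16 PEs.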
According to an embodiment of the present invention, the number of channels of the image data may be exactly equal to the number of data elements processed in one operation of the SIMD instruction used for the current pooling. In that case each processing unit could compute all channels of the image data in parallel at once, but if only one processing unit were allocated, the other designated processing units would sit idle and resource utilization would be low. Thus, according to an embodiment of the invention, the method further comprises: determining, according to the resource information of the designated processing units available for the current pooling on the simulator, whether the number of channels of the image data equals the number of data elements of one SIMD operation, and if so, resizing the image data at the pixel level. For example, if the SIMD instruction used for the current pooling is SIMD32 (32 data elements per operation) and the scale of the image data is 55 × 55 × 32, the number of channels (32) exactly equals the number of data elements of one SIMD operation (32). In this case, to further improve resource utilization, the width or height of the image data can be adjusted so that the image data is evenly distributed to the processing units along the width or height dimension; for example, the image is adjusted to 65 × 55 × 32.
Step S3: evenly distributing the adjusted image data to the designated processing units on the simulator to execute the pooling calculation operation.
In order to use the simulator's resources more efficiently, according to one embodiment of the present invention, the adjusted image data is evenly distributed to the designated processing units on the simulator at the adjusted scale level to perform the pooling calculation operation. For example, if the number of channels was adjusted in step S2, the adjusted image data is distributed to the designated processing units at the channel level; in the adjusted image data of the foregoing embodiment (adjusted from 48 to 64 channels), each group of 32 channels is assigned to one processing unit. Similarly, if the pixels were adjusted in step S2, the adjusted image data is distributed to the designated processing units at the pixel level.
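Channel-level distribution is an even split along the channel axis. A minimal sketch of the 48-to-64-channel example above (illustrative only; the patent's actual distribution is done by DMA into per-PE regions of SPM):

```python
import numpy as np

def split_channels(img, num_pes):
    """Split an (H, W, C) array into num_pes equal channel slices, one per PE."""
    assert img.shape[2] % num_pes == 0, "channels must divide evenly among PEs"
    return np.split(img, num_pes, axis=2)

img = np.zeros((13, 13, 64))     # already adjusted from 48 to 64 channels
parts = split_channels(img, 2)   # 2 designated processing units
print([p.shape for p in parts])  # [(13, 13, 32), (13, 13, 32)]
```

Each slice has exactly 32 channels, matching the SIMD32 width of one operation per PE.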
According to one embodiment of the present invention, the step of evenly distributing the adjusted image data to the designated processing units on the simulator to perform the pooling calculation operation comprises: allocating computing resources and storage resources on the designated processing units for the adjusted image data, and performing the pooling calculation operation on the adjusted image data in parallel through the SIMD instruction to obtain an intermediate pooling result.
Step S4: acquiring the intermediate pooling result obtained by performing the pooling calculation operation on the adjusted image data, and extracting the pooling result of the image data from the intermediate pooling result.
According to one embodiment of the invention, the intermediate pooling result produced by the simulator's pooling calculation operation on the adjusted image data is obtained, and the pooling result of the image data is extracted from it according to the mapping relationship: the mapping gives the position, within the intermediate pooling result, of each element of the pooling result of the image data, and the data extracted at those positions forms the pooling result of the image data. The pooling result is then transmitted to the device that performs the next computational task of the neural network model (e.g., the general-purpose computer or another processing unit of the simulator).
Two examples are given below, one for channel-level scaling and one for pixel-level scaling, to better illustrate the technical principles of the present application. It should be understood that the scale of the image data to be pooled varies across the intermediate outputs of different neural network models; the following examples are illustrative only and do not limit the scope of the invention.
Example 1 (channel level scaling):
1. Pooling implementation for a 13 × 13 × 128 feature map
First, take image data of size (i.e., scale) 13 × 13 × 128 as an example, assume 16 arithmetic elements (corresponding to the designated processing units available for the current pooling on the simulator) are available, and use SIMD32 instructions. The pooling layer of the neural network model uses a 3 × 3 window with stride 2, so the computed output is 6 × 6 × 128. Three calculation tasks (Tasks) are designed on the simulator, each performing the pooling operation for 2 output rows simultaneously. In addition, to use all 16 arithmetic elements and improve resource utilization, the 128 channels are divided into 16 parts, so that each PE processes 8 channels; but SIMD processing operates on 32 elements at a time, which 8 channels do not satisfy, so the data must be expanded. Referring to fig. 2, 24 all-zero 13 × 13 channel planes are inserted after each group of 8 original channels, expanding the 13 × 13 × 128 map into a 13 × 13 × 512 map. In the figure, data of the form (a, b) denotes the index of a pixel point; for example, (1, 3) denotes the pixel with height value 1 and width value 3, and the rest are similar and are not repeated here. Although many zero-valued (i.e., null) data are added to the original data, these null values do not affect the computation speed; instead, the layout is more regular and the portion of data assigned to each processing unit is identical in size.
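The interleaved expansion above can be sketched in NumPy: after every 8 original channels, 24 all-zero channel planes are inserted, so each of the 16 PEs receives one 32-channel group matching the SIMD32 width. This is an illustrative sketch of the data layout, not the simulator's C implementation (which builds the data in DDR via data_generation.c); the function name is an assumption.

```python
import numpy as np

def pad_channels_interleaved(img, group, zeros_per_group):
    """Insert zeros_per_group all-zero channel planes after every `group`
    original channels, so the original data stays evenly spread."""
    h, w, c = img.shape
    assert c % group == 0
    blocks = []
    for start in range(0, c, group):
        blocks.append(img[:, :, start:start + group])
        blocks.append(np.zeros((h, w, zeros_per_group), dtype=img.dtype))
    return np.concatenate(blocks, axis=2)

img = np.ones((13, 13, 128))
# 128 channels / 16 PEs = 8 channels per PE; pad each group of 8 to SIMD width 32.
expanded = pad_channels_interleaved(img, group=8, zeros_per_group=24)
print(expanded.shape)  # (13, 13, 512)
```

The 16 groups of 8 + 24 channels give 16 × 32 = 512 channels, and the original channel 9 ends up at index 33, consistent with the offset example given earlier in the description.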
In a specific implementation, the above scheme generates the adjusted image data in DDR in data_generation.c, within the data conversion module (SPM_data) of the simulator's on-chip memory, and then transmits the adjusted image data into the on-chip Scratch Pad Memory (SPM) via SIMD transmission in the test file testrun.c, yielding the data arrangement in SPM shown in fig. 3.
After the data transmission of this example's experimental procedure, the applicant designed an implementation of the simulator Kernel. The implemented kernel instantiates 4 × 4 = 16 logical nodes (Nodes, corresponding to the designated processing units), and each Node computes, in a loop, 2 rows of values of the output image of the pooling calculation operation. Each of these logical nodes is mapped onto a 4 × 4 PE array (these processing elements are the designated processing units) and executed by it. Since the current PE execution unit is SIMD32, each SIMD lane computes the pooling calculation operation of one image, so each task execution completes a partial pooling result of 32 input images at a time (2 rows of result data in each of 32 output feature maps). The currently implemented pooling kernel consists of 2 subtasks (Subtasks) per task; assuming the pooling is average pooling, the implementation of these two subtasks is described in detail below.
Subtask 1 (Subtask1): load data (load_copy, corresponding to LDN)
The main function of Subtask1 is to load the constants needed for average pooling (i.e., the reciprocal of the divisor; e.g., for a 3 × 3 pooling layer the divisor is 9 and its reciprocal is 1/9) into the local registers of the PE array.
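Multiplying by a preloaded reciprocal instead of dividing is a common trick on SIMD hardware, where division is far more expensive than multiplication. A minimal sketch of the idea (illustrative names only, not the simulator's actual kernel code):

```python
KERNEL = 3
RECIPROCAL = 1.0 / (KERNEL * KERNEL)  # 1/9, loaded once into a PE register

def window_average(window):
    """Average a pooling window by summing it and multiplying by the
    preloaded reciprocal of the divisor, avoiding a per-window divide."""
    return sum(window) * RECIPROCAL
```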
As shown in fig. 4, Subtask1 is composed of three classes of instruction templates (inst_block templates), namely csv0, csv1 and csv2, whose functions are as follows:
csv 0: the reciprocal of the divisor stored in SPM (assumed to be data a) is loaded and the data is copied to the downstream node.
Csv 1: the data obtained from the upstream node is Copied (COPY) to the downstream node.
Csv 2: similar to null operations (no null instruction, ADD instead, but no specific operation is performed, as long as the operation does not affect the value in the register), belong to leaf nodes.
Subtask 2 (Subtask2): load data and average (ld_copy_computer)
The main function of Subtask2 is to load the data in a pooling window, sum and average it, and store the result back into SPM. In this example, one output row is obtained by looping 6 times; the loop count is (input image width − pooling-layer kernel width)/stride + 1. Here, the input image is the image data, and the pooling-layer kernel width is the size of the pooling kernel used in the pooling calculation operation, e.g., 3 for a 3 × 3 pooling layer.
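The loop-count formula above can be checked directly with a trivial Python sketch (the helper name is ours):

```python
def num_windows(in_size, kernel, stride):
    """Number of pooling windows along one dimension:
    (input size - kernel size) // stride + 1."""
    return (in_size - kernel) // stride + 1

# Example 1: a 13-wide input with a 3 x 3 window and stride 2 gives
# (13 - 3) // 2 + 1 = 6 windows, so Subtask2 loops 6 times per row.
```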
As shown in fig. 5, Subtask2 consists of a single instruction template (inst_block template), whose function is as follows:
csv3: loads the data in a pooling window (corresponding to LDN), sums and averages it (corresponding to CMPT), and stores the result back into SPM (corresponding to ST). The 16 processing elements (PEs) work in parallel; each PE can be set, within a task, to compute two output rows simultaneously, so the 16 PEs produce two output rows across all 128 channels per task, and 3 tasks complete the entire computation.
Of course, the position from which data is fetched in each task, and the positions used by the different processing units within one task, are also set. For example, the first task takes data from rows 1–5 (top to bottom) of the image data, the second task from rows 5–9, and the third task from rows 9–13; processing units 1–16 take data from channel positions 1–32, 33–64, 65–96, ..., 481–512, respectively. The fetch positions of the same processing unit in each loop iteration, and the addresses at which the computed results are stored within a task, are also set: the address at which each element of the intermediate pooling result is saved, say yyyy, is recorded for subsequently restoring the pooling result of the image data according to this mapping relationship. For example, suppose the element at position (2, 2, 33) of the intermediate pooling result of the adjusted image data is stored at address yyyy and the corresponding offset is (0, 0, 24); then, when restoring the value at position (2, 2, 9) of the pooling result of the unadjusted image data, the pooling value of that element can be fetched from yyyy. In this way, the pooling values of the other elements can likewise be extracted from the intermediate pooling result, thereby restoring the pooling result of the image data.
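The index mapping used during restoration can be sketched as follows (0-based indices; a minimal Python illustration under the Example 1 layout of 8 real channels padded to 32 per processing unit, not the patent's actual code):

```python
def original_channel(adjusted_c, real_per_pe=8, padded_per_pe=32):
    """Map a channel index of the adjusted (padded) layout back to the
    original layout. Returns None for an all-zero padding channel that
    has no counterpart in the unadjusted image."""
    group, within = divmod(adjusted_c, padded_per_pe)
    if within >= real_per_pe:
        return None                      # padding channel, discard
    return group * real_per_pe + within

# Channel 33 of the adjusted data maps back to channel 9 of the
# original data, matching the channel offset of 24 in the text.
```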
After the computed result is obtained in SPM, it is transmitted from SPM back to DDR in single-channel simple transmission mode. After the DPU simulator program finishes executing, the ARM module is responsible for copying 12 MB of data starting from SPM_DDR_ADDR in the simulator's DDR module (the 12 MB size is fixed for now and can be modified according to actual conditions), generating a spu_out file, extracting the simulator execution result data from that file, and comparing it with the theoretical output result. Once the results are verified correct, statistics are obtained, such as the computing-resource utilization under the pooling scheme of Example 1 shown in Table 1.
TABLE 1
APP(0): initialization time 575, execution time 1538, idle time 0
PE: computing-resource utilization 21.12%
DMA: first start time 50194
DMA: final end time 80848
DMA: total switching times 3875
Arm: total execution time 78870
END TIME: 86540
Over!
From this example, it can be seen that for the pooling calculation of 13 × 13 × 128 image data, the resource utilization of the method of the present application reaches 21.12%. The existing method previously used by the applicant achieves about 15% resource utilization when pooling image data of the same size, so the method of the present application improves resource utilization.
Example 2 (pixel level scaling):
2. Pooling implementation of a 55 × 55 × 32 image
This example takes image data with a size (scale) of 55 × 55 × 32 as an example. Assume that 16 computing components (corresponding to the designated processing units on the simulator available for this pooling) can be used for pooling and that SIMD32 instructions are employed. The pooling layer of the neural network model is 3 × 3 with a stride of 2, so the computed output is 27 × 27 × 32. Since the number of channels is 32, one PE could complete the computation of each channel in parallel; however, to use all 16 computing components and improve resource utilization, the work is balanced at the pixel level instead. Because the output 27 × 27 × 32 cannot be evenly distributed over 16 PEs, the output is enlarged to 32 × 27 × 32, and accordingly a scheme is adopted that expands the original 55 × 55 × 32 data to a size of 65 × 55 × 32. This data expansion differs slightly from the previous one in that the expansion is performed within each channel (expansion of the pixel size): the pixel size of the input image data is expanded from 55 × 55 to 65 × 55, since (65 − 3)/2 + 1 = 32 while (55 − 3)/2 + 1 = 27, and the extra portion is filled with 0, as shown in fig. 6.
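The choice of 65 rows can be derived mechanically: search for the smallest padded height whose pooled output height divides evenly over the processing units. This is a hypothetical helper of ours, assuming 2 output rows per PE as in the text:

```python
def padded_height(in_h, kernel, stride, num_pes, rows_per_pe=2):
    """Smallest input height >= in_h whose pooled output height is a
    multiple of num_pes * rows_per_pe, so that output rows split evenly
    across the PEs; the extra input rows are filled with zeros."""
    target = num_pes * rows_per_pe  # output rows wanted per full pass
    h = in_h
    while ((h - kernel) // stride + 1) % target != 0:
        h += 1
    return h

# Example 2: 55 rows, 3 x 3 window, stride 2, 16 PEs -> pad to 65 rows,
# since (65 - 3) // 2 + 1 = 32 output rows, i.e. 2 rows on each PE.
```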
Following the ideas of data alignment with zero (null) padding and of evenly dividing the data among the processing units, 1 task (Task) is designed, in which each PE performs the pooling operation for 2 rows simultaneously. Because 16 computing components are used to improve resource utilization and SIMD instructions are used for processing, 32 images can be processed directly at once; with 27 loop iterations per task, 16 tasks directly complete the entire pooling operation.
The simulator Kernel is implemented similarly to Example 1; the only change is that, within a task, the offset addresses used by each PE for stores and fetches differ (i.e., the mapping relationship differs), which will not be repeated here. The computing-resource utilization under the pooling scheme of Example 2 is shown in Table 2.
TABLE 2
APP(0): initialization time 279, execution time 4438, idle time 0
PE: computing-resource utilization 10.85%
DMA: first start time 15893
DMA: final end time 32590
DMA: total switching times 4785
Arm: total execution time 27071
END TIME: 38300
Over!
From this example, it can be seen that for the pooling calculation of 55 × 55 × 32 image data, the resource utilization of the method of the present application reaches 10.85%, whereas the existing method previously used by the applicant achieves less than 10% on images of the same size. Although resource utilization differs across image sizes, the scheme of the present application improves component utilization for images of the same size, as can be seen from the results of running on the simulator.
Combining the two examples, the present method improves the resource utilization of pooling operations on the simulator for both the 55 × 55 × 32 image and the 13 × 13 × 128 image through the two data expansion modes. Moreover, the method can be extended to images of any standard size, and it contributes to advancing the efficient computation of pooling layers on actual hardware.
The two examples mainly reproduce and arrange the average pooling layer, providing a pooling scheme based on data alignment with zero padding and on reasonably distributing the data among the processing units; the main function of the programmed task (bottom-level library function) is to complete summation and averaging. However, it should be understood that other pooling manners, such as max pooling and min pooling, can be implemented with slight modifications based on the above examples, and the invention is not limited in this respect.
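To illustrate that last point, the reduction step is the only part that changes between pooling modes. A compact single-channel sketch (our own illustration, not the simulator's task code) takes the reducer as a parameter:

```python
def pool2d(image, kernel, stride, reduce_fn):
    """Single-channel 2-D pooling. Pass an averaging reducer for
    average pooling, or the built-in max / min for max / min pooling."""
    out_h = (len(image) - kernel) // stride + 1
    out_w = (len(image[0]) - kernel) // stride + 1
    return [[reduce_fn([image[i * stride + di][j * stride + dj]
                        for di in range(kernel) for dj in range(kernel)])
             for j in range(out_w)]
            for i in range(out_h)]

def average(window):
    return sum(window) / len(window)
```

Swapping `average` for `max` or `min` switches the pooling mode without touching the window-scanning logic, which mirrors the "slight modifications" mentioned above.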
It should be noted that, although the steps are described in a specific order, it is not meant that the steps must be executed in the specific order, and in fact, some of the steps may be executed concurrently or even in a different order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may comprise a computer readable storage medium having computer readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device such as a punch card or an in-groove raised structure having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method of performing pooled computational operations on a simulator, comprising:
acquiring the scale of image data to be pooled and resource information of designated processing units on a simulator that are available for the current pooling, wherein the resource information at least comprises the number of the designated processing units;
carrying out scale adjustment on the image data according to the resource information of the designated processing units available for the current pooling on the simulator to obtain adjusted image data, so that the pooling calculation operation on the adjusted image data can be evenly distributed over the designated processing units; and
distributing the adjusted image data evenly to the designated processing units on the simulator to execute the pooling calculation operation.
2. The method of claim 1, wherein the scale adjustment comprises adjusting a pixel size or a number of channels of the image data, so that the image data can be evenly divided among the designated processing units at a pixel level and/or a channel level.
3. The method of claim 2, wherein the pixel size of the image data is adjusted by adding all-zero pixels at the edges of the image data; alternatively, the number of channels of the image data is adjusted by appending all-zero channels to the image data.
4. The method of claim 2, further comprising:
determining, according to the resource information of the designated processing units available for the current pooling on the simulator, whether the image data can be evenly divided among the designated processing units on at least one of the pixel level and the channel level; if not, carrying out scale adjustment on the image data on at least one of the pixel level and the channel level.
5. The method of claim 4, wherein the step of performing a pooling calculation operation of the equalized distribution of the adjusted image data to designated processing units on a simulator comprises:
and at the adjusted scale level, the adjusted image data is distributed to a processing unit designated on a simulator in an equalizing way to execute pooling calculation operation.
6. The method according to any of claims 1-5, wherein the resource information further comprises the number of data elements processed in one operation of the SIMD instruction used for the current pooling in the designated processing units,
wherein the scaling comprises:
appending all-zero channels to the image data according to the resource information of the designated processing units available for the current pooling on the simulator and the number of data elements of one operation of the single instruction multiple data (SIMD) instruction used for the current pooling in the designated processing units, so that the number of channels of the adjusted image data is an integer multiple of the product of the number of the designated processing units and the number of data elements of one operation of the SIMD instruction used for the current pooling.
7. The method according to any one of claims 1-5, further comprising:
determining, according to the resource information of the designated processing units available for the current pooling, whether the number of channels of the image data is equal to the number of data elements of one operation of the SIMD instruction used for the current pooling; if so, carrying out scale adjustment on the image data at the pixel level;
and, at the pixel level, distributing the adjusted image data evenly to the designated processing units on the simulator to perform the pooling calculation operation.
8. The method according to any one of claims 1-5, further comprising:
and acquiring a middle pooling result obtained by performing pooling calculation operation on the adjusted image data, and extracting a pooling result of the image data from the middle pooling result.
9. A computer-readable storage medium, on which a computer program is stored which is executable by a processor for carrying out the steps of the method according to any one of claims 1 to 8.
10. An electronic device, comprising:
one or more processors; and
a memory, wherein the memory is to store executable instructions;
the one or more processors are configured to implement the steps of the method of any one of claims 1-8 via execution of the executable instructions.
CN202210440779.3A 2022-04-25 2022-04-25 Method for performing pooling calculation operation on simulator Pending CN114860433A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210440779.3A CN114860433A (en) 2022-04-25 2022-04-25 Method for performing pooling calculation operation on simulator


Publications (1)

Publication Number Publication Date
CN114860433A true CN114860433A (en) 2022-08-05

Family

ID=82633424



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination