CN115390923A - Multi-mode-based SIMD application efficient execution method and system - Google Patents


Info

Publication number
CN115390923A
Authority
CN
China
Prior art keywords
granularity
application
simd
logic
executed
Prior art date
Legal status
Pending
Application number
CN202210843537.9A
Other languages
Chinese (zh)
Inventor
汤胜中
范志华
李文明
安学军
叶笑春
范东睿
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202210843537.9A
Publication of CN115390923A
Legal status: Pending


Classifications

    • G06F 9/3887 — Concurrent instruction execution using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
    • G06F 9/4881 — Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5044 — Allocation of resources to service a request, considering hardware capabilities
    • G06F 9/505 — Allocation of resources to service a request, considering the load
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a multi-mode-based method and system for efficiently executing SIMD (single instruction, multiple data) applications, comprising: an architecture that flexibly controls SIMD components to run at different granularities, so that the SIMD components maintain high utilization across different applications and across different scales within the same application; and a method for finding an application's optimal granularity and generating the corresponding mapping scheme, so as to fully exploit the capability of the multi-mode SIMD component. Such a multi-mode SIMD component can be applied to various types of chips.

Description

Multi-mode-based SIMD application efficient execution method and system
Technical Field
The invention relates to the technical field of computer architecture, and in particular to a multi-mode-based SIMD application efficient execution method and system.
Background
SIMD (Single Instruction, Multiple Data) is a common technique in computer architecture. By adding execution units, one instruction can process multiple data items simultaneously, greatly increasing parallelism. Increasing the number of arithmetic units multiplies the chip's peak arithmetic performance. In high-throughput scenarios, especially where many data items undergo the same operation, as in multimedia processing, this parallelism can be fully exploited, yielding performance several times that of a conventional processor.
SIMD technology is widely used both in general-purpose computer chips and in some embedded chips. For example, the SSE instructions commonly used in CPUs are 128-bit SIMD instructions, and in recent years the AVX-512 instructions have expanded the SIMD width to 512 bits. However, larger SIMD widths also mean greater power consumption. Moreover, the optimal SIMD width differs from one application or algorithm to another, so a wider SIMD unit faces a greater likelihood of substantial waste.
Conventional solutions to this problem generally fall into two types. One pads the data with zeros; this is simple to implement but wastes the component's computing resources, and the more zeros padded, the greater the waste. The other turns off the inactive SIMD lanes through gating techniques, thereby reducing power consumption; this addresses the wasted power, but the potential performance of the idle lanes is still lost.
Disclosure of Invention
The invention aims to solve the low efficiency of existing solutions when the task granularity is smaller than the working granularity of the SIMD component, and provides a dynamically reconfigurable SIMD architecture, together with a matching method for searching the optimal SIMD granularity and a matching dynamic Kernel mapping method.
Aiming at the defects of the prior art, the invention provides a method for efficiently executing a multi-mode-based SIMD application, which comprises the following steps:
step 1, obtaining the SIMD working granularity of all the computing platforms according to the SIMD architecture characteristics of the computing platforms, and selecting the working granularity with the maximum granularity as the current working granularity;
step 2, obtaining the maximum parallel granularity of the application according to the parameter configuration of the application to be executed, and using the maximum parallel granularity as the residual parallel granularity;
step 3, judging whether the residual parallel granularity is greater than or equal to the current working granularity, or whether the result of subtracting the residual parallel granularity from the current working granularity is smaller than a preset value; if so, adding the current working granularity to a set as an execution scheme, subtracting the current working granularity from the residual parallel granularity to update the residual parallel granularity, and executing step 4; otherwise, selecting the SIMD working granularity below the current working granularity to update the current working granularity, and executing step 3 again;
step 4, judging whether the residual parallelism is greater than 0, if so, executing the step 3 again, otherwise, executing the step 5;
step 5, removing the execution scheme with the maximum working granularity in the set, and taking the execution scheme as the current execution scheme; determining the number of programs running in parallel according to the distribution granularity of the execution scheme and the number of nodes of the data flow graph of the application to be executed; traversing all the mappable logic PEs of the computing platform for each node in the data flow graph, and selecting the mappable logic PE with the shortest distance to the PE of the upstream node and the downstream node as a mapping scheme of a logic PE array;
and 6, judging whether the set has an execution scheme, if so, executing the step 5 again until all the execution schemes have the mapping scheme, generating a plurality of logic PEs by each PE in the computing platform according to the configuration of the mapping scheme, and executing the application to be executed by the computing array formed by the logic PEs to obtain an execution result.
In the above multi-mode-based SIMD application efficient execution method, step 5 comprises: dividing the number of SIMD components of a PE by the distribution granularity of the execution scheme to obtain the number of logic PEs each PE can be configured into, multiplying that number by the total number of PEs of the computing platform to obtain the total number of logic PEs of the computing platform, and dividing the total number of logic PEs by the number of nodes of the dataflow graph of the application to be executed to obtain the number of Kernels running in parallel.
The efficient execution method based on the multi-mode SIMD application is characterized in that the computing platform is a coarse-grained reconfigurable array CGRA platform; the application to be executed is a highly parallel image processing, neural network or matrix manipulation application.
The efficient execution method of the SIMD application based on the multi-mode is characterized in that the step 2 comprises the steps of searching out the maximum SIMD granularity configuration of the application to be executed as the maximum parallel granularity by a greedy method during compiling according to the parameter configuration of the application to be executed;
this step 5 comprises generating a dataflow graph of the application to be executed by compiling the application to be executed by a compiler.
The invention also provides a high-efficiency execution system of the SIMD application based on the multiple modes, which comprises the following steps:
the module 1 obtains the SIMD working granularity of all the computing platforms according to the SIMD architecture characteristics of the computing platforms, and selects the working granularity with the maximum granularity as the current working granularity;
the module 2 obtains the maximum parallel granularity of the application according to the parameter configuration of the application to be executed, and the maximum parallel granularity is used as the residual parallel granularity;
module 3, judging whether the residual parallel granularity is greater than or equal to the current working granularity, or whether the result of subtracting the residual parallel granularity from the current working granularity is smaller than a preset value; if so, adding the current working granularity to a set as an execution scheme, subtracting the current working granularity from the residual parallel granularity to update the residual parallel granularity, and calling module 4; otherwise, selecting the SIMD working granularity below the current working granularity to update the current working granularity, and calling module 3 again;
module 4, judge whether the remaining parallelism is greater than 0, if yes, call module 3 again, otherwise call module 5;
the module 5 removes the execution scheme with the maximum working granularity in the set and takes the execution scheme as the current execution scheme; determining the number of programs running in parallel according to the distribution granularity of the execution scheme and the number of nodes of the data flow graph of the application to be executed; traversing all the mappable logic PEs of the computing platform for each node in the data flow graph, and selecting the mappable logic PE with the shortest distance to the PE of the upstream node and the downstream node as a mapping scheme of a logic PE array;
and a module 6, determining whether the set has any execution scheme, if so, calling the module 5 again until all the execution schemes have the mapping scheme, generating a plurality of logic PEs by each PE in the computing platform according to the configuration of the mapping scheme, and executing the application to be executed by the computing array formed by the logic PEs to obtain an execution result.
The multi-mode based SIMD application efficient execution system is described, wherein the module 5 comprises: dividing the SIMD component number of the PE by the distribution granularity of the execution scheme to obtain the number of logic PEs which can be configured by each PE, multiplying the number of the logic PEs which can be configured by the total number of the PEs of the computing platform to obtain the total number of the logic PEs of the computing platform, and dividing the total number of the logic PEs by the number of nodes of the dataflow graph of the application to be executed to obtain the number of Kernel of the parallel operation.
The multi-mode-based SIMD application efficient execution system is characterized in that the computing platform is a coarse-grained reconfigurable array CGRA platform; the application to be executed is a highly parallel image processing, neural network or matrix manipulation application.
The efficient execution system of the SIMD application based on the multi-mode is characterized in that the module 2 comprises a module for searching out the maximum SIMD granularity configuration of the application to be executed as the maximum parallel granularity by a greedy method during compiling according to the parameter configuration of the application to be executed;
the module 5 comprises a dataflow graph that generates the application to be executed by compiling the application to be executed by a compiler.
The invention also provides a storage medium for storing a program for executing any one of the multi-mode based SIMD application efficient execution methods.
The invention also provides a client for use with any of the above multi-mode-based SIMD application efficient execution systems.
According to the scheme, the invention has the advantages that:
compared with the prior art, when the optimal SIMD granularity of the application is smaller than the granularity of the physical SIMD component, the idle component can be effectively utilized, and the parallel capability, the peak value performance and the energy consumption ratio of the chip are improved. The larger the SIMD granularity of the application differs from that of the physical SIMD component, the more significant the magnitude of the increase.
Drawings
FIG. 1 is an architecture diagram of reconfigurable SIMD;
FIG. 2 is an architecture diagram of SIMD 16;
FIG. 3 is a flow diagram of an optimal SIMD granularity selection method;
FIG. 4 is a flow chart of a dynamic Kernel mapping method;
FIG. 5 is a schematic diagram of the FSC operating state when running a SIMD4 application.
Detailed Description
None of the prior-art solutions to this problem considers how to fully utilize the unused SIMD lanes when the task granularity is smaller than the SIMD component width. To adapt to flexible task granularities and fully exploit the SIMD component's performance in this scenario, the present invention provides a SIMD dynamic reconfiguration architecture that can be applied to any chip architecture using SIMD technology.
In this embodiment of the present invention, a CGRA (Coarse-Grained Reconfigurable Array) platform is taken as an example for implementing the reconfigurable SIMD architecture. A CGRA is generally composed of an on-chip interconnection network, a PE (Processing Element) array, a Host, a cache, and so on. CGRAs offer better programmability than FPGAs (Field Programmable Gate Arrays), and better power efficiency and more general parallel capability than GPUs. The architecture provided by the invention can fully utilize the SIMD lanes that would otherwise be idle to expand parallelism, fully exploiting hardware performance and improving energy efficiency. In addition, to make the whole new architecture operable, the invention also provides an optimal SIMD granularity selection method and a dynamic Kernel mapping method matched to the hardware architecture.
The invention comprises the following key technical points:
Key point 1: the optimal SIMD granularity selection method. Technical effect: at the compilation stage, the application's work granularity can be decomposed into the combination of work granularities best suited to the reconfigurable SIMD architecture proposed by the present invention. Running the application with the work granularities generated by this method fully exploits the architecture's support for different SIMD granularities and improves the utilization of SIMD components;
Key point 2: the dynamic Kernel mapping method. Technical effect: the method automatically calculates the scale and number of the corresponding logic PE arrays according to the granularity combination output by the optimal SIMD granularity selection method, thereby determining how many Kernels are executed at one time and the instruction mapping of each Kernel. By introducing more Kernels, the method fully exploits the logic PE arrays of the reconfigurable SIMD architecture and improves the utilization of SIMD components.
Key point 3: the reconfigurable SIMD architecture. Technical effect: the SIMD component can work at different granularities and can be virtualized into multiple SIMD logic units, increasing the utilization of the SIMD component in small-granularity scenarios. This SIMD architecture can be applied to SIMD components in any architecture and thus has wide applicability.
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention mainly comprises three parts:
1. The method for searching an application's optimal SIMD granularity: according to the parameters of a given application, the largest SIMD granularity configuration suitable for the application is searched out by a greedy approach during compilation. This configuration is then used in the dynamic PE mapping method to generate the PE mapping configuration. The application may be, for example, a highly parallel image processing, neural network, or matrix operation application.
2. The dynamic PE mapping method: the PE mapping is calculated from the configuration information generated by the SIMD granularity selection method and the application's dataflow graph;
3. The dynamically reconfigurable SIMD hardware architecture: the SIMD dynamic architecture illustrated in the present invention is mainly implemented on a CGRA, but its scope of application is not limited to CGRAs. The CGRA is mainly composed of a PE array, a Host, a Buffer, and an on-chip interconnection network. Each PE in the present invention is composed of BPUs and FSCs: the BPUs are the basic arithmetic units, and the FSCs are control units organized as a hierarchical tree. Through FSC control of the BPUs, one PE can form multiple logic PEs.
The invention provides a dynamically reconfigurable SIMD architecture, taking the CGRA architecture as an example platform, as shown in FIG. 1, which illustrates the organization of the SIMD branch units within one physical PE unit. It contains a certain number of BPUs (Basic Processing Units) and FSCs (Flex-SIMD Controllers). Each BPU is a SIMD branch with arithmetic units. The FSCs are organized as a tree; in FIG. 1 each parent node of the FSC tree has 2 child nodes, but in practice the number of child nodes may vary as desired. The following description assumes two child nodes per FSC as an example. All BPUs sit on the leaf nodes of the FSC tree and are controlled by the FSCs.
The structure of the FSC tree is determined by the SIMD component width and by how many child nodes one FSC node can control. For example, if a SIMD component has 2^n branches, the corresponding architecture has 2^n BPUs, and an FSC tree of depth n suffices to control the 2^n BPUs. Alternatively, the FSC tree can be made wider so that each FSC controls 2^k child nodes; the tree then needs a depth of only n/k to control the 2^n BPUs. FIG. 2 is an example of the FSC tree structure with 2 child nodes per parent node when the SIMD width is 16.
By controlling the BPUs hierarchically, it can be decided which BPUs perform the same operation and which perform different operations. The top-level FSC receives control information (generated by the first two methods of the present invention) and distributes it to the FSCs of its child nodes. When the number of SIMD branches is 2^n and the component is to be controlled as SIMD 2^k units, each FSC at layer (n-k+1) manages exactly 2^k BPUs. An FSC at this layer controls its sub-FSCs so that all BPUs it manages perform the same operation; such a set of BPUs behaves as a logically independent SIMD component. Meanwhile, the FSCs at layer (n-k+1) are assigned different tasks, which is equivalent to virtualizing one SIMD 2^n component into multiple SIMD 2^k components. More concretely, with 16 SIMD branches, the operating state in SIMD4 mode is shown in FIG. 5: the FSC components marked black are working and pass the tasks assigned by the parent-layer FSC unchanged to their child nodes.
It can be seen that in the above example, layer (n-k+1) has a total of 2^(n-k) FSCs performing different operations, so the entire PE corresponds to 2^(n-k) logical SIMD 2^k components; equivalently, one physical PE contains 2^(n-k) logic PEs, each with a SIMD 2^k component. The whole logic PE array thus has 2^(n-k) times as many logic PEs as there are physical PEs, and when running SIMD 2^k-granularity tasks its peak performance can reach 2^(n-k) times that of the original structure.
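The virtualization arithmetic above can be illustrated with a small sketch (the function name is ours, not the patent's): a PE whose SIMD component has 2^n branches, run in SIMD 2^k mode, yields 2^(n-k) logic PEs.

```python
# Illustrative arithmetic for the virtualization described above:
# a PE with 2**n SIMD branches running in SIMD-2**k mode behaves
# as 2**(n-k) independent logic PEs.
def logic_pes_per_physical_pe(n: int, k: int) -> int:
    """Number of logic PEs a 2**n-wide SIMD PE yields in SIMD-2**k mode."""
    assert 0 <= k <= n, "mode granularity cannot exceed the component width"
    return 2 ** (n - k)

# A SIMD16 component (n=4) in SIMD4 mode (k=2) yields 4 logic PEs,
# so peak throughput for SIMD4 tasks is 4x that of the fixed-width design.
```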
The specific working process of the invention is as follows:
For an application, the SIMD granularity configuration is first determined during compilation from the application's parameters, using the optimal SIMD granularity selection method.
This method requires a list of the SIMD modes the current chip supports. At the compilation stage, it first calculates the application's maximum parallelism from the application's parameters, then attempts to decompose this parallelism against the SIMD mode support list in order from the largest granularity to the smallest. Specifically, it repeatedly allocates the application's parallelism to the largest granularity among the supported modes; once the remaining parallelism is smaller than the current round's operating-mode granularity, the next round attempts to decompose the remaining parallelism with a smaller granularity. Repeating this process yields a decomposition of the application's parallelism under the supported operating-mode granularities. The resulting configuration is a combination of SIMD work granularities; if the sum of the granularities exceeds the parallelism required by the application, it does so only by a small margin. The workflow of the method is illustrated below with reference to its steps in FIG. 3:
step S101: first, according to the specific characteristics of the reconfigurable SIMD architecture, list all the supported SIMD work granularities and sort them from largest to smallest. Select the first (i.e., the largest) granularity as the one to test;
step S102: calculate the application's maximum parallel granularity from its parameter configuration. In the following steps, this parallelism is continually allocated to the different working modes until it is fully allocated;
step S103: compare the selected work granularity with the current remaining parallelism. If the remaining parallelism is greater than or equal to the currently selected work granularity, or if the work granularity is greater than the remaining parallelism but the two are very close, then in step S106 add the work granularity to the configuration and subtract the currently selected working-mode granularity from the remaining parallelism;
if the above condition is not met, then in step S105 select the next smaller work granularity and repeat from step S103;
step S107: check the remaining parallelism; if any parallelism remains, jump back to step S103, otherwise the process ends.
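The greedy loop of steps S101 to S107 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function and variable names are ours, and the threshold for "very close" granularities is exposed as a `closeness` parameter since the patent only speaks of a preset value.

```python
# Sketch of the optimal-SIMD-granularity search (steps S101-S107).
def decompose_parallelism(supported_granularities, max_parallelism, closeness=0):
    """Greedily break an application's parallelism into supported
    SIMD work granularities, trying the largest granularity first."""
    grains = sorted(supported_granularities, reverse=True)  # S101: largest first
    remaining = max_parallelism                             # S102: max parallelism
    plan = []
    i = 0
    while remaining > 0 and i < len(grains):
        g = grains[i]
        # S103: take g if it fits, or if it overshoots by at most `closeness`
        if remaining >= g or (g - remaining) <= closeness:
            plan.append(g)                                  # S106: add to config
            remaining = max(0, remaining - g)
        else:
            i += 1                                          # S105: next smaller grain
    return plan

# e.g. parallelism 22 with supported modes {16, 8, 4, 2, 1}
# decomposes as [16, 4, 2].
```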
Next, according to the SIMD granularity configuration generated in the previous step and the application's dataflow graph, the dynamic Kernel (a program running on the CGRA) mapping method generates the mapping configuration from the Kernel's dataflow graph to the PEs. The method first calculates the scale of the logic PE array from the configuration information obtained above, then calculates how many Kernels can be mapped at one time at that scale. The last step completes the mapping from the Kernel's dataflow graph to the logic PEs; nodes in the dataflow graph are assigned PEs according to a minimum-distance principle. The specific steps are shown in FIG. 4:
step S201: select the next conf from the SIMD granularity configuration set generated by the previous method. The following steps generate the corresponding Kernel mapping configuration information for this conf.
step S202: according to the conf selected in S201 and the number of nodes in the Kernel's dataflow graph, determine how many Kernels should run in parallel to fully utilize the logic PE array. For example, when conf is SIMD4, the PE array has 16 PEs, the SIMD component of each PE is SIMD16, and the Kernel's dataflow graph has 4 nodes: each PE can be virtualized into 4 logic PEs, so the entire PE array has 64 logic PEs and each Kernel occupies only 4 of them. Thus 64/4 = 16 Kernels in parallel can fully occupy the logic PE array.
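The step-S202 calculation can be written out directly (variable names are illustrative, not from the patent):

```python
# Sketch of the step-S202 calculation: how many Kernels can run at once.
def parallel_kernel_count(simd_width, conf_granularity, num_pes, dfg_nodes):
    """Number of Kernels that fully occupy the logic PE array."""
    logic_pes_per_pe = simd_width // conf_granularity  # e.g. 16 // 4 = 4
    total_logic_pes = logic_pes_per_pe * num_pes       # e.g. 4 * 16 = 64
    return total_logic_pes // dfg_nodes                # e.g. 64 // 4 = 16

# The worked example from the text: SIMD16 PEs, conf = SIMD4,
# 16 physical PEs, a 4-node Kernel dataflow graph -> 16 parallel Kernels.
```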
step S203 completes the mapping of all Kernels' dataflow graphs onto the logic PE array. In this step, all mappable logic PEs are traversed for each node, and the available PE with the shortest distance to the PEs of the node's upstream and downstream neighbours in the dataflow graph is selected as the mapping choice. Choosing by distance reduces communication overhead and latency.
Each PE maps at most one node, and each node maps to exactly one PE. Once a node has been assigned a PE, that PE is no longer available.
step S205: determine whether any confs remain; if so, return to step S201, until the corresponding mapping information has been generated for all confs.
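The distance-based placement of step S203 can be sketched as below. This is an assumption-laden illustration: the patent does not fix the distance metric or tie-breaking, so we assume logic PEs on a 2D grid with Manhattan distance and deterministic tie-breaking, and all names are ours.

```python
# Minimal sketch of the shortest-distance node-to-PE mapping (step S203).
def map_dataflow_graph(nodes, edges, free_pes):
    """Map each dataflow-graph node to the free logic PE closest to its
    already-placed neighbours. `nodes` is in topological order; `edges`
    maps a node to its neighbour nodes; `free_pes` are (row, col) grid
    coordinates of available logic PEs."""
    def dist(a, b):  # Manhattan distance on the PE grid (assumed metric)
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    placement = {}
    available = set(free_pes)
    for node in nodes:
        placed_neigh = [placement[m] for m in edges.get(node, []) if m in placement]
        if placed_neigh:
            # pick the free PE minimising total distance to placed neighbours;
            # sorting makes tie-breaking deterministic
            best = min(sorted(available),
                       key=lambda p: sum(dist(p, q) for q in placed_neigh))
        else:
            best = min(available)  # first node: any deterministic choice
        placement[node] = best
        available.remove(best)  # one node per PE, one PE per node
    return placement
```

For a 2x2 logic PE array and a 3-node chain a -> b -> c, the nodes land on adjacent PEs, keeping each producer next to its consumer.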
Finally, the configuration information generated by the dynamic Kernel mapping method is passed to the top-level FSC in each PE, and the FSC of each parent node distributes control information downward to the FSCs of its child nodes, controlling whether they perform the same or different operations. For example, when configured in SIMD4 mode, the operating state of the entire FSC tree of a SIMD16 component is as shown in FIG. 5: only the FSCs of L1 to L3 are working, because the 4 groups of BPUs governed by different L3 SIMD controllers perform different operations, while the L4 FSCs are turned off, because the 4 BPUs managed by the different L4 SIMD controllers under the same L3 SIMD controller perform the same operation. The entire PE can then emulate 4 logic PEs, so the number of working PEs in the whole PE array is quadrupled, greatly enhancing the parallelism of the hardware.
The multiple configurations are executed sequentially in time, not in parallel; only one configuration is executed at a time. That is, after the mapping and cycle count corresponding to the first conf have finished executing, those corresponding to the second conf are executed.
The following are system embodiments corresponding to the above method embodiments, and this embodiment can be implemented in cooperation with the above embodiments. The technical details mentioned in the above embodiments remain valid in this embodiment and, to reduce repetition, are not described again here. Correspondingly, the technical details mentioned in this embodiment can also be applied to the above embodiments.
The invention also provides a multi-mode-based SIMD application efficient execution system, which comprises:
the module 1, obtaining all the SIMD working granularities of the computing platform according to the SIMD architecture characteristics of the computing platform, and selecting the largest working granularity as the current working granularity;
the module 2 obtains the maximum parallel granularity of the application according to the parameter configuration of the application to be executed, and the maximum parallel granularity is used as the residual parallel granularity;
the module 3, judging whether the residual parallel granularity is greater than or equal to the current working granularity, or whether the result of subtracting the residual parallel granularity from the current working granularity is less than a preset value; if so, adding the current working granularity as an execution scheme into a set, subtracting the current working granularity from the residual parallel granularity to update and replace the residual parallel granularity, and invoking the module 4; otherwise, selecting a SIMD working granularity lower than the current working granularity, updating and replacing the current working granularity, and invoking the module 3 again;
the module 4, judging whether the residual parallel granularity is greater than 0; if so, invoking the module 3 again, otherwise invoking the module 5;
the module 5, removing the execution scheme with the maximum working granularity from the set and taking it as the current execution scheme; determining the number of Kernels running in parallel according to the distribution granularity of the execution scheme and the number of nodes of the dataflow graph of the application to be executed; traversing all the mappable logical PEs of the computing platform for each node in the dataflow graph, and selecting the mappable logical PE with the shortest distance to the PEs of the upstream and downstream nodes to form the mapping scheme of the logical PE array;
and the module 6, judging whether the set still has an execution scheme; if so, invoking the module 5 again until all the execution schemes have mapping schemes; each PE in the computing platform then generates a plurality of logical PEs according to the configuration of the mapping schemes, and the computing array formed by the logical PEs executes the application to be executed to obtain an execution result.
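The granularity-decomposition loop of modules 1 to 4 can be sketched as below. This is a minimal illustration under stated assumptions: the platform granularities are given as a list, the preset threshold is a parameter, and the fall-through at the smallest level (accepting the smallest granularity unconditionally so the loop terminates) is an assumption of this sketch, not something the text specifies.

```python
def decompose(platform_granularities, app_parallelism, preset):
    """Split the application's maximum parallel granularity into a list of
    platform-supported SIMD working granularities (modules 1-4)."""
    levels = sorted(platform_granularities, reverse=True)  # module 1: start largest
    schemes = []
    remaining = app_parallelism                            # module 2
    i = 0
    while remaining > 0:                                   # module 4
        g = levels[i]
        # module 3: accept g if it fits, or if the overshoot is below the
        # preset value; the last-level fall-through is an added assumption
        # to guarantee termination
        if remaining >= g or g - remaining < preset or i == len(levels) - 1:
            schemes.append(g)
            remaining -= g
        else:
            i += 1                                         # try a smaller granularity
    return schemes
```

For example, with platform granularities {16, 8, 4}, an application parallelism of 20, and a preset of 1, the loop yields the execution schemes [16, 4].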
In the above multi-mode-based SIMD application efficient execution system, the module 5 comprises: dividing the number of SIMD components of a PE by the distribution granularity of the execution scheme to obtain the number of logical PEs into which each PE can be configured; multiplying this number by the total number of PEs of the computing platform to obtain the total number of logical PEs of the computing platform; and dividing the total number of logical PEs by the number of nodes of the dataflow graph of the application to be executed to obtain the number of Kernels running in parallel.
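The counting rule above amounts to two divisions and a multiplication; a worked example follows, where the concrete numbers (16 SIMD components per PE, 64 PEs, an 8-node dataflow graph) are illustrative assumptions.

```python
def parallel_kernels(simd_components, distribution_granularity, total_pes, dfg_nodes):
    logical_per_pe = simd_components // distribution_granularity  # logical PEs per PE
    total_logical = logical_per_pe * total_pes                    # platform-wide logical PEs
    return total_logical // dfg_nodes                             # Kernels runnable in parallel

# 16 SIMD components per PE in SIMD4 mode -> 4 logical PEs each;
# 64 PEs -> 256 logical PEs; an 8-node dataflow graph -> 32 parallel Kernels.
print(parallel_kernels(16, 4, 64, 8))  # → 32
```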
The multi-mode-based SIMD application efficient execution system is characterized in that the computing platform is a coarse-grained reconfigurable array CGRA platform; the application to be executed is a highly parallel image processing, neural network or matrix manipulation application.
The efficient execution system of the SIMD application based on the multi-mode is characterized in that the module 2 comprises a module for searching out the maximum SIMD granularity configuration of the application to be executed as the maximum parallel granularity by a greedy method during compiling according to the parameter configuration of the application to be executed;
the module 5 comprises generating the dataflow graph of the application to be executed by compiling the application to be executed with a compiler.
The invention also provides a storage medium for storing a program for executing any one of the multi-mode based SIMD application efficient execution methods.
The invention also provides a client used in any one of the above multi-mode-based SIMD application efficient execution systems.

Claims (10)

1. A multi-mode based SIMD application efficient execution method, comprising:
step 1, obtaining all the SIMD working granularities of the computing platform according to the SIMD architecture characteristics of the computing platform, and selecting the largest working granularity as the current working granularity;
step 2, obtaining the maximum parallel granularity of the application according to the parameter configuration of the application to be executed, and using the maximum parallel granularity as the residual parallel granularity;
step 3, judging whether the residual parallel granularity is greater than or equal to the current working granularity, or whether the result of subtracting the residual parallel granularity from the current working granularity is smaller than a preset value; if so, adding the current working granularity as an execution scheme into a set, subtracting the current working granularity from the residual parallel granularity to update and replace the residual parallel granularity, and executing step 4; otherwise, selecting a SIMD working granularity lower than the current working granularity, updating and replacing the current working granularity, and executing the step 3 again;
step 4, judging whether the residual parallel granularity is greater than 0; if so, executing the step 3 again, otherwise executing step 5;
step 5, removing the execution scheme with the maximum working granularity from the set and taking it as the current execution scheme; determining the number of Kernels running in parallel according to the distribution granularity of the execution scheme and the number of nodes of the dataflow graph of the application to be executed; traversing all the mappable logical PEs of the computing platform for each node in the dataflow graph, and selecting the mappable logical PE with the shortest distance to the PEs of the upstream and downstream nodes as the mapping scheme of the logical PE array;
and step 6, judging whether the set still has an execution scheme; if so, executing the step 5 again until all the execution schemes have mapping schemes; each PE in the computing platform generating a plurality of logical PEs according to the configuration of the mapping schemes, and the computing array formed by the logical PEs executing the application to be executed to obtain an execution result.
2. The method for efficient execution of a multi-mode-based SIMD application according to claim 1, wherein the step 5 comprises: dividing the number of SIMD components of a PE by the distribution granularity of the execution scheme to obtain the number of logical PEs into which each PE can be configured; multiplying this number by the total number of PEs of the computing platform to obtain the total number of logical PEs of the computing platform; and dividing the total number of logical PEs by the number of nodes of the dataflow graph of the application to be executed to obtain the number of Kernels running in parallel.
3. A multi-mode based SIMD application efficient execution method according to claim 1, wherein the computing platform is a coarse-grained reconfigurable array CGRA platform; the application to be executed is a highly parallel image processing, neural network or matrix manipulation application.
4. A multi-mode based SIMD application efficient execution method according to claim 1, wherein the step 2 comprises searching out the maximum SIMD granularity configuration of the application to be executed as the maximum parallel granularity by a greedy method during compiling according to the parameter configuration of the application to be executed;
the step 5 comprises generating the dataflow graph of the application to be executed by compiling the application to be executed with a compiler.
5. A multi-mode based SIMD application efficient execution system, comprising:
the module 1, obtaining all the SIMD working granularities of the computing platform according to the SIMD architecture characteristics of the computing platform, and selecting the largest working granularity as the current working granularity;
the module 2 obtains the maximum parallel granularity of the application according to the parameter configuration of the application to be executed, and the maximum parallel granularity is used as the residual parallel granularity;
the module 3, judging whether the residual parallel granularity is greater than or equal to the current working granularity, or whether the result of subtracting the residual parallel granularity from the current working granularity is less than a preset value; if so, adding the current working granularity as an execution scheme into a set, subtracting the current working granularity from the residual parallel granularity to update and replace the residual parallel granularity, and invoking the module 4; otherwise, selecting a SIMD working granularity lower than the current working granularity, updating and replacing the current working granularity, and invoking the module 3 again;
the module 4, judging whether the residual parallel granularity is greater than 0; if so, invoking the module 3 again, otherwise invoking the module 5;
the module 5, removing the execution scheme with the maximum working granularity from the set and taking it as the current execution scheme; determining the number of Kernels running in parallel according to the distribution granularity of the execution scheme and the number of nodes of the dataflow graph of the application to be executed; traversing all the mappable logical PEs of the computing platform for each node in the dataflow graph, and selecting the mappable logical PE with the shortest distance to the PEs of the upstream and downstream nodes to form the mapping scheme of the logical PE array;
and the module 6, judging whether the set still has an execution scheme; if so, invoking the module 5 again until all the execution schemes have mapping schemes; each PE in the computing platform then generates a plurality of logical PEs according to the configuration of the mapping schemes, and the computing array formed by the logical PEs executes the application to be executed to obtain an execution result.
6. The multi-mode-based SIMD application efficient execution system according to claim 5, wherein the module 5 comprises: dividing the number of SIMD components of a PE by the distribution granularity of the execution scheme to obtain the number of logical PEs into which each PE can be configured; multiplying this number by the total number of PEs of the computing platform to obtain the total number of logical PEs of the computing platform; and dividing the total number of logical PEs by the number of nodes of the dataflow graph of the application to be executed to obtain the number of Kernels running in parallel.
7. A multi-mode based SIMD application efficient execution system according to claim 5, wherein said computing platform is a coarse grain reconfigurable array CGRA platform; the application to be executed is a highly parallel image processing, neural network or matrix manipulation application.
8. A multi-mode based SIMD application efficient execution system according to claim 5, wherein said module 2 comprises means for searching out the maximum SIMD granularity configuration of said application to be executed as said maximum parallelism granularity by a greedy method during compilation based on the parameter configuration of said application to be executed;
the module 5 comprises generating the dataflow graph of the application to be executed by compiling the application to be executed with a compiler.
9. A storage medium storing a program for executing the method for efficiently executing a multi-mode-based SIMD application according to any one of claims 1 to 4.
10. A client for use in the multi-mode based SIMD application efficient execution system of any one of claims 5 to 8.
CN202210843537.9A 2022-07-18 2022-07-18 Multi-mode-based SIMD application efficient execution method and system Pending CN115390923A (en)


Publications (1)

Publication Number Publication Date
CN115390923A 2022-11-25

Family

ID=84117277



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination