WO2021227418A1 - Task deployment method and device based on a multi-board FPGA heterogeneous system - Google Patents
Task deployment method and device based on a multi-board FPGA heterogeneous system
- Publication number
- WO2021227418A1 (PCT/CN2020/129554)
- Authority
- WO
- WIPO (PCT)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G06F9/4893—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues taking into account power or heat criteria
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Definitions
- The present invention relates to the technical field of heterogeneous computing, and in particular to a task deployment method and device based on a multi-board FPGA heterogeneous system.
- In a multi-board FPGA heterogeneous system that uses a pipeline scheme, the overall task must be split into multiple subtasks, and the subtasks are deployed across the individual FPGAs in pipeline form.
- Most existing task-division methods split and deploy subtasks based only on their surface characteristics. In a convolutional neural network, for example, tasks are divided solely according to the number of convolutional and fully connected layers. This leaves the entire multi-board FPGA heterogeneous system significantly imbalanced, with considerable room for improvement. Because such division is manual, it is subjective and arbitrary, costs time and effort to verify, and does not carry over to other tasks: whenever the task to be executed changes, the division must be redone by hand, so the approach lacks generality.
- The present invention provides a task deployment method based on a multi-board FPGA heterogeneous system, including: dividing the total task into a number of subtasks arranged in task-execution order; calculating the running consumption of each subtask; determining, from the running consumption of each subtask and the number of FPGA boards in the multi-board FPGA heterogeneous system, the running-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed; under the constraint that the sum of the running consumption of the subtasks deployed on that FPGA is close to the corresponding constraint value, repeatedly splitting the subtask sequence in two in task-execution order by a bisection-iteration method, until the split-off portion of subtasks satisfies the constraint, thereby determining that portion as the subtasks to be deployed on that FPGA; and deploying those subtasks on that FPGA.
- Determining the running-consumption constraint value from the running consumption of each subtask and the number of FPGA boards in the multi-board FPGA heterogeneous system includes: dividing the sum of the running consumption of all subtasks by the maximum running consumption among them to obtain a quotient; judging whether the number of FPGA boards is greater than the round-up (ceiling) of the quotient; if so, setting the constraint value to the maximum running consumption; if not, setting the constraint value to the quotient.
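As a sketch, the constraint-value rule above can be written out directly. The function name and structure are illustrative, not taken from the patent:

```python
import math

def consumption_constraint(costs, num_boards):
    """Running-consumption constraint value per the claimed rule.

    costs: running consumption of each subtask, in task-execution order.
    num_boards: number of FPGA boards in the heterogeneous system.
    """
    total = sum(costs)                    # sum of all subtask consumptions
    max_cost = max(costs)                 # maximum single-subtask consumption
    quotient = total / max_cost
    if num_boards > math.ceil(quotient):  # boards are plentiful:
        return max_cost                   #   cap each board at the heaviest subtask
    return quotient                       # boards are scarce: use the quotient
```

With costs [3, 1, 2] the quotient is 6/3 = 2, so three boards give a constraint of 3 (the maximum running consumption) while two boards give 2.0 (the quotient).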
- The subtask indices are set in task-execution order as an index array with starting index n and ending index m, the index array being an arithmetic sequence with common difference 1; a bisection target model is constructed with the index array as the independent variable, whose dependent variable is the sum of the running consumption of all subtasks from the starting index to the independent variable, minus the running-consumption constraint value; from the bisection target model and the starting index, the endpoint target index t of the subtasks to be deployed on the current FPGA is obtained.
- A specified operation is executed in a loop until the sum of the running consumption of the subtasks from index t+1 to index m is less than or equal to the constraint value, at which point the endpoint target index of the last division, t = m, is output; the specified operation includes updating the number of FPGA boards and the starting index, and returning to the step of determining the running-consumption constraint value from the running consumption of each subtask and the number of FPGA boards, so as to update the constraint value.
- Judging whether the judgment point T is the endpoint target index t according to the magnitude relationship between the constraint value and the maximum running consumption includes: judging whether the constraint value equals the maximum running consumption; if so, confirming that the difference between the running consumption of all subtasks from the starting index n to the judgment point T and the constraint value lies in the left neighborhood closest to 0 in the bisection target model, in which case T is the endpoint target index t; if not, confirming that the absolute value of that difference is closest to 0, in which case T is the endpoint target index t.
- Confirming that the absolute value of the difference between the running consumption of all subtasks from the starting index n to the judgment point T and the constraint value is closest to 0 includes: letting a be that absolute difference for indices n through T, b the absolute difference for indices n through T+1, and c the absolute difference for indices n through T−1; if a ≤ b and a ≤ c, then the absolute difference for indices n through T is closest to 0.
- Confirming that the difference lies in the left neighborhood of 0 includes: confirming that the running consumption of all subtasks from the starting index n to the judgment point T is less than or equal to the maximum running consumption, while the running consumption of all subtasks from n to T+1 is greater than the maximum running consumption; in that case, the difference between the running consumption of subtasks n through T and the constraint value lies in the left neighborhood closest to 0 in the bisection target model.
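A minimal sketch of this search, under the assumption that all consumptions are positive (so f(i) = Σ cost[n..i] − constraint is monotonically increasing and binary search applies). Names are illustrative, not from the patent:

```python
def endpoint_index(costs, n, m, constraint, max_cost):
    """Find the endpoint target index t in [n, m] via the bisection target model.

    f(i) = sum(costs[n..i]) - constraint is monotonically increasing for
    positive costs, so a binary search over the index array is valid.
    """
    prefix = lambda i: sum(costs[n:i + 1])   # consumption of subtasks n..i
    lo, hi = n, m
    while lo < hi:                            # binary search for the first i with f(i) >= 0
        mid = (lo + hi) // 2
        if prefix(mid) - constraint < 0:
            lo = mid + 1
        else:
            hi = mid
    T = lo
    if constraint == max_cost:
        # "left neighborhood" case: largest T whose prefix does not exceed the maximum
        while T > n and prefix(T) > constraint:
            T -= 1
    else:
        # pick the neighbor of T whose |prefix - constraint| is smallest (a <= b, a <= c)
        cands = [i for i in (T - 1, T, T + 1) if n <= i <= m]
        T = min(cands, key=lambda i: abs(prefix(i) - constraint))
    return T
```

In the first branch, T ends as the largest index whose prefix consumption stays within the maximum running consumption (the left-neighborhood condition); in the second, the candidate among T−1, T, T+1 with the smallest absolute difference is chosen, matching the a ≤ b and a ≤ c test above.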
- the present invention also provides an electronic device including a memory and a processor coupled to each other, and the processor is configured to execute program instructions stored in the memory to implement the task deployment method described above.
- the present invention also provides a computer-readable storage medium on which program data is stored, and when the program data is executed by a processor, the above-mentioned task deployment method is realized.
- the present invention has the following beneficial effects:
- The task deployment method provided by the present invention divides the total task into several subtasks, sets the running-consumption constraint value according to the running consumption of each subtask and the number of FPGA boards, and uses the bisection-iteration method to divide out the subtasks to be deployed on each FPGA.
- This achieves a finer-grained split of the overall task, gives the multi-board FPGA heterogeneous system a higher task throughput and a better-balanced pipeline across FPGA boards, and further improves the processing efficiency per unit of hardware resources; and,
- The method is applicable to any feedforward task that can be split, overcomes the drawbacks of manual task division and deployment in the prior art, and is more general.
- FIG. 1 is a schematic diagram of a traditional multi-board FPGA heterogeneous system structure;
- FIG. 2 is a schematic diagram of a pipelined multi-board FPGA heterogeneous system structure;
- FIG. 3 is a comparison diagram of the multi-cycle execution mode of the traditional multi-board FPGA heterogeneous system and the pipelined multi-board FPGA heterogeneous system;
- FIG. 4 is a schematic diagram of traditional task division in a pipelined multi-board FPGA heterogeneous system;
- FIG. 5 is a schematic flowchart of an embodiment of a task deployment method for a multi-board FPGA heterogeneous system according to the present invention;
- FIG. 6 is a task-split diagram of an embodiment in step S11 of the multi-board FPGA heterogeneous system of the present invention;
- FIG. 7 is a comparison diagram of the task-division result of the traditional multi-board FPGA heterogeneous system and that of the present invention;
- FIG. 8 is an overall flowchart of the multi-board FPGA heterogeneous system of the present invention;
- FIG. 9 is a flowchart of the bisection iterative process in FIG. 8;
- FIG. 10 is a schematic diagram of the task execution flow of the multi-board FPGA heterogeneous system of the present invention;
- FIG. 11 is a photograph of the experimental verification of a multi-board FPGA heterogeneous system according to the present invention;
- FIG. 12 is a schematic framework diagram of an embodiment of a computer-readable storage medium according to the present invention.
- The terms "system" and "network" are often used interchangeably in this document.
- The term "and/or" describes an association between objects and covers three cases: for example, "A and/or B" can mean that A exists alone, that A and B both exist, or that B exists alone.
- The character "/" generally indicates an "or" relationship between the objects before and after it.
- "Several" or "a number of" in this document means two or more.
- Multi-board FPGA heterogeneous computing cascades multiple hardware computing units and apportions the computational load among them according to the task. Compared with a CPU or GPGPU, it offers better flexibility and lower energy consumption, making it well suited to deploying deep-learning inference algorithms that execute artificial neural network models.
- the traditional multi-board FPGA heterogeneous system structure shown in FIG. 1 is composed of a host device and multiple slave devices, and the host device and the slave device are interconnected through a PCIe bus.
- the host device is composed of one or more general-purpose CPUs and their memory
- the slave device is composed of an FPGA chip and device memory.
- The main working process of the traditional multi-board FPGA heterogeneous system is: the CPU core transfers the data required by the FPGA from host-device memory to slave-device memory over the PCIe bus and starts the slave device for parallel data processing.
- Apart from control, the CPU core performs little or no computation; when the slave device finishes processing, the result data is transmitted back to the host device over the PCIe bus. The traditional multi-board FPGA heterogeneous system therefore spends a great deal of time on long-distance communication and data transfer.
- A pipelined multi-board FPGA heterogeneous system also consists of a host device and multiple slave devices.
- Unlike the traditional system, the host device of the pipelined multi-board FPGA heterogeneous system is a CPU+FPGA heterogeneous system or SoC chip, and the slave devices may be CPU+FPGA heterogeneous systems or SoC chips like the host, or may all be pure FPGA devices.
- FIG. 3 is a comparison diagram between the multi-cycle execution mode of the traditional multi-board FPGA heterogeneous system and the pipelined multi-board FPGA heterogeneous system.
- The throughput of the multi-cycle execution mode and that of the pipelined execution mode differ as FIG. 3 shows, the pipelined mode achieving the higher rate.
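The throughput formulas themselves do not survive in this text; as a standard reference (an assumption, not taken from the patent's figure), for $K$ boards with per-stage processing times $T_1,\dots,T_K$:

```latex
% Multi-cycle: one input occupies all K boards end-to-end before the next starts.
R_{\text{multi-cycle}} \;=\; \frac{1}{\sum_{k=1}^{K} T_k}
% Pipelined: in steady state one result emerges per pipeline period,
% which is set by the slowest stage.
R_{\text{pipelined}} \;=\; \frac{1}{\max_{k} T_k}
```

With balanced stages ($T_k \equiv T$), the pipelined mode is roughly $K$ times faster, which is why balancing the per-board consumption matters.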
- Traditional task division is mostly a simple split and deployment based on the surface characteristics of each subtask.
- FIG. 4 is a schematic diagram of traditional task division in a pipelined multi-board FPGA heterogeneous system.
- In the split of LeNet, for example, the total task is divided into subtasks only by convolutional layer, and all fully connected layers are deployed as a single subtask on the same FPGA, leaving the multi-board FPGA heterogeneous system significantly imbalanced.
- FIG. 5 is a schematic flowchart of an embodiment of a task deployment method based on a multi-board FPGA heterogeneous system of the present invention, which specifically includes the following steps:
- FIG. 6 is a task split diagram of an embodiment of the multi-board FPGA heterogeneous system of the present invention.
- The Vivado HLS tool is used to synthesize the split tasks and obtain the required running time and resource occupancy of each subtask, from which the running consumption of each subtask is derived.
- Here, running consumption refers to running delay, so the running consumption of each subtask is its running delay.
- Because a subtask's running delay is roughly proportional to its computational load, running consumption can equivalently refer to the computational load of each subtask.
- The running-consumption constraint value places a rough limit on, or reference for, the consumption that should be deployed on each FPGA: once the sum of the running consumption of the subtasks on an FPGA is as close as possible to the constraint value, one division is complete.
- S15 Deploy the subtask to be deployed on the FPGA of the subtask to be deployed.
- The task deployment method provided by the invention is applicable to any feedforward task that can be split, overcomes the drawbacks of manual task division and deployment in the prior art, and is more general.
- The specific steps of determining the running-consumption constraint value in step S13 include:
- In the first case, if the current number of FPGA boards is greater than the round-up of the quotient, enough boards are available; weighing throughput against inter-board pipeline balance, not all of the FPGAs may actually be used.
- The constraint value is then the maximum running consumption among the subtasks. In the second case, if the number of boards is not greater than the round-up of the quotient, boards are scarce and all of them are needed; the constraint value is then the quotient.
- The first case can achieve a higher throughput than the second, but the actual number of FPGA boards used is not fixed in advance.
- The specific process of constructing the bisection target model for the bisection-iteration method in step S14 includes:
- The index array is an arithmetic sequence with common difference 1.
- The endpoint target index t of the subtasks to be deployed on the current FPGA is obtained.
- The index array serves as the independent variable of the bisection target model, and the sum of the running consumption of all subtasks from the starting index to the independent variable, minus the constraint value, serves as the dependent variable.
- The bisection target model thus forms a monotonically increasing discrete function, which satisfies the precondition for the subsequent bisection iteration.
- After the endpoint target index t of the subtasks to be deployed is obtained from the bisection target model and the starting index, the method further includes:
- Obtaining the endpoint target index t from the bisection target model and the starting index includes:
- Judging whether the judgment point T is the endpoint target index t according to the magnitude relationship between the constraint value and the maximum running consumption specifically includes:
- Confirming that the absolute value of the difference between the running consumption of all subtasks from the starting index n to the judgment point T and the constraint value is closest to 0 includes: letting that absolute difference for indices n through T be a, and the absolute difference for indices n through T+1 be b.
- Confirming that the difference between the running consumption of subtasks n through T and the constraint value lies in the left neighborhood closest to 0 in the bisection target model includes: confirming that the running consumption of subtasks n through T is less than or equal to the maximum running consumption while the running consumption of subtasks n through T+1 exceeds it; the difference then lies in the left neighborhood closest to 0 in the bisection target model.
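The full iterative procedure described above (derive a constraint from the remaining boards and subtasks, cut off a prefix, update, repeat) can be sketched as follows. This is an illustration, not the patent's code; a linear scan stands in for the patent's bisection search, but both locate the largest prefix whose consumption stays within the constraint:

```python
import math

def partition_tasks(costs, num_boards):
    """Illustrative sketch of the full deployment loop.

    Each round derives the constraint value from the *remaining* subtasks
    and boards, cuts off a prefix of subtasks whose total consumption stays
    within the constraint, assigns it to the next FPGA, and repeats.
    """
    n, m, boards = 0, len(costs) - 1, num_boards
    plan = []                                    # (first, last) subtask index per FPGA
    while n <= m:
        total = sum(costs[n:m + 1])
        max_cost = max(costs[n:m + 1])
        constraint = (max_cost if boards > math.ceil(total / max_cost)
                      else total / max_cost)
        if total <= constraint or boards == 1:   # remainder fits on the last FPGA
            plan.append((n, m))
            break
        acc, t = costs[n], n                     # each board gets at least one subtask
        while t < m and acc + costs[t + 1] <= constraint:
            t += 1
            acc += costs[t]
        plan.append((n, t))
        n, boards = t + 1, boards - 1            # update starting index and board count
    return plan
```

For four equal subtasks of consumption 2 on two boards, the constraint is 8/2 = 4.0 and the plan splits evenly into (0, 1) and (2, 3); with five boards available, the constraint drops to the maximum consumption 2 and each subtask gets its own FPGA.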
- Through the bisection-iteration method, the present invention progressively obtains the first and last subtasks deployed on each FPGA, achieving a finer-grained split of the overall task.
- The multi-board FPGA heterogeneous system therefore executes tasks with higher throughput, the pipeline across FPGA boards is better balanced, and the processing efficiency per unit of hardware resources is further improved; and the method is applicable to any feedforward task that can be split,
- overcoming the shortcomings of manual task division and deployment in the prior art and offering greater generality.
- FIG. 7 compares the task-division result of the traditional multi-board FPGA heterogeneous system with that of the present invention, where (a) is the traditional result and (b) is the result of the present invention.
- FIG. 8 is an overall flowchart of the multi-board FPGA heterogeneous system of the present invention
- FIG. 9 is a flowchart of the bisection iterative process in FIG. 8. The overall flow of the multi-board FPGA heterogeneous system of the present invention is described in detail below with reference to FIGS. 8 and 9:
- L(Mt) is the maximum running consumption among the calculated running consumptions of the subtasks.
- The overall hardware platform consists of a master-node PS (processing system) side and multiple slave-node PL (programmable logic) sides; each node is an SoC FPGA.
- The master node communicates directly with the host computer through the Ethernet port on its PS side.
- The slave nodes are connected in sequence; data transmission between nodes uses the RapidIO protocol over high-speed serial transceivers.
- the steps of the task deployment part mainly include:
- The subtasks are deployed to each FPGA according to the division results above; during deployment, the sub-layers assigned to a board are merged into one subtask. Because of the "barrel effect" in pipelined execution, the FPGA with the largest subtask consumption serves as the reference: for the other subtasks, inter-board transmission delay and no-operation waiting (bubble) delays are added so that every FPGA's running time matches, balancing the pipeline. In addition, considering each subtask's resource occupancy on its FPGA, if resource utilization is low, directives such as array partitioning, adding internal pipelines, and loop unrolling can further parallelize the operations and raise resource utilization.
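The balancing step above amounts to padding every faster stage up to the slowest one. A minimal sketch, with illustrative names:

```python
def stage_padding(stage_times):
    """No-op (bubble) delay each pipeline stage needs so that all stages run
    as long as the slowest one, per the "barrel effect" described above."""
    period = max(stage_times)            # the slowest FPGA sets the pipeline period
    return [period - t for t in stage_times]
```

For stage times [5, 3, 4] this returns [0, 2, 1]: the 5-unit stage needs no padding, and the steady-state pipeline emits one result per 5 units.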
- The optical-fiber connection shortens the idle time of computing resources and improves their processing efficiency.
- The link delay is on the order of microseconds, roughly two orders of magnitude below the FPGA execution time.
- The inter-board delay is added at the front end of each slave device's pipeline. Splitting the task into multiple subtasks changes the amount of inter-board communication, but thanks to the high bandwidth of the optical fiber the inter-board delay is almost negligible, so the present invention need not account for its impact.
- FIG. 10 is a schematic diagram of the task execution flow of the multi-board FPGA heterogeneous system of the present invention, which specifically includes the following steps:
- The host computer transmits data through the Ethernet port to the DDR on the master node's PS side for buffering; the PL side sends the data in DDR to the FPGA's task-processing IP core over the AXI bus and stores the IP core's results in BRAM on the PL side. The SRIO core packs the data in BRAM into RapidIO packets and sends them over the optical fiber to the next node, which receives the packets, unpacks them, and stores the raw data in its BRAM.
- Each node reads the previous stage's result from BRAM, hands it to its own IP core for further processing, and forwards this stage's result to the next node through the optical-fiber interface; when the last slave node finishes, the final result is returned to the master node, and the host computer reads it out over Ethernet.
- The present invention was experimentally verified on four Xilinx Zynq 7035 series development boards, with the entire development flow based on the Vivado 2018.2 platform.
- The verification task is the convolutional neural network AlexNet, with a computational load of hundreds of trillions of MAC operations.
- The AlexNet network used contains five convolutional layers and retains all fully connected (FC) layers.
- The baseline throughput is 19.12 images/s; the traditional multi-FPGA pipeline method based on convolutional or FC layers achieves 35.56 images/s; and the multi-FPGA heterogeneous acceleration design method based on task bisection according to the present invention reaches 49.14 images/s.
- Compared with the baseline, the present invention increases throughput by 157% and resource utilization by 61%; compared with the traditional pipeline method, throughput improves by 38.2% and resource utilization by 17.56%.
- the present invention also provides a device including a memory and a processor coupled to each other, and the processor is configured to execute program instructions stored in the memory to implement the task deployment method described above.
- the present invention also provides a computer-readable storage medium on which program data is stored, and when the program data is executed by a processor, the above-mentioned task deployment method is implemented.
- The storage medium 60 stores program instructions 600 executable by the processor, and the program instructions 600 implement the task deployment method of any of the foregoing embodiments. That is, when the task deployment method is implemented as software and sold or used as an independent product, it can be stored in a storage medium 60 readable by an electronic device.
- The storage medium 60 may be a USB flash drive, an optical disc, or a server.
- the disclosed method and device can be implemented in other ways.
- The device implementation described above is only illustrative; for example, the division into modules or units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of this embodiment.
- the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
- the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) or a processor execute all or part of the steps of the methods in the various embodiments of the present application.
- The aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.
Abstract
A task deployment method based on a multi-board FPGA heterogeneous system, comprising: dividing an overall task into several subtasks arranged in task execution order (S11); calculating the run consumption of each subtask (S12); determining, according to the run consumption of each subtask and the number of FPGA boards in the multi-board FPGA heterogeneous system, the run-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed (S13), and then determining the subtasks to be deployed on that FPGA (S14); and deploying those subtasks on that FPGA (S15). In this way, the multi-board FPGA heterogeneous system achieves higher task throughput and a better-balanced inter-board pipeline, further improving the processing efficiency per unit of hardware resource, and the method offers stronger generality.
Description
The present invention relates to the field of heterogeneous computing, and in particular to a task deployment method and device based on a multi-board FPGA heterogeneous system.
At present, for deep-learning inference models that demand high computing power at low power consumption, multi-board FPGA (field-programmable gate array) heterogeneous platforms have become a new direction of exploration and a promising solution.
In a multi-board FPGA heterogeneous system adopting a pipeline scheme, the overall task must be split into multiple subtasks, which are deployed across the FPGAs in pipeline fashion. Existing task-partitioning methods mostly perform a simple split based on surface-level features of the subtasks. In a convolutional neural network, for example, the task is split and deployed merely according to the number of convolutional and fully connected layers, which leaves the whole multi-board FPGA heterogeneous system considerably unbalanced, with much room for improvement. Moreover, because such partitioning is manual, it is subjective and arbitrary, costs time and effort to verify, and does not carry over to other tasks: whenever the executed task changes, the manual partitioning must be redone, so the approach lacks generality.
Therefore, to solve the above problems, a new task deployment method and device based on a multi-board FPGA heterogeneous system must be provided.
Summary of the Invention
To achieve the above objective, the present invention provides a task deployment method based on a multi-board FPGA heterogeneous system, comprising: dividing an overall task into several subtasks arranged in task execution order; calculating the run consumption of each of the subtasks; determining, according to the run consumption of each of the subtasks and the number of FPGA boards in the multi-board FPGA heterogeneous system, the run-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed; under the constraint that the sum of the run consumptions of the subtasks deployed on said FPGA approaches the corresponding run-consumption constraint value, repeatedly splitting the subtasks in two along the task execution order by binary iteration until one of the resulting parts satisfies the constraint, thereby determining that part as the subtasks to be deployed on said FPGA; and deploying those subtasks on said FPGA.
As a further improvement of the present invention, determining the run-consumption constraint value according to the run consumption of each of the subtasks and the number of FPGA boards comprises: dividing the sum of the run consumptions of the subtasks by the largest of the calculated run consumptions to obtain a quotient; judging whether the number of FPGA boards is greater than the quotient rounded up; if so, setting the run-consumption constraint value to the largest run consumption; and if not, setting the run-consumption constraint value to the quotient.
As a further improvement of the present invention, determining the subtasks to be deployed by binary iteration under the above constraint comprises: indexing the subtasks in task execution order with a subscript array starting at n and ending at m, the subscript array being an arithmetic sequence with common difference 1; constructing a bisection target model whose independent variable is the subscript array and whose dependent variable is the sum of the run consumptions of all subtasks from the starting subscript to the independent variable, minus the run-consumption constraint value; and obtaining, from the bisection target model and the starting subscript, the endpoint target subscript t of the subtasks to be deployed on the FPGA.
As a further improvement of the present invention, after obtaining the endpoint target subscript t from the bisection target model and the starting subscript, the method comprises: performing a specified operation in a loop until the sum of the run consumptions of all subtasks from subscript t+1 to subscript m is less than or equal to the run-consumption constraint value, then outputting the endpoint target subscript of the final partition, t = m; wherein the specified operation comprises updating the number of FPGA boards and the starting subscript, and returning to the step of determining the run-consumption constraint value from the run consumption of each of the subtasks and the number of FPGA boards, so as to update the run-consumption constraint value.
As a further improvement of the present invention, obtaining the endpoint target subscript t from the bisection target model and the starting subscript comprises: setting a judgment point T equal to (m+n)/2 rounded down; judging whether the sum of the run consumptions of all subtasks from the starting subscript n to the judgment point T is greater than or equal to the run-consumption constraint value; if so, the endpoint target subscript t lies between n and T, and T is updated to (n+T)/2 rounded down; if not, t lies between T+1 and the final subscript m, and T is updated to (T+1+m)/2 rounded down; then judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t; if so, outputting t = T; if not, updating T to (n+T)/2 rounded down and returning to the step of judging whether the sum of the run consumptions of all subtasks from n to T is greater than or equal to the run-consumption constraint value.
As a further improvement of the present invention, judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t comprises: judging whether the run-consumption constraint value equals the largest run consumption; if so, confirming that the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value lies in the left neighbourhood closest to 0 in the bisection target model, and confirming T as the endpoint target subscript t; if not, confirming that the absolute value of that difference is closest to 0, and confirming T as the endpoint target subscript t.
As a further improvement of the present invention, confirming that the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value is closest to 0 comprises: letting a be that absolute difference for subtasks from n to T, b the absolute difference for subtasks from n to subscript T+1, and c the absolute difference for subtasks from n to subscript T-1; if a is less than or equal to b and a is less than or equal to c, the absolute difference for subtasks from n to T is closest to 0.
As a further improvement of the present invention, confirming that the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value lies in the left neighbourhood closest to 0 comprises: confirming that the run consumption of all subtasks from n to T is less than or equal to the largest run consumption while the run consumption of all subtasks from n to subscript T+1 exceeds the largest run consumption; the difference for subtasks from n to T then lies in the left neighbourhood closest to 0 in the bisection target model.
The present invention further provides an electronic device comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the task deployment method described above.
The present invention further provides a computer-readable storage medium storing program data which, when executed by a processor, implements the task deployment method described above.
Compared with the prior art, the beneficial effects of the present invention are as follows:
In the task deployment method provided by the present invention, the overall task is split into several subtasks, a run-consumption constraint value is set from the run consumption of each subtask and the number of FPGA boards, and the subtasks to be deployed on each FPGA are selected by binary iteration. This yields a finer-grained split of the overall task, higher task throughput for the multi-board FPGA heterogeneous system, a better-balanced inter-board pipeline, and higher processing efficiency per unit of hardware resource. Moreover, the method applies to any partitionable feed-forward task, overcoming the shortcomings of manual partitioning in the prior art and offering stronger generality.
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them without creative effort. In the drawings:
FIG. 1 is a schematic structural diagram of a conventional multi-board FPGA heterogeneous system;
FIG. 2 is a schematic structural diagram of a pipelined multi-board FPGA heterogeneous system;
FIG. 3 compares the multi-cycle execution mode of a conventional multi-board FPGA heterogeneous system with a pipelined multi-board FPGA heterogeneous system;
FIG. 4 is a schematic diagram of conventional task partitioning in a pipelined multi-board FPGA heterogeneous system;
FIG. 5 is a schematic flowchart of an embodiment of the task deployment method for a multi-board FPGA heterogeneous system of the present invention;
FIG. 6 is a task-splitting diagram of an implementation of step S11 of the present invention;
FIG. 7 is a pipeline comparison between the task-partitioning result of a conventional multi-board FPGA heterogeneous system and that of the present invention;
FIG. 8 is the overall flowchart of the multi-board FPGA heterogeneous system of the present invention;
FIG. 9 is a flowchart of the binary iteration process in FIG. 8;
FIG. 10 is a schematic flowchart of task execution in the multi-board FPGA heterogeneous system of the present invention;
FIG. 11 is a photograph of the experimental verification of the multi-board FPGA heterogeneous system of the present invention;
FIG. 12 is a schematic framework diagram of an embodiment of the computer-readable storage medium of the present invention.
The solutions of the embodiments of the present application are described in detail below with reference to the accompanying drawings.
In the following description, specific details such as particular system structures, interfaces, and technologies are set forth for purposes of illustration rather than limitation, so as to provide a thorough understanding of the present application.
Herein, the terms "system" and "network" are often used interchangeably. The term "and/or" merely describes an association between objects and indicates three possible relationships: A and/or B may mean A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects. Furthermore, "multiple" herein means two or more.
Multi-board FPGA heterogeneity cascades multiple hardware computing units and distributes the computational load of a task across them. Compared with a CPU or GPGPU it offers better flexibility and a lower energy cost, and is better suited to deploying deep-learning inference algorithms that execute artificial neural network models.
For example, the conventional multi-board FPGA heterogeneous system shown in FIG. 1 consists of one host device and multiple slave devices interconnected over a PCIe bus. The host device comprises one or more general-purpose CPUs and their memory; each slave device comprises an FPGA chip and device memory. The system works as follows: a CPU core transfers the data needed by an FPGA from host memory to slave memory over the PCIe bus and starts the slave device for parallel data processing; apart from control, the CPU core performs no computation, or only a small amount. When the slave device finishes processing, the result data are transferred back to the host over the PCIe bus. The conventional multi-board FPGA heterogeneous system therefore spends a great deal of time on long-range data communication.
To mitigate the heavy communication overhead of the conventional system, the pipelined multi-board FPGA heterogeneous system shown in FIG. 2 emerged; it likewise consists of one host device and multiple slave devices. The difference is that its host device is a CPU+FPGA heterogeneous system or an SoC chip, while the slave devices may either match the host (CPU+FPGA or SoC) or all be pure FPGA devices. In a pipelined system, the overall task is split into several subtasks deployed across the FPGAs in pipeline fashion. Compared with the conventional system, the pipelined system greatly reduces communication demands and each device's communication wait time during a single task, improving hardware processing efficiency while raising throughput. As shown in FIG. 3, which compares the multi-cycle execution mode of the conventional system with the pipelined mode, with per-stage execution times $t_1, t_2, \ldots, t_K$ the throughput of the multi-cycle mode is $1 / \sum_{i=1}^{K} t_i$,
while the throughput of the pipelined mode is $1 / \max_{1 \le i \le K} t_i$.
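The two throughput expressions can be checked numerically. The sketch below uses illustrative, assumed stage times (not figures from the patent):

```python
# Illustrative, assumed per-stage execution times in ms (K = 4 stages).
stage_times_ms = [12.0, 9.5, 11.0, 10.5]

# Multi-cycle mode: a new task starts only after all K stages finish,
# so throughput = 1 / (t_1 + t_2 + ... + t_K).
multi_cycle_tps = 1000.0 / sum(stage_times_ms)

# Pipelined mode: once the pipeline is full, one result emerges per
# slowest-stage interval, so throughput = 1 / max(t_i).
pipelined_tps = 1000.0 / max(stage_times_ms)

print(f"multi-cycle: {multi_cycle_tps:.2f}/s, pipelined: {pipelined_tps:.2f}/s")
```

With these numbers the pipelined mode is roughly 3.6 times faster, which is why balancing the slowest stage matters so much below.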
In pipelined multi-board FPGA heterogeneous systems, conventional task partitioning mostly performs a simple split based on surface-level features of the subtasks. In the conventional partitioning of LeNet illustrated in FIG. 4, for example, the overall task is divided into subtasks only along convolutional layers, and all fully connected layers are deployed as a single subtask on one FPGA, leaving the whole multi-board FPGA heterogeneous system considerably unbalanced.
To improve the balance of task partitioning and deployment in pipelined multi-board FPGA heterogeneous systems, the present invention provides a task deployment method based on a multi-board FPGA heterogeneous system. Referring to FIG. 5, a schematic flowchart of an embodiment of the method, it comprises the following steps:
S11: Divide the overall task into several subtasks arranged in task execution order.
Specifically, once the overall task is determined, it is split into as many subtasks as possible without breaking its internal structure, as in the task-splitting diagram of one implementation shown in FIG. 6.
S12: Calculate the run consumption of each subtask.
Specifically, the split subtasks are synthesized with the Vivado HLS tool to obtain each subtask's running time, resource occupation, and other results, from which each subtask's run consumption is derived.
Note that in one optional embodiment the run consumption refers to running latency, so each subtask's run consumption is its running latency. In another optional embodiment, since a subtask's running latency is roughly proportional to its computational load, the run consumption may also refer to the subtask's computational load.
S13: Determine, according to the run consumption of each subtask and the number of FPGA boards in the multi-board FPGA heterogeneous system, the run-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed.
S14: Under the constraint that the sum of the run consumptions of the subtasks deployed on that FPGA approaches the corresponding run-consumption constraint value, repeatedly split the subtasks in two along the task execution order by binary iteration until one of the resulting parts satisfies the constraint, thereby determining that part as the subtasks to be deployed on that FPGA.
In this step, the run-consumption constraint value serves as a rough bound or reference for the amount of run consumption that should be deployed on an FPGA. When the summed run consumption of the subtasks on the current FPGA is as close as possible to the constraint value, one partition is complete.
S15: Deploy the selected subtasks on that FPGA.
In this way a finer-grained split of the overall task is achieved; the multi-board FPGA heterogeneous system attains higher task throughput, a better-balanced inter-board pipeline, and higher processing efficiency per unit of hardware resource. Moreover, the task deployment method provided by the present invention applies to any partitionable feed-forward task, overcoming the shortcomings of manual partitioning in the prior art and offering stronger generality.
In a specific implementation, determining the run-consumption constraint value in step S13 comprises: dividing the sum of the run consumptions of the subtasks by the largest of the calculated run consumptions to obtain a quotient; and judging whether the number of FPGA boards is greater than the quotient rounded up. If so, the constraint value is the largest run consumption; if not, it is the quotient.
Specifically, in the first case, where the number of FPGA boards exceeds the quotient rounded up, the available boards are plentiful, but balancing throughput against inter-board pipeline balance means not all boards will necessarily be used; the constraint value is then the largest run consumption among the subtasks. In the second case, where the number of FPGA boards is smaller, the boards are few and all of them are needed; the constraint value is then the quotient. The first case achieves higher throughput than the second, but the number of boards actually used is not fixed in advance.
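The rule of step S13 can be sketched directly from the description above. The function name and example costs below are our own illustrative assumptions, and the code follows the claim wording literally (including returning the quotient itself in the scarce-board case):

```python
import math

def constraint_value(costs, k):
    """Run-consumption constraint value LM per the rule above (sketch).

    costs: per-subtask run consumptions (e.g. latencies in ms).
    k: number of FPGA boards in the system.
    """
    total, largest = sum(costs), max(costs)
    quotient = total / largest
    if k > math.ceil(quotient):   # boards are plentiful
        return largest            # bound each board by the largest subtask cost
    return quotient               # boards are scarce: use the quotient as the bound

print(constraint_value([3.0, 5.0, 2.0, 4.0], 8))  # plentiful boards -> 5.0
```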
In a specific implementation, constructing the bisection target model for the binary iteration of step S14 proceeds as follows. First, the subtasks are indexed in task execution order with a subscript array starting at n and ending at m, the subscripts forming an arithmetic sequence with common difference 1. Next, a bisection target model is constructed with the subscript array as its independent variable; its dependent variable is the sum of the run consumptions of all subtasks from the starting subscript up to the independent variable, minus the run-consumption constraint value. Finally, the endpoint target subscript t of the subtasks to be deployed on the FPGA is obtained from the bisection target model and the starting subscript.
Note that since every subtask's run consumption is positive, taking the subscript array as the independent variable and the above difference as the dependent variable makes the bisection target model a monotonically increasing discrete function, satisfying the precondition for the subsequent binary iteration.
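A short sketch of why the bisection target model is monotonically increasing: the code below evaluates f(i) = (sum of run consumptions from the starting subscript to i) − LM for assumed positive costs, using 0-based indices rather than the 1-based subscripts of the text; because every cost is positive, the values strictly increase with i:

```python
import itertools

def target(costs, n, LM):
    """Bisection target f(i) = sum(costs[n..i]) - LM for i = n..m (0-based sketch).

    Monotonically increasing in i because every cost is positive, which is the
    precondition for binary search over this discrete function.
    """
    prefix = list(itertools.accumulate(costs))          # running sums of costs
    base = prefix[n - 1] if n > 0 else 0.0              # sum of costs before n
    return [prefix[i] - base - LM for i in range(n, len(costs))]

vals = target([3.0, 5.0, 2.0, 4.0], 0, 6.0)
print(vals)  # [-3.0, 2.0, 4.0, 8.0] -- strictly increasing, crosses zero once
```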
Further, because each single partition leaves the run-consumption constraint value biased, the constraint value must be iteratively updated after every partition. Specifically, in one implementation, after obtaining the endpoint target subscript t as above, the method comprises: performing a specified operation in a loop until the sum of the run consumptions of all subtasks from subscript t+1 to subscript m is less than or equal to the constraint value, then outputting the endpoint target subscript of the final partition, t = m; the specified operation comprises updating the number of FPGA boards and the starting subscript and returning to step S13 to update the constraint value.
In one implementation, obtaining the endpoint target subscript t from the bisection target model and the starting subscript comprises:
First, setting a judgment point T equal to (m+n)/2 rounded down; then judging whether the sum of the run consumptions of all subtasks from the starting subscript n to the judgment point T is greater than or equal to the run-consumption constraint value; if so, the endpoint target subscript t lies between n and T, and T is updated to (n+T)/2 rounded down; if not, t lies between T+1 and the final subscript m, and T is updated to (T+1+m)/2 rounded down; finally, judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t; if so, outputting t = T; if not, updating T to (n+T)/2 rounded down and returning to the step of judging whether the sum of the run consumptions of all subtasks from n to T is greater than or equal to the constraint value.
Further, judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t specifically comprises: judging whether the constraint value equals the largest run consumption; if so, confirming that the difference between the run consumption of all subtasks from n to T and the constraint value lies in the left neighbourhood closest to 0 in the bisection target model, and confirming T as the endpoint target subscript t; if not, confirming that the absolute value of that difference is closest to 0, and confirming T as the endpoint target subscript t.
In one implementation, confirming that the absolute value of the difference between the run consumption of all subtasks from n to T and the constraint value is closest to 0 comprises: letting a be that absolute difference for subtasks from n to T, b the absolute difference for subtasks from n to subscript T+1, and c the absolute difference for subtasks from n to subscript T-1; if a ≤ b and a ≤ c, then the absolute difference for subtasks from n to T is closest to 0.
In one implementation, confirming that the difference between the run consumption of all subtasks from n to T and the constraint value lies in the left neighbourhood closest to 0 comprises: confirming that the run consumption of all subtasks from n to T is less than or equal to the largest run consumption while the run consumption of all subtasks from n to subscript T+1 exceeds the largest run consumption; the difference for subtasks from n to T then lies in the left neighbourhood closest to 0 in the bisection target model.
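The two selection rules above (the "left neighbourhood" test when the constraint equals the largest consumption, and the closest-to-zero test otherwise) can be illustrated without the bisection machinery. The linear scan below applies the same rules over the monotone prefix sums; it is a simplified sketch of ours, not the patent's bisection implementation, and uses 0-based indices:

```python
def endpoint_index(costs, n, LM, prefer_left=True):
    """Endpoint subscript t for the board whose first subtask is n (sketch).

    prefer_left=True mirrors the left-neighbourhood rule: take the largest t
    whose cumulative sum from n stays <= LM. prefer_left=False mirrors the
    closest-to-zero rule: take the t whose cumulative sum is nearest LM.
    """
    m = len(costs) - 1
    running, best_t, best_gap = 0.0, n, float("inf")
    last_within = n                      # deploy at least one subtask per board
    for t in range(n, m + 1):
        running += costs[t]
        if running <= LM:
            last_within = t              # still inside the left neighbourhood
        gap = abs(running - LM)
        if gap < best_gap:
            best_t, best_gap = t, gap    # closest cumulative sum so far
    return last_within if prefer_left else best_t

print(endpoint_index([3.0, 5.0, 2.0, 4.0], 0, 9.0))  # -> 1 (3 + 5 <= 9)
```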
Thus, by binary iteration the present invention successively finds the first and last subtasks to be deployed on each FPGA, achieving a finer-grained split of the overall task, higher task throughput, a better-balanced inter-board pipeline, and higher processing efficiency per unit of hardware resource; moreover, the method applies to any partitionable feed-forward task, overcoming the shortcomings of manual partitioning in the prior art and offering stronger generality. See FIG. 7, which compares the pipelines resulting from the task partitioning of a conventional multi-board FPGA heterogeneous system (a) and from the task partitioning of the present invention (b).
For ease of understanding, refer to FIGS. 8 and 9: FIG. 8 is the overall flowchart of the multi-board FPGA heterogeneous system of the present invention, and FIG. 9 is a flowchart of the binary iteration process in FIG. 8. The overall flow is described in detail below with reference to both figures:
First, the m subtasks are arranged in task execution order as $M_1, M_2, M_3, \ldots, M_m$. Correspondingly, the run consumption of each subtask is denoted $L(M_i)$ (in ms), and the $L(M_i)$ are arranged in task execution order; the ordered sequence of all $L(M_i)$ is then called the array M. For example, if there are three subtasks, in execution order $M_1, M_2, M_3$, the array M is $L(M_1), L(M_2), L(M_3)$.
The program starts by taking the array M and the number of FPGA boards K as input and initializing n = 1. The run-consumption constraint value is then set by testing whether the following holds:
$K > \left\lceil \sum_{i=n}^{m} L(M_i) \,/\, L(M_t) \right\rceil$
where $L(M_t)$ is the largest of the calculated subtask run consumptions.
If it holds, the available FPGA boards are plentiful, and after deployment each board's run consumption might otherwise be small; the constraint value is therefore set to $LM = L(M_t)$, which increases the amount of subtasks each board must carry and thus raises each board's resource utilization.
If it does not hold, the available FPGA boards are few, and even using all of them the run consumption could be excessive; the constraint value is therefore set by the following formula, which reduces the amount of subtasks each board must carry and thus balances the pipeline's run consumption:
$LM = \sum_{i=n}^{m} L(M_i) \,/\, L(M_t)$
Next, the program enters the binary-iteration subroutine (FIG. 9) to output the endpoint target subscript t:
(1) Set the judgment point T to $\lfloor (m+n)/2 \rfloor$;
(2) Judge whether the summed run consumption of all subtasks from the starting subscript n to the judgment point T is greater than or equal to the run-consumption constraint value. If so, the endpoint target subscript t lies between n and T; if not, t lies between T+1 and m. Update the value of T according to the flowchart in FIG. 9;
(3) Judge the relation between the run-consumption constraint value and the largest run consumption, i.e. whether $LM = L(M_t)$. If they are equal, the available FPGA boards are plentiful; judge whether $\sum_{i=n}^{T} L(M_i) - LM$ lies in the left neighbourhood closest to 0: if so, go to (4), otherwise go to (5). If they are not equal, the FPGA boards actually available are few; judge whether $\left| \sum_{i=n}^{T} L(M_i) - LM \right|$ is closest to 0: if so, go to (4), otherwise go to (5);
(4) Output the endpoint target subscript t = T, and the subroutine ends;
(5) Update the judgment point T to $\lfloor (n+T)/2 \rfloor$ and return to (2).
When the binary-iteration subroutine completes, judge whether $\sum_{i=t+1}^{m} L(M_i) \le LM$ holds. If so, output the final partition result t = m, and the whole partitioning program ends; if not, update K = K-1 and n = t+1, and return to the constraint-value setting step to reset the run-consumption constraint value.
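The overall iterative flow can be sketched end to end. One assumption of ours: the text defines the scarce-board constraint as the quotient of total consumption over the largest single consumption, which is a board count rather than a latency, so this sketch instead balances the remaining consumption over the k remaining boards; everything else follows the loop above (set LM, cut off a prefix within LM, update k and n):

```python
import math

def partition(costs, k):
    """Assign subtasks (0-based indices) to up to k boards, illustrative sketch.

    Each round recomputes the constraint LM from the remaining subtasks, takes
    the longest prefix whose summed consumption stays within LM (at least one
    subtask), then advances the start index and decrements the board count.
    """
    boards, n, m = [], 0, len(costs) - 1
    while n <= m:
        remaining = costs[n:]
        total, largest = sum(remaining), max(remaining)
        quotient = total / largest
        # Plentiful boards: bound by the largest cost. Scarce boards: balance
        # the remainder over the k remaining boards (our assumption, see above).
        LM = largest if k > math.ceil(quotient) else total / max(k, 1)
        t, running = n, costs[n]
        while t < m and running + costs[t + 1] <= LM:
            t += 1
            running += costs[t]
        boards.append(list(range(n, t + 1)))
        k, n = k - 1, t + 1
    return boards

print(partition([3.0, 5.0, 2.0, 4.0, 1.0], 3))  # -> [[0], [1], [2, 3, 4]]
```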
Beyond the task partitioning above, the task deployment and task execution parts are described in detail below:
The overall hardware platform combines one master-node PS (processing system) side with multiple slave-node PL sides; each node is an SoC FPGA. The master node communicates directly with the host computer, connected through the Ethernet port on its PS side. The slave nodes are chained in order, with inter-node data transfer using the RapidIO protocol over high-speed serial transceivers. The task deployment steps mainly comprise:
Deploy each FPGA's subtasks according to the partitioning result above. During deployment the sub-layers are merged into one subtask per board. Because pipelined execution suffers from the "bucket effect" (the slowest stage limits the whole pipeline), the FPGA with the largest subtask run consumption is taken as the reference, and inter-board transfer delays and no-op (bubble) delays are inserted after the other subtasks so that all FPGAs take the same running time, balancing the pipeline. In addition, considering each subtask's resource occupation in its FPGA, if resource utilization is low it can be further optimized by parallelizing the computation-intensive parts, for example through array partitioning, internal pipelining, and loop unrolling directives, to maintain high resource utilization.
Configure each node's bitstream file to implement subtask execution and the inter-node data paths. Synthesize the partitioned IP cores to obtain hardware-resource reports and running clock cycles. Add the IP cores to the project, program the full bitstream into the corresponding FPGA, and configure each part's SDK (software development kit) drivers to establish the on-board data paths and the external GTX (gigabit transceiver) high-speed serial interfaces.
Make the physical connections, debug, power up the FPGAs, and test the corresponding functions. Connect the master node's Ethernet port to the host computer, chain the FPGAs with optical fiber, and test and debug the physical paths.
It should be noted that the optical-fiber connections guarantee throughput while shortening the idle time of the computing resources and improving processing efficiency. As for inter-node transfer latency: because the devices are linked by 10-gigabit optical fiber, the latency is on the order of microseconds, roughly two orders of magnitude below the FPGAs' execution time, and the inter-board delay is accounted for at the front of each slave device's pipeline. Although the task has been split into very many subtasks and the inter-board traffic changes accordingly, the high bandwidth of the fiber makes the inter-board delay almost negligible, so the present invention need not consider its impact.
Run the connected FPGAs as a whole: send the data to be processed to the platform, and after processing return the data to the host computer through the Ethernet port.
As shown in FIG. 10, the task execution flow of the multi-board FPGA heterogeneous system of the present invention comprises the following steps:
The host computer transfers data through the Ethernet port into the DDR on the master node's PS side for buffering; the PL side sends the data from DDR over the AXI bus to that FPGA's task-processing IP core; the IP core's results are stored in the PL side's BRAM; the SRIO core packs the BRAM data into RapidIO protocol packets and sends them over optical fiber to the next node; the slave node receives the packets, unpacks them, and stores the raw data in BRAM; it reads the previous stage's result from BRAM, passes it to its own IP core for further processing, and transmits this stage's result through the fiber interface to the next node; when the last slave node finishes, the final result is returned to the master node, and the host computer can read it out through the Ethernet port.
To verify the effect of the present invention, as shown in FIG. 11, an experiment was run on four Xilinx Zynq 7035-series development boards, with the whole development process based on the Vivado 2018.2 environment. The verification task was AlexNet, a convolutional neural network requiring several hundred mega-MAC operations; the AlexNet used contains five convolutional layers, with all FC (fully connected) layers omitted. The non-pipelined multi-cycle method achieves a throughput of 19.12 images/s; the conventional multi-FPGA pipeline method that partitions by convolutional or FC layer achieves 35.56 images/s; the multi-FPGA heterogeneous acceleration design based on task bisection proposed by the present invention reaches 49.14 images/s. The present invention improves throughput by 157% and resource utilization by 61% over the multi-cycle method, and improves throughput by 38.2% and resource utilization by 17.56% over the conventional pipeline method.
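The reported throughput-improvement percentages are consistent with the raw figures, as a quick check shows:

```python
# Throughput figures from the experiment, in images/s.
multi_cycle, conventional_pipeline, proposed = 19.12, 35.56, 49.14

# Relative improvement: (new / old - 1) * 100%.
vs_multi_cycle = 100.0 * (proposed / multi_cycle - 1)          # ~157%
vs_conventional = 100.0 * (proposed / conventional_pipeline - 1)  # ~38.2%

print(f"vs multi-cycle: {vs_multi_cycle:.0f}%")
print(f"vs conventional pipeline: {vs_conventional:.1f}%")
```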
The present invention further provides a device comprising a memory and a processor coupled to each other, the processor executing program instructions stored in the memory to implement the task deployment method described above.
As shown in FIG. 12, the present invention also provides a computer-readable storage medium storing program data which, when executed by a processor, implements the task deployment method described above. The storage medium 60 stores program instructions 600 executable by a processor, the program instructions 600 implementing the task deployment method of any of the above embodiments. That is, when the task deployment method is implemented in software and sold or used as an independent product, it may be stored in a storage device 60 readable by an electronic device; the storage device 60 may be a USB flash drive, an optical disc, a server, or the like.
In the several embodiments provided in this application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is only a logical functional division, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected as actually needed to achieve the purpose of the solutions of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied as a software product stored in a storage medium and including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or some of the steps of the methods of the embodiments of this application. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are merely embodiments of this application and do not limit its patent scope. Any equivalent structural or process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.
Claims (10)
- A task deployment method based on a multi-board FPGA heterogeneous system, characterized by comprising: dividing an overall task into several subtasks arranged in task execution order; calculating the run consumption of each of the subtasks; determining, according to the run consumption of each of the subtasks and the number of FPGA boards in the multi-board FPGA heterogeneous system, the run-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed; under the constraint that the sum of the run consumptions of the subtasks deployed on said FPGA approaches the corresponding run-consumption constraint value, repeatedly splitting the subtasks in two along the task execution order by binary iteration until one of the resulting parts satisfies the constraint, thereby determining that part as the subtasks to be deployed on said FPGA; and deploying the subtasks to be deployed on said FPGA.
- The task deployment method according to claim 1, characterized in that determining the run-consumption constraint value according to the run consumption of each of the subtasks and the number of FPGA boards comprises: dividing the sum of the run consumptions of the subtasks by the largest of the calculated run consumptions to obtain a quotient; judging whether the number of FPGA boards is greater than the quotient rounded up; if so, determining the run-consumption constraint value to be the largest run consumption; and if not, determining the run-consumption constraint value to be the quotient.
- The task deployment method according to claim 2, characterized in that determining the subtasks to be deployed by binary iteration under the constraint comprises: indexing the subtasks in task execution order with a subscript array starting at n and ending at m, the subscript array being an arithmetic sequence with common difference 1; constructing a bisection target model with the subscript array as independent variable, the dependent variable being the sum of the run consumptions of all subtasks from the starting subscript to the independent variable minus the run-consumption constraint value; and obtaining, from the bisection target model and the starting subscript, the endpoint target subscript t of the subtasks to be deployed on the FPGA.
- The task deployment method according to claim 3, characterized in that, after obtaining the endpoint target subscript t from the bisection target model and the starting subscript, the method comprises: performing a specified operation in a loop until the sum of the run consumptions of all subtasks from subscript t+1 to subscript m is less than or equal to the run-consumption constraint value, then outputting the endpoint target subscript of the final partition, t = m; wherein the specified operation comprises updating the number of FPGA boards and the starting subscript, and returning to the step of determining the run-consumption constraint value corresponding to the FPGA on which subtasks are to be deployed according to the run consumption of each of the subtasks and the number of FPGA boards, so as to update the run-consumption constraint value.
- The task deployment method according to claim 3, characterized in that obtaining the endpoint target subscript t from the bisection target model and the starting subscript comprises: setting a judgment point T equal to (m+n)/2 rounded down; judging whether the sum of the run consumptions of all subtasks from the starting subscript n to the judgment point T is greater than or equal to the run-consumption constraint value; if so, the endpoint target subscript t lies between the starting subscript n and the judgment point T, and the judgment point T is updated to (n+T)/2 rounded down; if not, the endpoint target subscript t lies between T+1 and the final subscript m, and the judgment point T is updated to (T+1+m)/2 rounded down; judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t; if so, outputting the endpoint target subscript t = T; if not, updating the judgment point T to (n+T)/2 rounded down and returning to the step of judging whether the sum of the run consumptions of all subtasks from the starting subscript n to the judgment point T is greater than or equal to the run-consumption constraint value.
- The task deployment method according to claim 5, characterized in that judging, from the relation between the run-consumption constraint value and the largest run consumption, whether the judgment point T is the endpoint target subscript t comprises: judging whether the run-consumption constraint value equals the largest run consumption; if so, confirming that the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value lies in the left neighbourhood closest to 0 in the bisection target model, and confirming the judgment point T as the endpoint target subscript t; if not, confirming that the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value is closest to 0, and confirming the judgment point T as the endpoint target subscript t.
- The task deployment method according to claim 6, characterized in that confirming that the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value is closest to 0 comprises: setting a as the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value, b as the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to subscript T+1 and the run-consumption constraint value, and c as the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to subscript T-1 and the run-consumption constraint value; and confirming that, if a is less than or equal to b and a is less than or equal to c, the absolute value of the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value is closest to 0.
- The task deployment method according to claim 6, characterized in that confirming that the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value lies in the left neighbourhood closest to 0 in the bisection target model comprises: confirming that the run consumption of all subtasks from the starting subscript n to the judgment point T is less than or equal to the largest run consumption and the run consumption of all subtasks from the starting subscript n to subscript T+1 is greater than the largest run consumption, whereupon the difference between the run consumption of all subtasks from the starting subscript n to the judgment point T and the run-consumption constraint value lies in the left neighbourhood closest to 0 in the bisection target model.
- An electronic device, characterized by comprising a memory and a processor coupled to each other, the processor being configured to execute program instructions stored in the memory to implement the task deployment method according to any one of claims 1-8.
- A computer-readable storage medium having program data stored thereon, characterized in that the program data, when executed by a processor, implements the task deployment method according to any one of claims 1-8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010394248.6A CN111736966B (zh) | 2020-05-11 | 2020-05-11 | Task deployment method and device based on multi-board FPGA heterogeneous system |
CN202010394248.6 | 2020-05-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021227418A1 true WO2021227418A1 (zh) | 2021-11-18 |
Family
ID=72647085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129554 WO2021227418A1 (zh) | 2020-05-11 | 2020-11-17 | Task deployment method and device based on multi-board FPGA heterogeneous system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111736966B (zh) |
WO (1) | WO2021227418A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116204236A (zh) * | 2023-04-27 | 2023-06-02 | 深圳艾为电气技术有限公司 | Template-based PTC driver configuration method and apparatus |
CN118014022A (zh) * | 2024-01-29 | 2024-05-10 | 中国人民解放军陆军炮兵防空兵学院 | Deep-learning-oriented general-purpose FPGA heterogeneous acceleration method and device |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111736966B (zh) * | 2020-05-11 | 2022-04-19 | 深圳先进技术研究院 | Task deployment method and device based on multi-board FPGA heterogeneous system |
CN113485818A (zh) * | 2021-08-03 | 2021-10-08 | 北京八分量信息科技有限公司 | Scheduling method and apparatus for heterogeneous tasks, and related products |
CN114138481A (zh) * | 2021-11-26 | 2022-03-04 | 浪潮电子信息产业股份有限公司 | Data processing method, apparatus, and medium |
CN115543908B (zh) * | 2022-11-28 | 2023-03-28 | 成都航天通信设备有限责任公司 | FPGA-based Aurora bus data interaction system |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100146503A1 (en) * | 2008-12-10 | 2010-06-10 | Institute For Information Industry | Scheduler of virtual machine module, scheduling method therefor, and device containing computer software |
CN104123190A (zh) * | 2014-07-23 | 2014-10-29 | 浪潮(北京)电子信息产业有限公司 | Load balancing method and device for heterogeneous cluster systems |
CN104598310A (zh) * | 2015-01-23 | 2015-05-06 | 武汉理工大学 | Low-power scheduling method based on module partitioning with FPGA partial dynamic reconfiguration technology |
CN106874158A (zh) * | 2017-01-11 | 2017-06-20 | 广东工业大学 | Whole-program power consumption measurement method for heterogeneous systems |
CN108563808A (zh) * | 2018-01-05 | 2018-09-21 | 中国科学技术大学 | Design method of an FPGA-based heterogeneous reconfigurable graph computing accelerator system |
CN110704360A (zh) * | 2019-09-29 | 2020-01-17 | 华中科技大学 | Graph computing optimization method based on heterogeneous FPGA dataflow |
CN111736966A (zh) * | 2020-05-11 | 2020-10-02 | 深圳先进技术研究院 | Task deployment method and device based on multi-board FPGA heterogeneous system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103810137A (zh) * | 2014-01-07 | 2014-05-21 | 南京大学 | Method for parallelizing the NCS algorithm on a multi-FPGA platform |
CN107122243B (zh) * | 2017-04-12 | 2018-07-24 | 浙江远算云计算有限公司 | Heterogeneous cluster system for CFD simulation computation and method for computing CFD tasks |
-
2020
- 2020-05-11 CN CN202010394248.6A patent/CN111736966B/zh active Active
- 2020-11-17 WO PCT/CN2020/129554 patent/WO2021227418A1/zh active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100146503A1 (en) * | 2008-12-10 | 2010-06-10 | Institute For Information Industry | Scheduler of virtual machine module, scheduling method therefor, and device containing computer software |
CN104123190A (zh) * | 2014-07-23 | 2014-10-29 | 浪潮(北京)电子信息产业有限公司 | Load balancing method and device for heterogeneous cluster systems |
CN104598310A (zh) * | 2015-01-23 | 2015-05-06 | 武汉理工大学 | Low-power scheduling method based on module partitioning with FPGA partial dynamic reconfiguration technology |
CN106874158A (zh) * | 2017-01-11 | 2017-06-20 | 广东工业大学 | Whole-program power consumption measurement method for heterogeneous systems |
CN108563808A (zh) * | 2018-01-05 | 2018-09-21 | 中国科学技术大学 | Design method of an FPGA-based heterogeneous reconfigurable graph computing accelerator system |
CN110704360A (zh) * | 2019-09-29 | 2020-01-17 | 华中科技大学 | Graph computing optimization method based on heterogeneous FPGA dataflow |
CN111736966A (zh) * | 2020-05-11 | 2020-10-02 | 深圳先进技术研究院 | Task deployment method and device based on multi-board FPGA heterogeneous system |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116204236A (zh) * | 2023-04-27 | 2023-06-02 | 深圳艾为电气技术有限公司 | Template-based PTC driver configuration method and apparatus |
CN116204236B (zh) | 2023-04-27 | 2023-09-29 | 深圳艾为电气技术有限公司 | Template-based PTC driver configuration method and apparatus |
CN118014022A (zh) | 2024-01-29 | 2024-05-10 | 中国人民解放军陆军炮兵防空兵学院 | Deep-learning-oriented general-purpose FPGA heterogeneous acceleration method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111736966B (zh) | 2022-04-19 |
CN111736966A (zh) | 2020-10-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021227418A1 (zh) | Task deployment method and device based on multi-board FPGA heterogeneous system | |
WO2021115052A1 (zh) | Task processing method, task processing apparatus, and electronic device for heterogeneous chips | |
US8782591B1 (en) | Physically aware logic synthesis of integrated circuit designs | |
CN107526645A (zh) | Communication optimization method and system | |
US11748548B2 (en) | Hierarchical clock tree implementation | |
CN104780213A (zh) | Dynamic load optimization method for master-slave distributed graph processing systems | |
US12056085B2 (en) | Determining internodal processor interconnections in a data-parallel computing system | |
US10255399B2 (en) | Method, apparatus and system for automatically performing end-to-end channel mapping for an interconnect | |
CN111049900B (zh) | IoT stream-computing scheduling method, apparatus, and electronic device | |
CN118014022A (zh) | Deep-learning-oriented general-purpose FPGA heterogeneous acceleration method and device | |
WO2020155083A1 (zh) | Distributed training method and apparatus for neural networks | |
CN116670660A (zh) | Simulation model generation method and apparatus for a network-on-chip, electronic device, and computer-readable storage medium | |
CN113139650B (zh) | Tuning method and computing apparatus for deep learning models | |
US20210133579A1 (en) | Neural network instruction streaming | |
CN106897137B (zh) | Physical-machine-to-virtual-machine mapping conversion method based on virtual machine live migration | |
CN115250251B (zh) | Transmission path planning method and apparatus in network-on-chip simulation, electronic device, and computer-readable storage medium | |
US9892227B1 (en) | Systems, methods and storage media for clock tree power estimation at register transfer level | |
CN114138484A (zh) | Resource allocation method, apparatus, and medium | |
US7110928B1 (en) | Apparatuses and methods for modeling shared bus systems | |
CN103019743A (zh) | Modular modeling method for signal-processing flow graphs and multiprocessor hardware platforms | |
CN115345100A (zh) | Network-on-chip simulation model, dynamic path planning method and apparatus, and multi-core chip | |
KR20200144462A (ko) | Hierarchical clock tree implementation | |
KR101726663B1 (ko) | Method and system for automating input/output port connections between models using a cloud concept in model-based design in a block-diagram environment | |
CN109710314A (zh) | Method for constructing graphs based on a graph-structured distributed parallel pattern | |
US8769449B1 (en) | System level circuit design |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20935727 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 170423) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20935727 Country of ref document: EP Kind code of ref document: A1 |