WO2021097962A1 - Task processing method, task processing apparatus, and electronic device for heterogeneous chips
Task processing method, task processing apparatus, and electronic device for heterogeneous chips - Download PDF
- Publication number
- WO2021097962A1 (PCT/CN2019/124350)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subtasks
- task
- subtask
- pipeline
- single task
- Prior art date
Classifications
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead, using instruction pipelines
- G06F9/3869—Implementation aspects, e.g. pipeline latches; pipeline synchronisation and clocking
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Definitions
- This application belongs to the technical field of high-performance computing, and in particular relates to a task processing method, a task processing device, electronic equipment, and a computer-readable storage medium for heterogeneous chips.
- a multi-board heterogeneous many-core includes a host device and multiple accelerator devices, and each device is interconnected through a bus.
- hardware resources are allocated according to the scale of the computing task, and each device executes a single-cycle task, so that most of the processing resources are in the waiting stage when the device is running, which reduces the operating efficiency of the computing device to a certain extent.
- this application provides a heterogeneous chip data processing method, data processing device, electronic device, and computer-readable storage medium, which can greatly reduce the waiting time of processing resources when processing tasks and improve the processing efficiency of hardware resources.
- this application provides a task processing method for heterogeneous chips, including:
- this application provides a task processing device for heterogeneous chips, including:
- the receiving module is used to receive the execution instruction of a single task
- the dividing module is used to divide the above single task into at least two subtasks in sequence;
- a distribution module for distributing each subtask to different computing chips in the heterogeneous chip
- the processing module is used to control the above-mentioned different computing chips to sequentially process the above-mentioned subtasks in the first pipeline mode, wherein the number of stages of the first pipeline corresponding to the first pipeline mode is the same as the number of subtasks, and the first-stage operation time of the first pipeline includes the execution time of a subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the present application provides an electronic device, including a memory, a processor, and a computer program stored in the foregoing memory and capable of running on the foregoing processor.
- when the foregoing processor executes the foregoing computer program, the method provided in the foregoing first aspect is implemented.
- the present application provides a computer-readable storage medium.
- the above-mentioned computer-readable storage medium stores a computer program, and when the above-mentioned computer program is executed by a processor, the method provided in the first aspect is implemented.
- the present application provides a computer program product, which when the computer program product runs on an electronic device, causes the electronic device to execute the method provided in the above-mentioned first aspect.
- the execution instruction of a single task is first received; then the single task is divided into at least two sequential subtasks, and each subtask is distributed to a different computing chip in the heterogeneous chip; finally, the different computing chips are controlled to sequentially process the subtasks in the first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the first-stage operation time of the first pipeline includes the execution time of a subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the computing chip can immediately start processing the subtasks of the next task after processing the subtasks of one task, which greatly reduces the waiting time of processing resources when processing tasks and improves the processing efficiency of hardware resources.
- FIG. 1 is a schematic flowchart of a task processing method for a heterogeneous chip provided by an embodiment of the present application
- FIG. 2 is a schematic diagram of task processing of a pipeline provided by an embodiment of the present application
- FIG. 3 is an example diagram of data interaction between an electronic device and a heterogeneous chip provided by an embodiment of the present application
- FIG. 4 is a schematic structural diagram of a task processing device for a heterogeneous chip provided by an embodiment of the present application
- Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
- the term “if” can be construed as “when”, “once”, “in response to determining”, or “in response to detecting”.
- depending on the context, the phrase “if determined” or “if [the described condition or event] is detected” can be interpreted as “once determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
- FIG. 1 shows a flowchart of a task processing method for a heterogeneous chip provided by an embodiment of the present application, and the details are as follows:
- Step 101 Receive an execution instruction of a single task
- the task processing method of the heterogeneous chip is applied to an electronic device, which has a general-purpose processor function.
- the general-purpose processor executes the steps of the above task processing method, it controls the processing tasks of each computing chip in the above heterogeneous chips.
- the above tasks include data to be processed (such as images, texts, videos) and programs for processing the data.
- the task execution instruction is issued or triggered by the user. When the user sends the execution instruction to the electronic device, the electronic device starts to execute the steps of the task processing method described above.
- Step 102 Divide the foregoing single task into at least two subtasks in a sequential order
- each subtask includes a part of the foregoing program, and the execution of each subtask has a sequence, and the foregoing sequence is the execution sequence of the foregoing program.
- Each subtask is executed in sequence to complete the processing of the data to be processed.
- the convolutional neural network includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, and a three-layer fully connected layer that are sequentially connected. After dividing the convolutional neural network, three subtasks are obtained.
- the first subtask includes processing the data to be processed through the first convolutional layer and the first pooling layer
- the second subtask includes processing the data to be processed through the second convolutional layer and the second pooling layer
- the third subtask includes processing the data to be processed through the three-layer fully connected layer; the first, second, and third subtasks above are executed in sequence to complete the processing of the data input to the convolutional neural network.
- the execution time of each subtask obtained after task division is equal. If the execution times of the subtasks are not equal after division, a delay should be added after the execution of each subtask so that the execution times of all subtasks, with delays included, become equal. Specifically, the delays may be added as follows: find the subtask with the longest execution time among the subtasks with unequal execution times, add no delay after that subtask, and add delays after the other subtasks so that their execution times become equal to that of the longest subtask.
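The delay-equalisation step above can be sketched as follows; the function name and the example times are illustrative, not taken from the application:

```python
def equalizing_delays(exec_times):
    """Return the delay to append after each subtask so that every
    padded subtask takes as long as the slowest one; the slowest
    subtask itself gets no added delay."""
    longest = max(exec_times)
    return [longest - t for t in exec_times]

# Example: subtasks taking 4, 7 and 5 time units get delays 3, 0 and 2.
delays = equalizing_delays([4, 7, 5])
```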
- step 102 specifically includes:
- the single task is divided into at least two subtasks in sequence.
- the hardware resource information of the aforementioned heterogeneous chips includes the number of computing chips.
- CNN network: convolutional neural network
- the number of determined subtasks is N, and m basic units need to be divided into N parts.
- the number of subtasks can be determined in the following way, taking the above CNN network as an example: first divide the CNN network into multiple units, each unit being one layer of the CNN network, i.e., the smallest indivisible unit of the network. From these units, select the one with the longest execution time, and calculate the ratio of the total execution time of the CNN network to the execution time of that unit. Then compare the number of computing chips in the above heterogeneous chip with this ratio: if the number of computing chips is greater than the ratio, the number of subtasks after division is the ratio rounded up; if the number of computing chips is less than the ratio, the number of subtasks after division is the number of computing chips.
- the number of subtasks obtained cannot be greater than the number of computing chips to ensure that each subtask is executed by one computing chip.
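The rule above can be sketched as follows, with per-layer execution times as hypothetical inputs (the application gives no concrete numbers here):

```python
import math

def num_subtasks(layer_times, num_chips):
    """Compare the chip count with the ratio of the network's total
    execution time to that of its slowest layer, per the rule above."""
    ratio = sum(layer_times) / max(layer_times)
    if num_chips > ratio:
        return math.ceil(ratio)   # enough chips: round the ratio up
    return num_chips              # otherwise one subtask per chip
```

With layer times [3, 1, 1, 1] and 4 chips the ratio is 2, so the task is divided into 2 subtasks; either branch yields no more subtasks than chips.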
- the output obtained from the execution of the previous subtask is sent to the next subtask as the input of the next subtask, and the input of the first subtask is the data to be processed included in the task.
- the output of the last subtask is the execution result of the last task.
- step A1 specifically includes:
- multiple division schemes are determined according to the execution time of the aforementioned single task and the number of computing chips in the aforementioned heterogeneous chip.
- the number of division schemes depends on the number of computing chips, and the number of subtasks produced by each division scheme can be determined first.
- the tasks are divided correspondingly according to the task division method in step A1.
- the above heterogeneous chip includes four computing chips, then the number of the above division schemes is three, and the three division schemes are respectively the first scheme: divide the task into 4 subtasks, and the second scheme: divide the task into 3 subtasks, the third scheme: divide the task into 2 subtasks.
- the resource conversion efficiency corresponding to each division scheme is calculated respectively.
- the electronic device can calculate the resource conversion efficiency based on the task's floating point calculation amount, throughput rate, and hardware resources occupied by the task.
- the resource conversion efficiency indicates the processing capability of the unit hardware resource in the task processing process of the heterogeneous chip, and reflects the degree of optimization of the hardware resource by the corresponding partitioning scheme. The greater the resource conversion efficiency, the stronger the processing capability of the unit hardware resource of the heterogeneous chip. Conversely, the lower the resource conversion efficiency, the weaker the processing capability of the unit hardware resource of the heterogeneous chip. Therefore, after calculating the resource conversion efficiency corresponding to each division scheme, compare the numerical value of the resource conversion efficiency corresponding to each division scheme, and use the division scheme with the highest resource conversion efficiency in each division scheme as the final division scheme.
- the resource conversion efficiency corresponding to the first scheme is 0.5
- the resource conversion efficiency corresponding to the second scheme is 0.6
- the resource conversion efficiency corresponding to the third scheme is 0.7
- the third scheme is selected as the final division scheme by comparing the size of resource conversion efficiency.
- step B2 specifically includes:
- η is the resource conversion efficiency
- P is the calculation amount of a single task
- ν is the throughput rate during the execution of a single task by the heterogeneous chip
- N is the number of computing chips.
- the above-mentioned throughput rate can be calculated from the number of subtasks corresponding to the division scheme and the running time of the subtasks, and the above-mentioned number of computing chips is the number of chips actually used, which is determined by the number of subtasks.
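This excerpt lists the symbols but not the preset formula itself; one plausible form consistent with "processing capability of unit hardware resource" is η = P·ν / N, which the sketch below assumes (all names are hypothetical):

```python
def resource_conversion_efficiency(p, nu, n_chips):
    """Assumed formula eta = P * nu / N: per-task calculation amount P
    times throughput rate nu, normalised by the chips actually used."""
    return p * nu / n_chips

def pick_by_efficiency(schemes):
    """Choose the (P, nu, N) division scheme with the greatest
    resource conversion efficiency."""
    return max(schemes, key=lambda s: resource_conversion_efficiency(*s))
```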
- step A1 specifically includes:
- multiple division schemes are determined according to the execution time of the aforementioned single task and the number of computing chips in the aforementioned heterogeneous chip.
- the number of division schemes depends on the number of computing chips, and the number of subtasks produced by each division scheme can be determined first.
- the tasks are divided correspondingly according to the task division method in step A1.
- the above heterogeneous chip includes four computing chips, then the number of the above division schemes is three, and the three division schemes are respectively the first scheme: divide the task into 4 subtasks, and the second scheme: divide the task into 3 subtasks, the third scheme: divide the task into 2 subtasks.
- the resource utilization rate corresponding to each division scheme is calculated respectively, and the above resource utilization rate indicates the hardware resource size occupied when the subtask runs on the corresponding single computing chip.
- the above filtering condition is that the resource utilization rate corresponding to the division scheme is greater than a preset resource utilization threshold; when a division scheme meets the screening condition, the performance of the computing chip can be fully utilized.
- the primary partitioning scheme with the largest number of subtasks is selected as the final partitioning scheme.
- the above-mentioned tasks are divided into at least two sub-tasks in sequence.
- the resource utilization rate corresponding to the above-mentioned first scheme is 50%
- the resource utilization rate corresponding to the second scheme is 70%
- the resource utilization rate corresponding to the third scheme is 80%.
- suppose the filter condition is that the resource utilization rate corresponding to a division scheme is greater than 60%; then the primary division schemes selected according to the screening condition are the second scheme and the third scheme. Since the second scheme corresponds to 3 subtasks and the third scheme to 2 subtasks, the second scheme is selected as the final division scheme, and according to it the task is divided into 3 subtasks.
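The screening-and-selection logic can be sketched as follows, using the scheme sizes from the four-chip example (4, 3 and 2 subtasks) and the utilization figures above; representing each scheme as a (subtask count, utilization) pair is an assumption for illustration:

```python
def final_scheme(schemes, threshold=0.60):
    """schemes: (num_subtasks, utilization) pairs. Keep those whose
    single-chip resource utilization exceeds the threshold, then take
    the surviving scheme with the most subtasks."""
    primary = [s for s in schemes if s[1] > threshold]
    return max(primary, key=lambda s: s[0])

# Utilizations 50%, 70%, 80% with a 60% threshold: the 4-subtask scheme
# is screened out, and the larger of the remaining two is chosen.
chosen = final_scheme([(4, 0.50), (3, 0.70), (2, 0.80)])
```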
- Step 103 Distribute each subtask to different computing chips in the heterogeneous chip
- each divided subtask needs to be sent to different computing chips for execution, and one computing chip executes one subtask.
- the above heterogeneous chip includes 5 computing chips, and the number of subtasks is 3; the subtasks are then sent to 3 of the 5 computing chips for execution, and the remaining 2 computing chips do not take part in subtask processing.
- the computing chips in the heterogeneous chips are connected in sequence through a bus. Taking the computing chip as an FPGA as an example, before distributing each subtask to different FPGAs, each subtask needs to be converted into a bitstream file, and then the bitstream file corresponding to each subtask is programmed to the corresponding FPGA.
- Step 104 Control the above-mentioned different computing chips to sequentially process the above-mentioned subtasks in the first pipeline mode.
- the number of stages of the first pipeline corresponding to the first pipeline mode is the same as the number of the subtasks
- the first-stage operation time of the first pipeline includes the execution time of one subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the operation time of each pipeline stage of the first pipeline is equal. It should be noted that the execution time of each subtask on its computing chip is equal, the time for transmitting the data corresponding to a subtask between adjacent computing chips is also equal, and the execution time of a subtask on a computing chip is greater than the time for transmitting its corresponding data between adjacent computing chips.
- each stage of the first pipeline includes the computing chip executing its corresponding subtask and transmitting the processing result of the subtask to the next computing chip. Denoting the execution time of a subtask by t m and the time for transmitting the corresponding data by t l , the operation time of one stage of the first pipeline is equal to t m + t l .
- the first task, the second task, and the third task are all convolutional neural networks, and the convolutional neural network includes a 2-layer convolutional layer, a 2-layer pooling layer, and a 1-layer fully connected layer.
- Conv+Pool represents a subtask that includes a convolutional layer and a pooling layer
- Fullyconn represents a subtask that includes a fully connected layer.
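The benefit of the first pipeline can be checked with a standard pipeline-timing identity (a textbook result, not a formula quoted from the application): N stages of length t m + t l finish k back-to-back tasks in (N + k - 1)(t m + t l), versus k·N·(t m + t l) when each task must finish completely before the next starts.

```python
def first_pipeline_makespan(n_stages, n_tasks, t_m, t_l):
    """Total time on a pipeline whose every stage computes for t_m and
    then transfers its result for t_l."""
    return (n_stages + n_tasks - 1) * (t_m + t_l)

def serial_makespan(n_stages, n_tasks, t_m, t_l):
    """Baseline without pipelining: tasks run strictly one after another."""
    return n_tasks * n_stages * (t_m + t_l)
```

For a setup like Figure 2(a), 2 subtasks and 3 tasks, pipelining takes 4 stage-times instead of 6.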
- step 104 specifically includes:
- the above-mentioned task processing method further includes:
- the execution of a subtask and the transmission of the data corresponding to the subtask between two adjacent computing chips are each used as one stage of the second pipeline corresponding to the second pipeline mode, and the single-stage operation time of the second pipeline is equal to the execution time of a subtask.
- in other words, the first pipeline takes the execution of a subtask together with the transmission of the corresponding data between two adjacent computing chips as a single stage, while the second pipeline takes the execution of a subtask and the transmission of the corresponding data between two adjacent computing chips as two separate stages.
- a stage of the above-mentioned second pipeline may therefore be either the execution of a subtask or the transmission of the data corresponding to a subtask between two adjacent computing chips. It should be noted that, to make the operation time of every stage of the second pipeline equal, a delay is added after the data corresponding to a subtask is transmitted between adjacent computing chips, so that the transmission time, with the delay included, equals the execution time of a subtask.
- the operation time of each stage of the second pipeline is equal, and each subtask-processing stage of the second pipeline processes one subtask. When there are multiple tasks to be processed, the second pipeline can process them in parallel: when the second stage of the second pipeline (the data transmission stage) starts transmitting data of the first task, the first stage starts to process the second task; when the third stage processes the first task, the first stage starts to process the third task, and so on.
- Figure 2 is drawn for explanation.
- Part (b) in Figure 2 is the process of the second pipeline processing three tasks in parallel, where t m is the execution time of the subtasks, and the operation time of the first stage of the second pipeline is equal to t m .
- the first task, the second task, and the third task are all convolutional neural networks, and the convolutional neural network includes a 2-layer convolutional layer, a 2-layer pooling layer, and a 1-layer fully connected layer.
- Conv+Pool represents a subtask that includes a convolutional layer and a pooling layer
- Fullyconn represents a subtask that includes a fully connected layer
- Latency is a data transmission pipeline stage.
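Under the same style of timing model, splitting compute and delay-padded transfer into separate stages can be sketched as follows, assuming N compute stages interleaved with N - 1 transfer stages, i.e. 2N - 1 stages of length t m (a stage count this excerpt does not state explicitly):

```python
def second_pipeline_makespan(n_subtasks, n_tasks, t_m):
    """Each compute or transfer step is its own pipeline stage;
    transfers are padded up to the subtask execution time t_m."""
    n_stages = 2 * n_subtasks - 1
    return (n_stages + n_tasks - 1) * t_m
```

With 2 subtasks, t m = 4 and t l = 1, a single task takes 12 in this mode against 10 in the first pipeline, but 100 tasks take 408 against 505: the split-stage mode pays off only once enough tasks are queued, which motivates the mode threshold described below.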
- the electronic device can be a ZYNQ 7035 series development board launched by Xilinx, and the computing chip can be a Field Programmable Gate Array (FPGA), which is not limited here.
- the ZYNQ 7035 series development board communicates with the host computer through the Ethernet port on the PS (Processing System) side.
- the RapidIO protocol is used, and the high-speed serial transceiver communicates with each FPGA.
- the user can send a task execution instruction to the ZYNQ 7035 series development board through the host computer. The ZYNQ 7035 series development board receives the execution instruction and controls each FPGA to perform the above task; when each FPGA finishes executing the task and obtains the processing result, the ZYNQ 7035 series development board receives the processing result and sends it to the host computer through the Ethernet port.
- the execution instruction of a single task is first received; then the single task is divided into at least two sequential subtasks, and each subtask is distributed to a different computing chip in the heterogeneous chip; finally, the different computing chips are controlled to sequentially process the subtasks in the first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the first-stage operation time of the first pipeline includes the execution time of a subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the computing chip can immediately start processing the subtasks of the next task after processing the subtasks of one task, which greatly reduces the waiting time of processing resources when processing tasks and improves the processing efficiency of hardware resources.
- FIG. 4 shows a schematic structural diagram of a task processing device for a heterogeneous chip provided by an embodiment of the present application.
- the task processing device for a heterogeneous chip can be applied to electronic equipment.
- For ease of description, only the parts related to the embodiments of the present application are shown.
- the task processing device 400 of the heterogeneous chip includes:
- the receiving module 401 is used to receive the execution instruction of a single task
- the dividing module 402 is used to divide the above single task into at least two subtasks in a sequential order;
- a distribution module 403, configured to distribute each subtask to different computing chips in the heterogeneous chip
- the processing module 404 is configured to control the above-mentioned different computing chips to sequentially process the above-mentioned subtasks in the first pipeline mode, wherein the number of stages of the first pipeline corresponding to the first pipeline mode is the same as the number of subtasks, and the first-stage operation time of the first pipeline includes the execution time of a subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the above-mentioned processing module 404 further includes:
- the mode threshold calculation unit is configured to calculate the mode threshold according to the number of subtasks of a single task, the operation time of the subtasks, and the time for transmitting the data corresponding to the subtasks between two adjacent computing chips;
- the first control unit is configured to control the different computing chips to sequentially process the subtasks in the first pipeline mode if the number of identical single tasks to be processed is less than or equal to the mode threshold.
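The application does not give the mode-threshold formula in this excerpt. Equating the two makespans (N + k - 1)(t m + t l) for the combined-stage mode and (2N - 2 + k)·t m for the split-stage mode yields one hypothetical threshold, k ≤ (N - 1)·t m / t l - N + 1, sketched below purely as an illustration:

```python
import math

def mode_threshold(n_subtasks, t_m, t_l):
    """Largest task count k for which the combined-stage (first) pipeline
    finishes no later than the split-stage (second) pipeline, under the
    assumed makespans (N + k - 1)(t_m + t_l) and (2N - 2 + k) * t_m."""
    return math.floor((n_subtasks - 1) * t_m / t_l - n_subtasks + 1)
```

With N = 2, t m = 4, t l = 1 the threshold is 3: at 3 queued tasks both modes take 20 time units, and from the 4th task on the split-stage mode is faster.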
- the task processing device 400 of the heterogeneous chip further includes:
- the second control unit is used to control the above-mentioned different computing chips to sequentially process the above-mentioned subtasks in a second pipeline mode, wherein the execution of one subtask and the transmission of data corresponding to the above-mentioned subtasks between two adjacent computing chips They are respectively used as the first stage of the second pipeline corresponding to the second pipeline mode, and the first stage operation time of the second pipeline is equal to the execution time of one subtask.
- the above-mentioned dividing module 402 further includes:
- the execution time dividing unit is configured to divide the single task into at least two sequential subtasks according to the execution time of the single task and the hardware resource information of the heterogeneous chip, wherein the output of the previous subtask is regarded as adjacent to it. The input of the next subtask.
- the foregoing execution time dividing unit further includes:
- the first scheme determining subunit is configured to determine at least one division scheme according to the execution time of the aforementioned single task and the hardware resource information of the aforementioned heterogeneous chip, and the aforementioned division scheme is a scheme of dividing a single task into at least two subtasks;
- the efficiency calculation subunit is used to calculate the resource conversion efficiency corresponding to each division scheme, where the above resource conversion efficiency indicates the data processing capability of the unit hardware resource of the heterogeneous chip;
- the first final plan determination subunit is used to select a corresponding division plan with the greatest resource conversion efficiency from the above at least one division plan as the final division plan;
- the first final division subunit is configured to divide the single task into at least two subtasks in sequence according to the final division scheme.
- the foregoing efficiency calculation subunit further includes:
- the efficiency formula calculation subunit is used to calculate the resource conversion efficiency corresponding to each division scheme according to the preset resource conversion efficiency formula.
- the foregoing execution time dividing unit further includes:
- the second scheme determining subunit is configured to determine at least one division scheme according to the execution time of the aforementioned single task and the hardware resource information of the aforementioned heterogeneous chip, and the aforementioned division scheme is a scheme of dividing a single task into at least two subtasks;
- the screening subunit is used to screen out, from the above at least one division scheme, at least one primary division scheme that meets a preset screening condition, where the screening condition is that the resource utilization rate corresponding to the division scheme is greater than a preset resource utilization threshold;
- the second final plan determination subunit is used to select the primary division plan with the largest number of subtasks from the at least one primary division plan as the final division plan;
- the second final division subunit is configured to divide the single task into at least two subtasks in sequence according to the final division scheme.
- the execution instruction of a single task is first received; then the single task is divided into at least two sequential subtasks, and each subtask is distributed to a different computing chip in the heterogeneous chip; finally, the different computing chips are controlled to sequentially process the subtasks in the first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the first-stage operation time of the first pipeline includes the execution time of a subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the computing chip can immediately start processing the subtasks of the next task after processing the subtasks of one task, which greatly reduces the waiting time of processing resources when processing tasks and improves the processing efficiency of hardware resources.
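The timing model described above can be sketched as follows; per-subtask execution and transfer times are hypothetical figures, not values from the patent:

```python
# Sketch of the first pipeline mode: one stage = one subtask's execution
# time plus the transfer of its output to the next computing chip, and
# the number of stages equals the number of subtasks. Numbers are
# illustrative assumptions.

def first_pipeline_total_time(exec_times, transfer_times, num_tasks):
    """Total time to push `num_tasks` identical tasks through the pipeline.

    exec_times[i]     -- execution time of subtask i on its computing chip
    transfer_times[i] -- time to move subtask i's output to the next chip
                         (the last subtask has no onward transfer)
    """
    # One stage of the first pipeline = execution + onward transfer.
    stage_times = [e + t for e, t in zip(exec_times, transfer_times)]
    cycle = max(stage_times)  # the slowest stage paces the pipeline
    # The first task fills every stage; each further task adds one cycle,
    # since a chip starts the next task's subtask as soon as it is free.
    return sum(stage_times) + (num_tasks - 1) * cycle

# Three subtasks, so the first pipeline has three stages.
exec_times = [4.0, 5.0, 3.0]      # hypothetical execution times
transfer_times = [1.0, 1.0, 0.0]  # last subtask: no onward transfer
print(first_pipeline_total_time(exec_times, transfer_times, num_tasks=10))
```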
- FIG. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
- the electronic device 5 of this embodiment includes: at least one processor 50 (only one is shown in FIG. 5), a memory 51, and a computer program 52 that is stored in the memory 51 and can be run on the at least one processor 50.
- when the processor 50 executes the computer program 52, the following steps are implemented:
- the foregoing controlling of the different computing chips to sequentially process the subtasks in the first pipeline mode includes: calculating a mode threshold according to the number of subtasks of a single task, the operation time of a subtask, and the time for transmitting the data corresponding to the subtask between two adjacent computing chips;
- if the number of identical single tasks is less than or equal to the mode threshold, the different computing chips are controlled to sequentially process the subtasks in the first pipeline mode.
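Claims 2 and 3 gate the choice between the two pipeline modes on a mode threshold computed from the subtask count, the subtask operation time, and the inter-chip transfer time, but this excerpt does not give the threshold formula itself. The sketch below derives one plausible break-even threshold by equating the two modes' total times under uniform stage times — an illustrative assumption, not the claimed formula:

```python
# Illustrative mode-threshold calculation. Assumptions (not from the
# patent): every subtask takes e time units, every inter-chip transfer
# takes t, and the second pipeline has 2*S - 1 stages of length e
# (execution and transfer are separate stages, transfers overlapped).
#   First mode total for k tasks:  (S + k - 1) * (e + t)
#   Second mode total for k tasks: (2*S + k - 2) * e
# Equating them gives the break-even task count used as the threshold.

def mode_threshold(num_subtasks, exec_time, transfer_time):
    s, e, t = num_subtasks, exec_time, transfer_time
    return (s - 1) * (e - t) / t

def choose_mode(num_tasks, num_subtasks, exec_time, transfer_time):
    k_star = mode_threshold(num_subtasks, exec_time, transfer_time)
    # At or below the threshold the coarse first pipeline is no slower;
    # above it, the finer-grained second pipeline wins.
    return "first" if num_tasks <= k_star else "second"

# 4 subtasks, exec 5.0, transfer 1.0 -> threshold = 3 * 4 = 12 tasks.
print(choose_mode(num_tasks=8, num_subtasks=4, exec_time=5.0, transfer_time=1.0))
print(choose_mode(num_tasks=20, num_subtasks=4, exec_time=5.0, transfer_time=1.0))
```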
- if the number of identical single tasks is greater than the mode threshold, the foregoing task processing method further includes: controlling the different computing chips to sequentially process the subtasks in a second pipeline mode, wherein the execution of one subtask and the transmission of its corresponding data between two adjacent computing chips each serve as one stage of the second pipeline, and the one-stage operation time of the second pipeline is equal to the execution time of one subtask.
- the foregoing dividing of the single task into at least two sequential subtasks includes:
- dividing the single task into at least two sequential subtasks, where the output of the previous subtask serves as the input of the adjacent next subtask.
- the foregoing dividing of the single task into at least two sequential subtasks based on the execution time of the single task and the hardware resource information of the heterogeneous chip includes: determining at least one division scheme, the division scheme being a scheme for dividing a single task into at least two subtasks; calculating the resource conversion efficiency corresponding to each division scheme; and selecting the division scheme with the largest corresponding resource conversion efficiency as the final division scheme;
- according to the final division scheme, the above-mentioned single task is divided into at least two sequential subtasks.
- the foregoing calculation of the resource conversion efficiency corresponding to each division scheme includes: calculating the resource conversion efficiency corresponding to each division scheme according to a preset resource conversion efficiency formula γ = Pβ/N, where γ is the resource conversion efficiency, P is the computation amount of a single task, β is the throughput of the heterogeneous chip during execution of a single task, and N is the number of computing chips.
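Claim 6 defines the resource conversion efficiency as γ = Pβ/N (P: computation amount of the single task, β: throughput while executing it, N: number of computing chips). A sketch of ranking candidate schemes by γ, with hypothetical figures:

```python
# Resource conversion efficiency per the preset formula gamma = P * beta / N:
# the data processing capability per unit hardware resource. The candidate
# scheme values below are illustrative assumptions.

def conversion_efficiency(p, beta, n):
    return p * beta / n

# Each candidate division scheme uses a different number of computing
# chips and achieves a different throughput for the same task (P fixed).
candidates = [
    {"num_chips": 2, "throughput": 1.5},  # gamma = 100 * 1.5 / 2 = 75
    {"num_chips": 3, "throughput": 2.7},  # gamma = 100 * 2.7 / 3 = 90
    {"num_chips": 4, "throughput": 3.0},  # gamma = 100 * 3.0 / 4 = 75
]
P = 100.0  # computation amount of the single task (hypothetical units)
best = max(candidates,
           key=lambda c: conversion_efficiency(P, c["throughput"], c["num_chips"]))
print(best["num_chips"])  # the 3-chip scheme has the largest gamma
```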
- the foregoing dividing of the single task into at least two sequential subtasks based on the execution time of the single task and the hardware resource information of the heterogeneous chip includes: determining at least one division scheme; filtering out at least one preliminary division scheme whose corresponding resource utilization rate is greater than a preset resource utilization threshold; and selecting the preliminary division scheme with the largest number of subtasks as the final division scheme;
- according to the final division scheme, the above-mentioned single task is divided into at least two sequential subtasks.
- the electronic device may include, but is not limited to, a processor 50 and a memory 51.
- FIG. 5 is only an example of the electronic device 5 and does not constitute a limitation on the electronic device 5; it may include more or fewer components than shown in the figure, a combination of certain components, or different components; for example, it may also include input and output devices, network access devices, and so on.
- the so-called processor 50 may be a central processing unit (Central Processing Unit, CPU); the processor 50 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
- the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
- the above-mentioned memory 51 may be an internal storage unit of the above-mentioned electronic device 5 in some embodiments, for example, a hard disk or a memory of the electronic device 5.
- the above-mentioned memory 51 may also be an external storage device of the above-mentioned electronic device 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the above-mentioned electronic device 5.
- the aforementioned memory 51 may also include both an internal storage unit of the aforementioned electronic device 5 and an external storage device.
- the above-mentioned memory 51 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as the program code of the above-mentioned computer program.
- the aforementioned memory 51 can also be used to temporarily store data that has been output or will be output.
- the execution instruction of a single task is first received; then the single task is divided into at least two sequential subtasks, and each subtask is distributed to a different computing chip in the heterogeneous chip; finally, the different computing chips are controlled to sequentially process the subtasks in the first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the one-stage operation time of the first pipeline includes the execution time of one subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- the computing chip can immediately start processing the subtasks of the next task after processing the subtasks of one task, which greatly reduces the waiting time of processing resources when processing tasks and improves the processing efficiency of hardware resources.
- the embodiments of the present application also provide a computer-readable storage medium.
- the above-mentioned computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps in the above-mentioned method embodiments can be realized.
- the embodiments of the present application provide a computer program product.
- when the computer program product runs on an electronic device, the electronic device can realize the steps in the foregoing method embodiments upon executing it.
- if the above integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
- this application implements all or part of the processes in the above-mentioned method embodiments, which can be completed by instructing relevant hardware through a computer program.
- the above-mentioned computer program may be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the foregoing method embodiments can be implemented.
- the above-mentioned computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, or some intermediate form.
- the above-mentioned computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the task processing apparatus/electronic device of the heterogeneous chip, a recording medium, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a floppy disk, or an optical disc.
- in some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals and telecommunication signals.
- the disclosed apparatus/network equipment and method may be implemented in other ways.
- the device/network device embodiments described above are merely illustrative.
- the division of the above-mentioned modules or units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Multi Processors (AREA)
- Design And Manufacture Of Integrated Circuits (AREA)
Abstract
Description
Claims (10)
- A task processing method for a heterogeneous chip, characterized by comprising: receiving an execution instruction of a single task; dividing the single task into at least two sequential subtasks; distributing the subtasks to different computing chips in the heterogeneous chip; and controlling the different computing chips to sequentially process the subtasks in a first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the one-stage operation time of the first pipeline includes the execution time of one subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- The task processing method according to claim 1, wherein, if execution instructions of at least two identical single tasks are received, the controlling the different computing chips to sequentially process the subtasks in the first pipeline mode comprises: calculating a mode threshold according to the number of subtasks of a single task, the operation time of a subtask, and the time for transmitting the data corresponding to the subtask between two adjacent computing chips; and, if the number of the identical single tasks is less than or equal to the mode threshold, controlling the different computing chips to sequentially process the subtasks in the first pipeline mode.
- The task processing method according to claim 2, wherein, if the number of the identical single tasks is greater than the mode threshold, the task processing method further comprises: controlling the different computing chips to sequentially process the subtasks in a second pipeline mode, wherein the execution of one subtask and the transmission of the data corresponding to the subtask between two adjacent computing chips each serve as one stage of a second pipeline corresponding to the second pipeline mode, and the one-stage operation time of the second pipeline is equal to the execution time of one subtask.
- The task processing method according to claim 1, wherein the dividing the single task into at least two sequential subtasks comprises: dividing the single task into at least two sequential subtasks according to the execution time of the single task and hardware resource information of the heterogeneous chip, wherein the output of a previous subtask serves as the input of the adjacent next subtask.
- The task processing method according to claim 4, wherein the dividing the single task into at least two sequential subtasks according to the execution time of the single task and the hardware resource information of the heterogeneous chip comprises: determining at least one division scheme according to the execution time of the single task and the hardware resource information of the heterogeneous chip, the division scheme being a scheme for dividing a single task into at least two subtasks; calculating the resource conversion efficiency corresponding to each division scheme, wherein the resource conversion efficiency indicates the data processing capability per unit hardware resource of the heterogeneous chip; selecting, from the at least one division scheme, the division scheme with the largest corresponding resource conversion efficiency as the final division scheme; and dividing the single task into at least two sequential subtasks according to the final division scheme.
- The task processing method according to claim 5, wherein the calculating the resource conversion efficiency corresponding to each division scheme comprises: calculating the resource conversion efficiency corresponding to each division scheme according to a preset resource conversion efficiency formula γ = Pβ/N, where γ is the resource conversion efficiency, P is the computation amount of a single task, β is the throughput of the heterogeneous chip during execution of a single task, and N is the number of computing chips.
- The task processing method according to claim 4, wherein the dividing the single task into at least two sequential subtasks according to the execution time of the single task and the hardware resource information of the heterogeneous chip comprises: determining at least one division scheme according to the execution time of the single task and the hardware resource information of the heterogeneous chip, the division scheme being a scheme for dividing a single task into at least two subtasks; screening out, from the at least one division scheme, at least one preliminary division scheme that meets a preset screening condition, wherein the screening condition is that the resource utilization rate corresponding to a division scheme is greater than a preset resource utilization threshold; selecting, from the at least one preliminary division scheme, the preliminary division scheme with the largest number of subtasks as the final division scheme; and dividing the single task into at least two sequential subtasks according to the final division scheme.
- A task processing apparatus for a heterogeneous chip, characterized by comprising: a receiving module configured to receive an execution instruction of a single task; a division module configured to divide the single task into at least two sequential subtasks; a distribution module configured to distribute the subtasks to different computing chips in the heterogeneous chip; and a processing module configured to control the different computing chips to sequentially process the subtasks in a first pipeline mode, wherein the number of first pipeline stages corresponding to the first pipeline mode is the same as the number of subtasks, and the one-stage operation time of the first pipeline includes the execution time of one subtask and the time for transmitting the data corresponding to the subtask between two adjacent computing chips.
- An electronic device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
- A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911142365 | 2019-11-20 | ||
CN201911142365.7 | 2019-11-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021097962A1 true WO2021097962A1 (zh) | 2021-05-27 |
Family
ID=70517887
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/124350 WO2021097962A1 (zh) | 2019-11-20 | 2019-12-10 | 一种异构芯片的任务处理方法、任务处理装置及电子设备 |
PCT/CN2020/129492 WO2021115052A1 (zh) | 2019-11-20 | 2020-11-17 | 一种异构芯片的任务处理方法、任务处理装置及电子设备 |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/129492 WO2021115052A1 (zh) | 2019-11-20 | 2020-11-17 | 一种异构芯片的任务处理方法、任务处理装置及电子设备 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111142938B (zh) |
WO (2) | WO2021097962A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114741202A (zh) * | 2022-04-27 | 2022-07-12 | 苏州浪潮智能科技有限公司 | 一种fpga设备的算法配置方法、装置、设备及存储介质 |
CN115549854A (zh) * | 2021-06-30 | 2022-12-30 | 上海寒武纪信息科技有限公司 | 循环冗余校验方法、装置、存储介质以及电子设备 |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021097962A1 (zh) * | 2019-11-20 | 2021-05-27 | 深圳先进技术研究院 | 一种异构芯片的任务处理方法、任务处理装置及电子设备 |
CN113742089B (zh) * | 2021-11-04 | 2022-02-18 | 苏州浪潮智能科技有限公司 | 异构资源中神经网络计算任务的分配方法、装置和设备 |
CN115016847B (zh) * | 2022-08-08 | 2022-12-20 | 沐曦集成电路(上海)有限公司 | 提高流水线吞吐的方法、装置及电子设备 |
CN115712499A (zh) * | 2022-11-09 | 2023-02-24 | 北京城建设计发展集团股份有限公司 | 一种轨交业务ai芯片驱动任务处理方法及系统 |
CN116187399B (zh) * | 2023-05-04 | 2023-06-23 | 北京麟卓信息科技有限公司 | 一种基于异构芯片的深度学习模型计算误差定位方法 |
CN116382880B (zh) * | 2023-06-07 | 2023-08-11 | 成都登临科技有限公司 | 任务执行方法、装置、处理器、电子设备及存储介质 |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101339523A (zh) * | 2007-07-05 | 2009-01-07 | 国际商业机器公司 | 多处理器环境中的流水线处理方法和设备 |
CN103810137A (zh) * | 2014-01-07 | 2014-05-21 | 南京大学 | 一种基于多fpga平台的ncs算法并行化的方法 |
CN103838552A (zh) * | 2014-03-18 | 2014-06-04 | 北京邮电大学 | 4g宽带通信系统多核并行流水线信号的处理系统和方法 |
US20140317380A1 (en) * | 2013-04-18 | 2014-10-23 | Denso Corporation | Multi-core processor |
CN104615413A (zh) * | 2015-02-13 | 2015-05-13 | 赛诺威盛科技(北京)有限公司 | 一种流水线任务自适应并行方法 |
CN108984283A (zh) * | 2018-06-25 | 2018-12-11 | 复旦大学 | 一种自适应的动态流水线并行方法 |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9122523B2 (en) * | 2012-05-03 | 2015-09-01 | Nec Laboratories America, Inc. | Automatic pipelining framework for heterogeneous parallel computing systems |
CN104866460B (zh) * | 2015-06-04 | 2017-10-10 | 电子科技大学 | 一种基于SoC的容错自适应可重构系统与方法 |
CN106227591B (zh) * | 2016-08-05 | 2019-10-25 | 中国科学院计算技术研究所 | 在异构多核片上系统上进行无线通信调度的方法和装置 |
CN108205465B (zh) * | 2016-12-20 | 2021-06-15 | 北京中科晶上科技股份有限公司 | 流式应用程序的任务动态调度方法和装置 |
US10795729B2 (en) * | 2018-04-28 | 2020-10-06 | Cambricon Technologies Corporation Limited | Data accelerated processing system |
CN109857562A (zh) * | 2019-02-13 | 2019-06-07 | 北京理工大学 | 一种众核处理器上访存距离优化的方法 |
WO2021097962A1 (zh) * | 2019-11-20 | 2021-05-27 | 深圳先进技术研究院 | 一种异构芯片的任务处理方法、任务处理装置及电子设备 |
-
2019
- 2019-12-10 WO PCT/CN2019/124350 patent/WO2021097962A1/zh active Application Filing
- 2019-12-10 CN CN201911260085.6A patent/CN111142938B/zh active Active
-
2020
- 2020-11-17 WO PCT/CN2020/129492 patent/WO2021115052A1/zh active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101339523A (zh) * | 2007-07-05 | 2009-01-07 | 国际商业机器公司 | 多处理器环境中的流水线处理方法和设备 |
US20140317380A1 (en) * | 2013-04-18 | 2014-10-23 | Denso Corporation | Multi-core processor |
CN103810137A (zh) * | 2014-01-07 | 2014-05-21 | 南京大学 | 一种基于多fpga平台的ncs算法并行化的方法 |
CN103838552A (zh) * | 2014-03-18 | 2014-06-04 | 北京邮电大学 | 4g宽带通信系统多核并行流水线信号的处理系统和方法 |
CN104615413A (zh) * | 2015-02-13 | 2015-05-13 | 赛诺威盛科技(北京)有限公司 | 一种流水线任务自适应并行方法 |
CN108984283A (zh) * | 2018-06-25 | 2018-12-11 | 复旦大学 | 一种自适应的动态流水线并行方法 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115549854A (zh) * | 2021-06-30 | 2022-12-30 | 上海寒武纪信息科技有限公司 | 循环冗余校验方法、装置、存储介质以及电子设备 |
CN114741202A (zh) * | 2022-04-27 | 2022-07-12 | 苏州浪潮智能科技有限公司 | 一种fpga设备的算法配置方法、装置、设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN111142938B (zh) | 2023-07-07 |
CN111142938A (zh) | 2020-05-12 |
WO2021115052A1 (zh) | 2021-06-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021115052A1 (zh) | 一种异构芯片的任务处理方法、任务处理装置及电子设备 | |
US11836524B2 (en) | Memory interface for a multi-threaded, self-scheduling reconfigurable computing fabric | |
US11868163B2 (en) | Efficient loop execution for a multi-threaded, self-scheduling reconfigurable computing fabric | |
US11675598B2 (en) | Loop execution control for a multi-threaded, self-scheduling reconfigurable computing fabric using a reenter queue | |
US11915057B2 (en) | Computational partition for a multi-threaded, self-scheduling reconfigurable computing fabric | |
US11675734B2 (en) | Loop thread order execution control of a multi-threaded, self-scheduling reconfigurable computing fabric | |
US11573796B2 (en) | Conditional branching control for a multi-threaded, self-scheduling reconfigurable computing fabric | |
US11531543B2 (en) | Backpressure control using a stop signal for a multi-threaded, self-scheduling reconfigurable computing fabric | |
US20210255864A1 (en) | Multiple Types of Thread Identifiers for a Multi-Threaded, Self-Scheduling Reconfigurable Computing Fabric | |
US11635959B2 (en) | Execution control of a multi-threaded, self-scheduling reconfigurable computing fabric | |
JP2020537784A (ja) | ニューラルネットワークアクセラレーションのための機械学習ランタイムライブラリ | |
US11048656B2 (en) | Multi-threaded, self-scheduling reconfigurable computing fabric | |
TWI827792B (zh) | 多路徑神經網路、資源配置的方法及多路徑神經網路分析器 | |
AU2014203218B2 (en) | Memory configuration for inter-processor communication in an MPSoC | |
WO2005098623A2 (en) | Prerequisite-based scheduler | |
WO2021249192A1 (zh) | 图像处理方法及装置、机器视觉设备、电子设备和计算机可读存储介质 | |
US11061654B1 (en) | Synchronization of concurrent computation engines | |
WO2020156212A1 (zh) | 一种数据处理的方法、装置及电子设备 | |
CN113504893B (zh) | 一种智能芯片架构和高效处理数据的方法 | |
WO2022141321A1 (zh) | Dsp处理器及其并行计算方法 | |
CN115729704A (zh) | 算力资源分配方法、装置及计算机可读存储介质 | |
CN118504630A (zh) | 基于fpga上的定制指令和dma的神经网络加速器架构 | |
Schumacher et al. | IMORC: an infrastructure for performance monitoring and optimization of reconfigurable computers | |
CN107844442A (zh) | 请求源响应的仲裁方法及装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19952971 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19952971 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 110123) |
|