CN114691311A - Method, device and computer program product for executing asynchronous task


Info

Publication number
CN114691311A
CN114691311A (application CN202011610670.7A)
Authority
CN
China
Prior art keywords
task, sub-task, executed, queue
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN202011610670.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority application: CN202011610670.7A
PCT application: PCT/CN2021/138702 (published as WO2022143194A1)
Publication: CN114691311A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/48, 9/4806, 9/4843, 9/4881 — Program initiating or switching; task transfer initiation or dispatching by program (task dispatcher, supervisor, operating system); scheduling strategies for the dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/50, 9/5005, 9/5027 — Allocation of resources, e.g. of the CPU, to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/54, 9/546 — Interprogram communication; message passing systems or structures, e.g. queues
    • G06F 2209/54, 2209/548 — Indexing scheme relating to G06F 9/54: queues

Abstract

The present disclosure provides a method and apparatus for executing asynchronous tasks that may be implemented in a computing device, where the computing device may be included in a combined processing device that may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.

Description

Method, device and computer program product for executing asynchronous task
Technical Field
The present disclosure relates to the field of computers, and more particularly, to serial and parallel execution of tasks.
Background
In current deep network training, in order to accelerate the convergence of network training, some or even all training tasks (including computation tasks, communication tasks, control-logic tasks, etc.) are usually issued to a dedicated acceleration chip (such as a GPU, MLU, or TPU) for execution.
Network training tasks are issued asynchronously by the CPU to the accelerator card for execution. The accelerator card maintains task queues: tasks in the same queue are executed serially in issue order, so tasks in the same queue can depend on one another, while tasks in different queues can execute concurrently as hardware resources allow. At present, a training task is usually issued to a single queue for execution, which inevitably limits the execution efficiency of the task.
Disclosure of Invention
One purpose of the present disclosure is to overcome the defects of the prior art, namely that communication or computation resources cannot be fully utilized and that fault tolerance is low.
According to a first aspect of the present disclosure, there is provided a method of executing an asynchronous task, comprising: dividing a total task in a task queue into a plurality of sub-tasks, wherein each sub-task is in a different sub-task queue; executing the plurality of sub-tasks in parallel; and in response to all of the sub-tasks being completely executed, completing the total task.
According to a second aspect of the present disclosure, there is provided an apparatus for executing an asynchronous task, comprising: a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, wherein each sub-task is in a different sub-task queue; a sub-task execution unit configured to execute the plurality of sub-tasks in parallel; and an ending unit configured to, in response to all of the sub-tasks being completely executed, complete the total task.
According to a third aspect of the present disclosure, there is provided a chip comprising the apparatus as described above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the chip as described above.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
The technical scheme of the present disclosure can distribute one total task across different sub-task queues, thereby accelerating its execution. Moreover, even if an execution error occurs in one sub-task queue, not all sub-tasks need to be re-executed, which reduces the fault-tolerance or retransmission cost, lightens the task-execution burden, and allows fault tolerance or retransmission to be handled without the user perceiving it.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1a illustrates a flow diagram of a method of performing an asynchronous task according to one embodiment of the present disclosure;
FIG. 1b shows a schematic diagram of a task issue queue and a task execution queue, according to one embodiment of the present disclosure;
FIG. 2 illustrates reorganizing a serial instruction queue into a plurality of parallel instruction queues corresponding to parallel modules according to one embodiment of the present disclosure;
FIG. 2a illustrates a flow diagram of dividing a total task in a task queue into a plurality of sub-tasks according to one embodiment of the present disclosure;
FIG. 2b shows a schematic diagram of inserting an identification in a queue according to one embodiment of the present disclosure;
FIG. 3 shows a queue diagram according to another embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of modifying the inserted second wait identifier, according to one embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of an apparatus to perform asynchronous tasks according to one embodiment of the present disclosure;
FIG. 6 illustrates a combined processing device;
FIG. 7 illustrates an exemplary board card.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments that a person skilled in the art can derive from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
Currently, mainstream frameworks (such as TensorFlow and PyTorch) use only one dedicated communication queue (comm_queue) to execute communication tasks. When the communication library responsible for communication tasks (for example NCCL, which handles communication between GPUs) obtains a task, it usually issues the task directly to the framework's comm_queue, or to an internal queue (internal_queue) inside the communication library, for execution. Communication tasks are thus executed in a single queue, and when a communication task errs it must be re-executed from the beginning, which lowers overall communication efficiency.
In the present disclosure, communication and computation tasks are usually issued as asynchronous tasks to different task queues on an acceleration chip (e.g., a GPU or MLU) for execution; asynchronous tasks in the same queue are executed serially in the order the tasks were issued, while tasks in different queues may execute concurrently.
It should be understood that, although a communication task is taken as the example, the task is not limited to communication and may also be any of various tasks such as neural network computation or training.
FIG. 1a illustrates a flow diagram of a method of performing an asynchronous task according to one embodiment of the present disclosure; FIG. 1b shows a schematic diagram of a task issue queue and a task execution queue according to one embodiment of the present disclosure.
As shown in fig. 1a, the method of the present disclosure includes: in operation S110, dividing a total task in a task queue into a plurality of sub-tasks, where each sub-task is placed in a different sub-task queue; in operation S120, executing the plurality of sub-tasks in parallel; and in operation S130, in response to all of the sub-tasks being completely executed, completing the total task.
The above method is described in detail below with reference to fig. 1 b.
Fig. 1b shows two types of queues: a task allocation queue LQ and task execution queues PQ. The task allocation queue can receive a plurality of tasks, such as tasks A, B, and C. These tasks are serialized on entering the task allocation queue LQ and execute in the order A, B, C: while task A executes, tasks B and C must wait; task B can execute only after task A has finished, and task C only after task B has finished. Such an execution scheme cannot fully utilize the system's parallel resources; in particular, when one task runs especially long or its communication data volume is especially large, the execution of the other tasks is visibly blocked and system performance suffers.
Each task in the task allocation queue LQ may be regarded as a total task, which is divided into a plurality of sub-tasks that are placed in the task execution queues PQ for parallel execution. Dividing one total task into multiple sub-tasks executed in parallel can markedly improve execution efficiency.
In the present disclosure, taking total task B as the example, B may be divided into a plurality of sub-tasks b1, b2, and so on; here two sub-tasks b1 and b2 are used for explanation. Note that the number of sub-tasks may differ: it depends on the execution capacity available for the sub-tasks and/or on the size of the total task. For example, if each sub-task queue has strong execution capacity, the total task can be divided into fewer sub-tasks; with equal execution capacity, a larger total task can be divided into more sub-tasks.
After the total task B is divided into sub-tasks b1 and b2, which are placed in different execution queues PQ1 and PQ2 respectively, the two sub-tasks can be executed in parallel in PQ1 and PQ2.
The execution of the total task B and its sub-tasks must satisfy the following rules: 1. when the total task B has not started executing, sub-tasks b1 and b2 must also not have started; 2. when the total task B starts executing, sub-tasks b1 and b2 also start executing; 3. tasks after B in the task allocation queue LQ (for example, task C) must wait until B has finished before they can execute; 4. the total task B is complete only when both b1 and b2 are complete.
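To make these rules concrete, the following is a minimal sketch, assuming host-side std::thread workers stand in for the hardware execution queues PQ1 and PQ2 and a made-up summation payload plays the total task; it is an illustration, not the patent's implementation:

    // Minimal sketch (illustration only): two sub-tasks of a total task run on
    // separate threads standing in for execution queues PQ1 and PQ2.
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        std::vector<int> data(100, 1);   // total task B: sum 100 elements
        long s1 = 0, s2 = 0;             // partial results of sub-tasks b1 and b2

        // Rule 2: starting B starts both sub-tasks, each in its own queue.
        std::thread pq1([&] { s1 = std::accumulate(data.begin(), data.begin() + 50, 0L); });
        std::thread pq2([&] { s2 = std::accumulate(data.begin() + 50, data.end(), 0L); });

        // Rule 4: B counts as finished only after both b1 and b2 have finished.
        pq1.join();
        pq2.join();
        std::cout << "total task B result: " << s1 + s2 << "\n";  // prints 100
    }

Rules 1 and 3, which gate when B and its successors may start, are what the identifier mechanism described next provides.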
FIG. 2a shows a flowchart for dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure.
Thus, according to an embodiment of the present disclosure, operation S110 of dividing a total task in a task queue into a plurality of sub-tasks includes: in operation S1110, inserting into the total task queue a first write identifier that allows the total task to start executing; in operation S1120, inserting into each sub-task queue a first wait identifier that prohibits the sub-task from starting to execute; and, in operation S1130, while the first write identifier has not been executed, executing the first wait identifier so as to prohibit the sub-tasks from starting to execute.
Fig. 2b shows a schematic diagram of inserting an identification in a queue according to one embodiment of the present disclosure. The embodiment of fig. 2a is described in detail below in conjunction with fig. 2 b.
First, in order to control the execution of task B, a write identifier, exemplarily denoted F0 here, is inserted before the task to be executed; the subsequent task B starts executing only when the write identifier F0 has been executed, or when F0 is changed so as to allow the next task to execute. If the write identifier F0 has not been executed, the corresponding task does not start. The write identifier may be inserted by an atomic operation, i.e., an operation that cannot be interrupted by the thread-scheduling mechanism: once started, it runs to completion without any context switch in between.
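As a concrete illustration of this gating, the sketch below is a host-side analogy under stated assumptions: a std::atomic<bool> plays the single shared flag that F0 sets and that each sub-task's wait identifier f0 polls; the patent itself targets flags in device-side queues, and the name sub_task is hypothetical:

    // Sketch: one shared atomic flag, written by the write identifier F0 and
    // polled by the wait identifiers f0 placed before each sub-task.
    #include <atomic>
    #include <iostream>
    #include <thread>

    std::atomic<bool> flag_F0{false};  // F0 writes it; every f0 reads it

    void sub_task(int id) {
        // The wait identifier f0: hold the sub-task until F0 has executed.
        while (!flag_F0.load(std::memory_order_acquire))
            std::this_thread::yield();
        std::cout << "sub-task b" << id << " starts\n";
    }

    int main() {
        std::thread b1(sub_task, 1), b2(sub_task, 2);
        // ... tasks ahead of B in the allocation queue LQ would run here ...
        // Executing the write identifier F0 releases both sub-tasks at once.
        flag_F0.store(true, std::memory_order_release);
        b1.join();
        b2.join();
    }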
Accordingly, a wait identifier f0 may be inserted before each sub-task; it indicates that the sub-task following it is prohibited from executing. It should be understood that, although the first write identifier F0 and the wait identifier f0 in fig. 2b carry different names, they point to the same flag: both detect whether that same flag changes.
According to one embodiment of the present disclosure, executing the plurality of sub-tasks in parallel comprises: in response to the first write identifier being executed, turning off the first wait identifier, thereby executing the plurality of sub-tasks in parallel.
The write identifier F0 before the total task and the wait identifier f0 before each sub-task are thus associated: only when F0 allows the subsequent total task to execute does the wait on f0 end and the corresponding sub-task start; if F0 does not allow the subsequent total task to execute, f0 likewise keeps the sub-tasks in a waiting state.
Fig. 3 shows a queue diagram according to another embodiment of the present disclosure.
According to one embodiment of the present disclosure, a second wait identifier may be inserted in the total task queue to prohibit other tasks after the total task from being executed.
As shown in fig. 3, the second wait identifier may be inserted after the first write identifier in the total task queue. When the second wait identifier is executed, the total tasks behind the current total task must enter a waiting state: until the current total task has finished executing, they cannot start.
As can be seen from the above, when the first write identifier F0 in the allocation queue is executed, the total task B corresponding to F0 starts executing, i.e., its sub-tasks b1 and b2 leave the waiting state and begin execution. Thereafter, when the second wait identifier F1 in the allocation queue is executed, the tasks after the total task B in the allocation queue enter a waiting state and are not executed while B executes.
FIG. 4 shows a schematic diagram of modifying the inserted second wait identifier according to one embodiment of the present disclosure.
According to an embodiment of the present disclosure, each time one sub-task finishes executing, the second wait identifier F1 is modified, until all sub-tasks have finished; and in response to all sub-tasks having finished executing, the second wait identifier F1 is changed into an end-of-wait identifier, whereby the total task is completely executed.
Next, as shown in fig. 4, the sub-tasks b1 and b2 start executing in the execution queues PQ, and each time one of them completes, the second wait identifier F1 is modified accordingly, for example incremented by one. The number of times F1 is modified thus equals the number of sub-tasks that have completed. F1 may therefore be given a target value at the outset; as b1 and b2 complete, F1 approaches this target, and when F1 reaches the preset target value, all sub-tasks have completed. It should be understood that F1 can be modified in many ways, not only by the "add one" described above; for example, it could be decremented on each completion until F1 falls below a predetermined threshold. The present disclosure places no limit on how the second wait identifier is modified.
The above-mentioned "second wait identifier F1 reaches the target value" can also be understood as an end-of-wait identifier: it means the current total task B has finished executing and other tasks can start.
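As a concrete illustration of this counting scheme, here is a minimal host-side sketch under the same analogy as before: an atomic counter plays F1, the target value is assumed equal to the number of sub-tasks, and task C's position behind B is modeled by a polling loop:

    // Sketch: an atomic counter as the second wait identifier F1; it is
    // "added one" per completed sub-task until it reaches the preset target.
    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        constexpr int kTarget = 2;    // total task B was divided into 2 sub-tasks
        std::atomic<int> flag_F1{0};  // modified once per completed sub-task

        auto sub_task = [&flag_F1] {
            // ... perform this sub-task's share of the total task B ...
            flag_F1.fetch_add(1, std::memory_order_acq_rel);
        };
        std::vector<std::thread> pq;
        for (int i = 0; i < kTarget; ++i) pq.emplace_back(sub_task);

        // Task C, queued after B, waits until F1 reaches the target value.
        while (flag_F1.load(std::memory_order_acquire) < kTarget)
            std::this_thread::yield();
        std::cout << "total task B complete; task C may start\n";
        for (auto& t : pq) t.join();
    }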
There are many ways to divide a total task into a plurality of sub-tasks: the total task may be divided randomly; it may be divided into a fixed number of sub-tasks; or it may be divided into a number of sub-tasks corresponding to the number of processors responsible for the execution queues PQ, and so on.
According to a preferred embodiment of the present disclosure, a total task in the task queue may be divided into a plurality of sub-tasks with equivalent execution times.
Equivalent execution times do not mean that each sub-task is itself the same size. For example, for 100 MB of data to be computed with 4 processing cores participating, each core could in theory take 25 MB, so that the 4 cores finish in the same time and the total computation time is minimized. If, however, one core is also occupied with other work and therefore has lower available capacity than the others, the tasks should be allocated according to the cores' respective capacities, so that every core finishes in the same or nearly the same time; this shortens the overall running time of the total task. The principle for dividing a total task into sub-tasks is therefore to divide according to each resource's capacity to execute the task, so that the resources become equivalent in processing time.
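A sketch of such capability-proportional division follows; the helper name divide_by_capability and the throughput weights are assumptions for illustration, not names or values from the disclosure:

    // Sketch: chunk sizes proportional to each core's available throughput,
    // so that all cores finish at roughly the same time.
    #include <cstddef>
    #include <iostream>
    #include <numeric>
    #include <vector>

    std::vector<long> divide_by_capability(long total, const std::vector<double>& tput) {
        double sum = std::accumulate(tput.begin(), tput.end(), 0.0);
        std::vector<long> chunks;
        long assigned = 0;
        for (std::size_t i = 0; i + 1 < tput.size(); ++i) {
            long c = static_cast<long>(total * tput[i] / sum);
            chunks.push_back(c);
            assigned += c;
        }
        chunks.push_back(total - assigned);  // remainder goes to the last core
        return chunks;
    }

    int main() {
        // 100 MB across 4 cores, one of which is half-busy with other work.
        for (long c : divide_by_capability(100, {1.0, 1.0, 1.0, 0.5}))
            std::cout << c << " ";  // prints: 28 28 28 16
    }

With equal weights this reduces to the even 25 MB-per-core split described above.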
According to one embodiment of the disclosure, the total task is divided into a plurality of sub-tasks in response to the data amount of the total task exceeding a certain threshold. It should be understood that the decision to divide must also consider the total amount of data involved: if a task involves little data and the processing time for the total task is already less than the time needed to transmit the data its execution produces, there is no need to divide it. Similarly, if the time to read the data the total task requires constitutes the bottleneck, i.e., reading the data takes longer than executing the total task, there is no need to divide the total task further.
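This decision rule can be stated compactly. In the sketch below, the 4 MB threshold and the helper name should_split are illustrative assumptions; the comparison against I/O time encodes the two exceptions above:

    #include <cstddef>

    // Assumed threshold for illustration; the disclosure fixes no value.
    constexpr std::size_t kSplitThresholdBytes = 4 * 1024 * 1024;

    // Split only when the task carries enough data and is compute-bound:
    // if transmitting or reading the data already dominates, dividing the
    // computation further would not help.
    bool should_split(std::size_t task_bytes, double exec_time_s, double io_time_s) {
        return task_bytes > kSplitThresholdBytes && exec_time_s > io_time_s;
    }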
According to an embodiment of the present disclosure, the method of the present disclosure further comprises: in response to an error occurring in one or more sub-tasks, re-running the sub-tasks in which the error occurred.
When multiple sub-tasks execute in the execution queues PQ, errors may occur, such as erroneous operation results during execution, data-throughput errors, data-transmission errors, and the like. In the conventional scheme, where the total task is not divided into sub-tasks, any error during execution forces the whole total task to be executed again, which seriously wastes processing capacity and degrades the overall performance of the system.
In the scheme of the present disclosure, the multiple sub-tasks all sit in different execution queues that run independently and do not interfere with one another, so even if one sub-task errs during execution, the execution of the other sub-tasks is unaffected. Hence, if the execution of one sub-task errs, only that sub-task needs to be rerun; there is no need to rerun all sub-tasks or the whole total task. While the erroneous sub-task reruns, the other queues may sit idle or may execute other sub-tasks. Dividing a total task into multiple parallel sub-tasks therefore improves both the utilization of the system's processing resources and its processing efficiency.
According to one embodiment of the disclosure, in response to an error occurring in one or more of the sub-tasks, the sub-task in which the error occurred may itself be further split into a plurality of sub-tasks for parallel execution.
When a sub-task errs and must be re-executed, it can be added to the task allocation queue LQ as a new total task, divided further into a plurality of sub-tasks, and re-executed once across several parallel execution queues PQ. Re-executing the erroneous sub-task as further-divided sub-tasks raises the system's operating efficiency still further: even when some sub-task's execution goes wrong, the time and processing resources spent correcting the mistake are greatly reduced.
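The recovery path can be sketched as follows; the SubTask type, the run-to-bool convention, and the simulated one-time failure are hypothetical conveniences for illustration, since real errors would be reported by the execution queues:

    // Sketch: only the failed sub-tasks re-enter the queue; a failed sub-task
    // could additionally be split further before being re-queued.
    #include <functional>
    #include <iostream>
    #include <vector>

    struct SubTask {
        std::function<bool()> run;  // returns false on error
    };

    void execute_with_retry(std::vector<SubTask> tasks) {
        while (!tasks.empty()) {
            std::vector<SubTask> failed;
            for (auto& t : tasks)
                if (!t.run()) failed.push_back(t);  // rerun only what failed
            // A further division of each failed sub-task could happen here.
            tasks = std::move(failed);
        }
    }

    int main() {
        int attempts = 0;
        execute_with_retry({
            {[] { std::cout << "b1 ok\n"; return true; }},
            {[&attempts] { return ++attempts > 1; }}  // b2 fails once, then succeeds
        });
        std::cout << "total task complete after retry\n";
    }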
The tasks described above may be of many kinds, such as computation tasks, multiplication tasks, convolution tasks, weight-computation tasks, communication tasks, and so forth. Depending on the task, the allocation queue may be a communication queue used in a deep learning framework, such as the dedicated communication queue (comm_queue) used in TensorFlow or PyTorch, and the execution queue may be an execution queue in a communication library, such as the internal execution queue (internal_queue) in the NCCL communication library.

FIG. 5 shows a schematic diagram of an apparatus for executing asynchronous tasks, the apparatus comprising: a dividing unit M510 configured to divide a total task in a task queue into a plurality of sub-tasks, where each sub-task is in a different sub-task queue; a sub-task execution unit M520 configured to execute the plurality of sub-tasks in parallel; and an ending unit M530 configured to, in response to all of the sub-tasks being completely executed, complete the total task.
The present disclosure also provides a chip comprising the apparatus as shown in fig. 5.
The present disclosure also provides an electronic device comprising a chip as described above.
The present disclosure also provides an electronic device, including: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical scheme of the present disclosure can be applied to the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 6 illustrates a combined processing device 600 that includes the computing device 602 described above, a universal interconnect interface 604, and other processing devices 606. The computing device according to the present disclosure interacts with the other processing devices to jointly complete operations specified by the user.
The other processing devices include one or more kinds of general-purpose or special-purpose processors, such as central processing units (CPUs), graphics processing units (GPUs), and neural network processors; the number of processors they include is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfer and basic control of the machine learning computing device such as starting and stopping it; they may also cooperate with the machine learning computing device to complete computing tasks.
The universal interconnect interface transfers data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains required input data from the other processing devices and writes it to on-chip storage of the computing device; it can obtain control instructions from the other processing devices and write them to an on-chip control cache; and it can also read the data in its own memory module and transmit it to the other processing devices.
Optionally, the structure may further comprise a storage device 608 connected to the computing device and to the other processing devices, respectively. The storage device stores data of the computing device and the other processing devices, and it is particularly suitable for data that cannot be held entirely in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as the SOC (system-on-chip) of equipment such as mobile phones, robots, drones, and video-monitoring devices, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the universal interconnect interface of the combined processing device connects to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
In some embodiments, the disclosure also discloses a chip packaging structure, which includes the chip.
In some embodiments, the disclosure also discloses a board card comprising the chip packaging structure. Referring to fig. 7, an exemplary board card is provided that may include, in addition to the chip 702, other components including but not limited to: a memory device 704, an interface arrangement 706, and a control device 708.
The memory device is connected through a bus to the chip in the chip packaging structure and is used for storing data. The memory device may include a plurality of groups of memory cells 710, each group connected to the chip through a bus. Each group of memory cells may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency: it allows data to be read out on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of memory cells, each group including a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, with 64 of the 72 bits used for data transmission and 8 bits for ECC checking. In one embodiment, each group of memory cells includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice within one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each memory cell.
The interface device is electrically connected to the chip in the chip packaging structure and is used to enable data transmission between the chip and an external device 712, such as a server or a computer. For example, in one embodiment the interface device may be a standard PCIe interface: data to be processed is transmitted from the server to the chip through the standard PCIe interface, accomplishing the data transfer. In another embodiment, the interface device may be another interface; the disclosure does not limit its concrete form, as long as the interface unit can perform the transfer function. In addition, the computation results of the chip are transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected to the chip and is used for monitoring the chip's state; specifically, the chip and the control device may be connected through an SPI interface. The control device may include a single-chip microcomputer (MCU). The chip may include multiple processing chips, processing cores, or processing circuits and may drive multiple loads, so it can be in different working states such as multi-load and light-load; the control device regulates the working states of the processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division into units is only one kind of division by logical function, and other divisions are possible in actual implementations: multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or of other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer-readable memory. With this understanding, the technical solution of the present disclosure can be embodied in the form of a software product stored in a memory and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present disclosure. The aforementioned memory includes: a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. For a person skilled in the art, following the ideas of the present disclosure, there may be variations in the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present disclosure.

Claims (15)

1. A method of performing an asynchronous task, comprising:
dividing a total task in a task queue into a plurality of sub-tasks, wherein each sub-task is in a different sub-task queue;
executing the plurality of sub-tasks in parallel;
and in response to all of the sub-tasks being completely executed, completing the total task.
2. The method of claim 1, wherein dividing a total task in a task queue into a plurality of sub-tasks comprises:
inserting, in the total task queue, a first write identifier that allows the total task to start executing;
inserting, in each sub-task queue, a first wait identifier that prohibits the sub-tasks from starting to execute; and
when the first write identifier has not been executed, executing the first wait identifier to prohibit the sub-tasks from starting to execute.
3. The method of claim 2, wherein executing the plurality of sub-tasks in parallel comprises:
in response to the first write identifier being executed, turning off the first wait identifier, thereby executing the plurality of sub-tasks in parallel.
4. The method of any of claims 1-3, further comprising: inserting a second wait identifier in the total task queue to prohibit other tasks after the total task from being executed.
5. The method of claim 4, further comprising:
each time one sub-task finishes executing, modifying the second wait identifier, until all sub-tasks have finished executing;
and in response to all of the sub-tasks being completely executed, changing the second wait identifier into an end-of-wait identifier, whereby the total task is completely executed.
6. The method of any of claims 1-5, wherein a total task in the task queue is divided into a plurality of sub-tasks that are equivalent in execution time.
7. The method of any of claims 1-6, wherein the total task is divided into a plurality of sub-tasks in response to an amount of data of the total task exceeding a particular threshold.
8. The method of any of claims 1-7, further comprising: in response to an error occurring in one or more sub-tasks, re-running the sub-tasks in which the error occurred.
9. The method of any of claims 1-8, further comprising: in response to an error occurring in one or more of the sub-tasks, further splitting the erroneous sub-task into a plurality of sub-tasks for parallel execution.
10. The method of any of claims 1-9, wherein the task queue is a communication task queue.
11. An apparatus to perform asynchronous tasks, comprising:
a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, wherein each sub-task is in a different sub-task queue;
a sub-task execution unit configured to execute the plurality of sub-tasks in parallel;
and an ending unit configured to, in response to all of the sub-tasks being completely executed, complete the total task.
12. A chip comprising the apparatus of claim 11.
13. An electronic device comprising the chip of claim 12.
14. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-10.
15. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method of any one of claims 1-10.
CN202011610670.7A (filed 2020-12-30, priority 2020-12-30) — Method, device and computer program product for executing asynchronous task — Pending — published as CN114691311A

Priority Applications (2)

Application Number — Title
CN202011610670.7A — Method, device and computer program product for executing asynchronous task (priority date 2020-12-30)
PCT/CN2021/138702 — Method for executing asynchronous task, device, and computer program product (published as WO2022143194A1)

Applications Claiming Priority (1)

Application Number — Title
CN202011610670.7A — Method, device and computer program product for executing asynchronous task (filed 2020-12-30)

Publications (1)

Publication Number — Publication Date
CN114691311A — 2022-07-01

Family

ID=82132920

Family Applications (1)

Application Number — Title
CN202011610670.7A — Method, device and computer program product for executing asynchronous task (Pending)

Country Status (1)

Country — Publication
CN — CN114691311A

Cited By (1)

Publication number — Priority date — Publication date — Assignee — Title
CN116594745A — 2023-05-11 — 2023-08-15 — Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. — Task execution method, system, chip and electronic device

* Cited by examiner, † Cited by third party


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination