WO2022143194A1 - Method for executing asynchronous task, device, and computer program product - Google Patents

Method for executing asynchronous task, device, and computer program product

Info

Publication number
WO2022143194A1
WO2022143194A1 (PCT/CN2021/138702)
Authority
WO
WIPO (PCT)
Prior art keywords
task
acceleration
sub
tasks
communication
Prior art date
Application number
PCT/CN2021/138702
Other languages
French (fr)
Chinese (zh)
Inventor
柴安晨
吕尧
梁帆
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011610670.7A external-priority patent/CN114691311A/en
Priority claimed from CN202110055097.6A external-priority patent/CN114764374A/en
Application filed by 安徽寒武纪信息科技有限公司 filed Critical 安徽寒武纪信息科技有限公司
Publication of WO2022143194A1 publication Critical patent/WO2022143194A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt

Definitions

  • the present disclosure relates to the field of computers, and more particularly, to serial and parallel execution of tasks.
  • training tasks including computing tasks, communication tasks, control logic tasks, etc.
  • special acceleration chips for execution, such as GPU, MLU, TPU, etc.
  • the network training task will be sent by the CPU to the accelerator card for execution in an asynchronous form.
  • the accelerator card has the concept of a task queue.
  • the tasks on the same queue will be executed in the order in which they are issued; therefore, tasks on the same queue have dependencies, while tasks on different queues can execute concurrently based on the availability of hardware resources. However, current training tasks are usually executed in only one queue, which inevitably affects the execution efficiency of the tasks.
  • the current mainstream frameworks (such as Tensorflow, Pytorch) only use a dedicated communication queue (comm_queue) to perform communication tasks.
  • when the communication library responsible for the communication task acquires the task, it usually sends the task directly to the framework's comm_queue or to the communication library's internal_queue for execution; an example is the NCCL communication library, which is responsible for communication between GPUs.
  • the communication tasks are all executed in one queue, and when an error occurs in a communication task, the communication task needs to be re-executed from the beginning, thereby reducing overall communication efficiency.
  • a method for executing an asynchronous task, comprising: dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; executing the plurality of sub-tasks in parallel; and, in response to the completion of the execution of the sub-tasks, completing the execution of the total task.
  • an apparatus for executing asynchronous tasks includes: a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, and each sub-task is in a different sub-task queue;
  • the sub-task execution unit is configured to execute the plurality of sub-tasks in parallel;
  • the end unit is configured to complete the execution of the total task in response to completion of the sub-task execution.
  • a chip comprising the apparatus as described above.
  • an electronic device including the chip as described above.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to execute the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
  • the technical solution of the present disclosure can allocate a total task to different sub-task queues, thereby accelerating the execution of the total task.
  • even if there is an error in the execution of one sub-task queue, there is no need to re-execute all tasks, thereby reducing the cost of task fault tolerance or retransmission, reducing the burden of task execution, and realizing task fault tolerance or retransmission processing without the user's perception.
  • One purpose of the present disclosure is to overcome the defects in the prior art that a task cannot be issued to multiple queues for parallel execution, communication or computing resources cannot be fully utilized, and fault tolerance is low.
  • a method for performing a communication task in an accelerator card system, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with each other, and one accelerator card among the plurality of accelerator cards can communicate with another accelerator card through a communication path; the method includes: establishing a communication task queue, the communication task queue including a communication task and a state identifier for monitoring the execution state of the communication task; establishing a communication task execution queue for executing communication tasks between the accelerator cards through the communication path; and, in response to the execution of the communication task, changing the state identifier to monitor the execution state of the communication task.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • At least one beneficial effect of the technical solutions of the present disclosure is to distinguish the communication task queue from the communication task execution queue, so that tasks such as fault tolerance or retransmission can be performed without the user's perception.
  • the technical solution of the present disclosure can also allocate one general communication task to different sub-communication task queues, thereby accelerating the execution of the general communication task.
  • even if an error occurs in the execution of a certain sub-communication task queue, it is not necessary to re-execute all the sub-communication tasks, thereby reducing the burden of task execution.
  • Fig. 1a shows a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure
  • Figure 1b shows a schematic diagram of a task issuing queue and a task execution queue according to an embodiment of the present disclosure
  • Figure 2a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure
  • Fig. 2b shows a schematic diagram of inserting a flag in a queue according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a queue according to another embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the inserted second waiting flag being modified according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure
  • Figure 6 shows a combined processing device
  • Figure 7 shows an exemplary board
  • FIG. 8a is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • FIGS. 8b, 9, 10, 11, and 12a-12c are multiple schematic structural diagrams of acceleration units according to embodiments of the present disclosure.
  • FIGS. 13-18 are schematic structural diagrams of acceleration components according to an embodiment of the present disclosure.
  • FIGS. 19a-19c are schematic diagrams showing acceleration components as network topologies
  • FIG. 20 is a schematic diagram of an acceleration device including a plurality of acceleration units according to an embodiment of the present disclosure
  • FIG. 21 is a schematic diagram of a network topology corresponding to an acceleration device in an embodiment
  • FIG. 22 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment
  • FIGS. 23-27 are multiple schematic diagrams of an acceleration device including multiple acceleration components according to an embodiment of the present disclosure.
  • FIG. 29 is a schematic diagram of a matrix network topology based on the wireless extension of an acceleration device
  • FIG. 30 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
  • FIG. 31 is a schematic diagram of a network topology of another acceleration device
  • FIG. 32 is a schematic diagram of a network topology of another acceleration device
  • FIG. 35 shows a flowchart of a method for performing a communication task in an accelerator card system according to an embodiment of the present disclosure
  • Figure 36a shows a flowchart of a method for performing a communication task according to one embodiment of the present disclosure
  • Figure 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure
  • Figure 37a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure
  • Figure 37b shows a schematic diagram of inserting a flag in a queue according to one embodiment of the present disclosure
  • Figure 38 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • FIG. 39 is a schematic diagram illustrating that the insertion of the second waiting flag is modified according to an embodiment of the present disclosure.
  • communication and computing tasks are usually dispatched as asynchronous tasks to different task queues (queues) on acceleration chips (such as GPU, MLU) for execution; asynchronous tasks on the same queue are executed serially in the order in which they are dispatched, while tasks on different queues can be executed concurrently.
  • although the communication task is used as an example above, the tasks in the present disclosure are not limited to communication tasks, but also involve various tasks such as the operation or training of neural networks.
  • Fig. 1a shows a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure
  • Fig. 1b shows a schematic diagram of a task delivery queue and a task execution queue according to an embodiment of the present disclosure.
  • the method of the present disclosure includes: in operation S110, dividing a total task in the task queue into a plurality of sub-tasks, and each sub-task is in a different sub-task queue; in operation S120, executing in parallel the plurality of sub-tasks; and in operation S130, in response to the execution of the sub-tasks being completed, the execution of the total task is completed.
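  • As a rough Python sketch (illustrative only, not the patent's implementation; the names `split_task` and `run_total_task` are invented here), operations S110-S130 amount to splitting the work, running the pieces in parallel, and treating the total task as complete only when every piece has finished:

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(total_task, n):
    """S110: divide a total task (here, a list of work items) into n
    sub-tasks, each destined for a different sub-task queue."""
    return [total_task[i::n] for i in range(n)]

def run_total_task(total_task, n=2, work=sum):
    sub_tasks = split_task(total_task, n)          # S110
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(work, sub_tasks))  # S120: parallel execution
    return results                                 # S130: all sub-tasks done

print(run_total_task(list(range(10))))  # [20, 25]: two partial sums
```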
  • the task delivery queue can receive multiple tasks, such as tasks A, B and C. When these tasks A, B and C enter the task allocation queue LQ, they are serialized, and the execution order is A, B, and C. That is, while task A is being executed, tasks B and C need to wait; task B can only be executed after task A is executed, and task C needs to wait for task B to be executed before it can be executed.
  • Such a task execution method cannot make full use of the parallel running resources of the system, especially when the execution time of a certain task is particularly long or the amount of communication data is particularly large, the execution of other tasks will be obviously blocked, and the system performance will be affected.
  • the tasks in the task allocation queue LQ can be regarded as a total task, and the total task is divided into multiple sub-tasks executed in parallel, and placed in the task execution queue PQ for execution.
  • the execution efficiency of the task can be significantly improved.
  • the overall task B can be divided into a plurality of sub-tasks b1, b2, etc., and two sub-tasks b1 and b2 are used as an example for description here.
  • the number of divided tasks may be other numbers, which depend on the execution capability for the divided tasks and/or the size of the total task. For example, if the execution capability for each sub-task is strong, the total task can be divided into a smaller number of sub-tasks; and, for the same execution capability, if a certain total task is larger, that total task can be divided into a greater number of sub-tasks.
  • the two sub-tasks b1 and b2 can be executed in parallel in the execution queues PQ1 and PQ2 .
  • the relationship between the total task B and the sub-tasks needs to meet the following rules: 1. when the total task B has not started to be executed, the sub-tasks b1 and b2 should also be in the not-started state; 2. when the total task B starts to be executed, the sub-tasks b1 and b2 should also start to be executed; 3. other tasks (such as C) after task B in the task allocation queue LQ need to wait for the completion of task B before they can be executed; and 4. when the sub-tasks b1 and b2 are all executed, the total task B should also be completed.
  • FIG. 2a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure.
  • dividing a total task in the task queue into a plurality of sub-tasks (S110) includes: in operation S1110, inserting, in the queue, a first write flag that allows the total task to start executing; in operation S1120, inserting, in the sub-task queues, a first waiting flag that prohibits the sub-tasks from starting execution; and, in operation S1130, when the first write flag has not been executed, the first waiting flag prohibits the sub-tasks from starting execution.
  • FIG. 2b shows a schematic diagram of inserting a marker in a queue according to an embodiment of the present disclosure. The specific implementation of FIG. 2a will be described in detail below in conjunction with FIG. 2b.
  • a write flag, exemplarily denoted F0, is inserted before the task to be executed; only when the write flag F0 is executed, or when the write flag F0 is changed to a state that allows execution, does the subsequent task B start to be executed. While the write flag F0 has not been executed, the corresponding task does not start executing.
  • the write flag can be inserted through an Atomic Operation.
  • the so-called atomic operation refers to an operation that is not interrupted by the thread scheduling mechanism; once this operation starts, it runs to the end without any context switching in between.
  • a waiting flag f0 may be inserted before each sub-task, the waiting flag indicating that execution of the sub-tasks after the flag is prohibited. It needs to be understood that although the first write flag F0 and the waiting flag f0 in FIG. 2b have different names, they point to the same flag, so that a change to that same flag can be detected.
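  • The relationship between F0 and f0 as two names for one shared flag can be sketched with a shared `threading.Event` (an illustrative stand-in; the patent's flags are written via atomic operations on the accelerator, not Python events):

```python
import threading

flag = threading.Event()  # F0 and f0 are two views of this one shared flag
results = []

def sub_task(name):
    flag.wait()           # f0: sub-task is prohibited until the flag changes
    results.append(name)

workers = [threading.Thread(target=sub_task, args=(n,)) for n in ("b1", "b2")]
for w in workers:
    w.start()

# ... tasks ahead of B in the allocation queue would run here ...
flag.set()                # F0 is "executed": b1 and b2 may now start
for w in workers:
    w.join()

print(sorted(results))    # ['b1', 'b2']
```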
  • executing the plurality of sub-tasks in parallel includes: in response to the first write flag being executed, turning off the first waiting flag, thereby executing the plurality of sub-tasks in parallel .
  • the write flag F0 before the total task and the waiting flag f0 before the sub-tasks are related: only when the write flag F0 allows the execution of the subsequent total task is the waiting flag f0 ended and the corresponding sub-task started; if the write flag F0 does not allow the execution of the subsequent total task, the waiting flag f0 keeps the execution of the sub-tasks in a waiting state.
  • FIG. 3 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • a second waiting flag may be inserted into the total task queue to prohibit execution of other tasks after the total task.
  • a second waiting flag can be inserted after the first write flag; when the second waiting flag is executed, it indicates that other total tasks after the current total task need to be in a waiting state, and those other total tasks cannot start to execute until the current total task is executed.
  • the total task B corresponding to the first write flag F0 starts to be executed, that is, the sub-tasks b1 and b2 end the waiting state and start execution; after that, when the second waiting flag F1 in the allocation queue is executed, other tasks after the total task B in the allocation queue enter the waiting state and are not executed while the total task B is being executed.
  • FIG. 4 is a schematic diagram illustrating that the insertion of the second waiting flag is modified according to an embodiment of the present disclosure.
  • each time one sub-task is executed, the second waiting flag F1 is modified, until all sub-tasks are executed; in response to all sub-tasks having been executed, the second waiting flag F1 is changed to a waiting-end flag, so that the execution of the total task is completed.
  • each subtask b1 and b2 is executed in the execution queue PQ.
  • the second waiting flag F1 can be modified accordingly; for example, the second waiting flag F1 can be incremented by one.
  • the number of times that the second waiting flag F1 is modified is the same as the number of times that the sub-tasks are executed.
  • the second waiting flag F1 can initially be set with a target value; as the execution of the sub-tasks b1 and b2 completes, the second waiting flag F1 gradually approaches the target value, and when the second waiting flag F1 reaches the preset target value, it means that all the sub-tasks b1 and b2 have been executed. It should be understood that there can be many ways to modify the second waiting flag F1, and it is not limited to the "adding one" described above; for example, one can be subtracted on each modification until the second waiting flag F1 is less than a predetermined threshold. The present disclosure does not place any limitation on how the second waiting flag is modified.
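  • A minimal sketch of the assumed F1 semantics: a counter with a preset target value, incremented atomically once per completed sub-task, that releases waiting tasks when the target is reached (the class and method names are invented for illustration):

```python
import threading

class WaitFlag:
    """Illustrative model of the second waiting flag F1: a counter with a
    preset target value, incremented once per completed sub-task."""

    def __init__(self, target):
        self.target = target
        self.value = 0
        self._cond = threading.Condition()

    def sub_task_done(self):
        # Modify F1 (here, "add one") under a lock, i.e. atomically.
        with self._cond:
            self.value += 1
            if self.value >= self.target:
                self._cond.notify_all()

    def wait_total_task(self):
        # Tasks after the total task block until F1 reaches its target.
        with self._cond:
            self._cond.wait_for(lambda: self.value >= self.target)

f1 = WaitFlag(target=2)   # two sub-tasks, b1 and b2
f1.sub_task_done()        # b1 finishes
f1.sub_task_done()        # b2 finishes: F1 reaches the target value
f1.wait_total_task()      # returns immediately; total task B is complete
print(f1.value)           # 2
```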
  • the second waiting flag F1 reaches the target value can also be understood as a waiting end flag, which means that the current total task B has been executed, and other tasks can be executed.
  • the total task can be randomly divided into multiple sub-tasks; the total task can be divided into a fixed number of sub-tasks; or the number of processors serving the execution queue PQ can be used to divide the total task into a number of sub-tasks corresponding to the number of processors, and so on.
  • a total task in the task queue may be divided into multiple sub-tasks with equivalent execution time.
  • each subtask itself is the same size.
  • each processing core can undertake 25M operations.
  • the 4 processing cores will complete the operation in the same time, thereby reducing the total operation time as much as possible.
  • if a certain processing core also participates in other computing work, its processing capability is lower than that of the other processing cores, so the respective processing capabilities of the four processing cores should be considered when allocating the corresponding tasks, so that the time for each processing core to complete its operation is the same or about the same, which helps reduce the overall run time of the total task. Therefore, the principle of dividing the total task into multiple sub-tasks is to divide the tasks according to the capabilities of the resources that execute them, so that the multiple resources are equivalent in processing time.
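  • The capability-proportional division described above can be sketched as follows (the helper name and the capability numbers are invented for illustration; only the proportional-split principle comes from the text):

```python
def divide_by_capability(total_ops, capabilities):
    """Split total_ops so that each core's share is proportional to its
    capability, making per-core processing times roughly equal."""
    total_cap = sum(capabilities)
    shares = [total_ops * c // total_cap for c in capabilities]
    shares[-1] += total_ops - sum(shares)  # hand any remainder to the last core
    return shares

# Four equally capable cores: 100M operations -> 25M each, as in the example.
print(divide_by_capability(100, [1, 1, 1, 1]))  # [25, 25, 25, 25]
# If the fourth core is busy with other work (half capability), it gets less.
print(divide_by_capability(100, [2, 2, 2, 1]))
```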
  • in response to the data amount of the total task exceeding a certain threshold, the total task is divided into a plurality of sub-tasks. It should be understood that dividing the total task into multiple sub-tasks also needs to consider the total amount of data involved in each task. If the total amount of data involved in a certain task is small, and the processing time of the total task is less than the time for sending out the data generated by executing the total task, it is not necessary to divide the total task. Similarly, if the time for reading the data required by the total task constitutes a bottleneck, that is, the time for reading the data is greater than the time for executing the total task, there is no need to further divide the total task.
  • the method of the present disclosure further includes: in response to one or more sub-tasks having an error, re-running the faulty sub-task.
  • errors may occur, such as an error in the operation result during the execution process, an error in data throughput, an error in data transmission, and so on.
  • in the traditional scheme, if the total task is not divided into multiple sub-tasks, once an error occurs during the execution of the task, the entire total task needs to be re-executed, which seriously wastes processing power and causes the overall performance of the system to decline.
  • the sub-task in which the error occurred is further divided into a plurality of sub-tasks for parallel execution.
  • the sub-task with the error can be added to the task allocation queue LQ as a new total task, and that sub-task is further divided into multiple sub-tasks, which are then executed in parallel.
  • the sub-task with the error is re-executed in the execution queue PQ.
  • the sub-task with errors is further divided into multiple sub-tasks for re-execution, which further improves the operating efficiency of the system, so that even if an error occurs in the execution of a sub-task, the time and processing resources for correcting the error are greatly reduced.
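  • The fault-tolerance idea of re-dividing and re-executing only the faulty sub-task, while keeping the results of completed sub-tasks, can be sketched like this (a toy model with an invented transient failure; not the patent's implementation):

```python
def run_sub_task(items, poison=None):
    """Process one sub-task; raises if it contains the poisoned item
    (a stand-in for a transient execution error)."""
    if poison is not None and poison in items:
        raise RuntimeError(f"error while processing item {poison}")
    return sum(items)

def run_with_retry(total_task, n=2, poison=None):
    sub_tasks = [total_task[i::n] for i in range(n)]
    results = []
    for st in sub_tasks:
        try:
            results.append(run_sub_task(st, poison=poison))
        except RuntimeError:
            # Only the faulty sub-task is re-divided and re-executed;
            # sub-tasks that already completed are never run again.
            halves = [st[::2], st[1::2]]
            results.append(sum(run_sub_task(h) for h in halves))
    return sum(results)

# Item 4 fails once; only its sub-task [0, 2, 4, 6, 8] is redone, in halves.
print(run_with_retry(list(range(10)), poison=4))  # 45
```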
  • the allocation queue can be the communication queue used in the deep learning framework, such as the dedicated communication queue (Comm_queue) used in Tensorflow and Pytorch, and the execution queue can be the execution queue in the communication library, for example, the internal execution queue (Internal_queue) in the NCCL communication library.
  • the apparatus includes: a dividing unit M510, configured to divide a total task in the task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; a sub-task execution unit M520, configured to execute the plurality of sub-tasks in parallel; and an end unit M530, configured to complete the execution of the total task in response to the completion of the execution of the sub-tasks.
  • the present disclosure also provides a chip including the device shown in FIG. 5 .
  • the present disclosure also provides an electronic device including the chip as described above.
  • the present disclosure also provides an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to execute the method as described above.
  • the present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the technical solutions of the present disclosure can be applied to the field of artificial intelligence, and are implemented as or in an artificial intelligence chip.
  • the chip can exist alone or can be included in a computing device.
  • FIG. 6 shows a combined processing device 600 that includes the aforementioned computing device 602 , a general interconnection interface 604 , and other processing devices 606 .
  • the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • FIG. 6 is a schematic diagram of a combined processing device.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses a neural network to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, to complete basic controls such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • a universal interconnect interface for transferring data and control instructions between computing devices (including, for example, machine learning computing devices) and other processing devices.
  • the computing device obtains the required input data from other processing devices and writes it into the storage device on the computing device chip; it can obtain control instructions from other processing devices and write them into the control cache on the computing device chip; it can also read the data in the storage module of the computing device and transmit it to other processing devices.
  • the structure may further include a storage device 608, and the storage device is respectively connected to the computing device and the other processing device.
  • the storage device is used to save the data in the computing device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, improving the processing speed and reducing the overall power consumption.
  • the general interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above chip package structure.
  • a board card is provided.
  • the above-mentioned board card may also include other supporting components, including but not limited to: a storage device 704 , an interface device 706 and a control device 708.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the memory device may include groups of memory cells 710. Each group of the memory cells is connected to the chip through a bus. It can be understood that each group of the storage units may be DDR SDRAM (Double Data Rate SDRAM, double-data-rate synchronous dynamic random access memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM.
  • the storage device may include four sets of the storage units. Each group of the memory cells may include a plurality of DDR4 granules (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, and 64 bits of the above 72-bit DDR4 controllers are used for data transmission, and 8 bits are used for ECC verification. In one embodiment, each set of said memory cells includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip for controlling data transmission and data storage of each of the memory cells.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 712 (eg, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces, and the present disclosure does not limit the specific forms of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • MCU Micro Controller Unit
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or device, which includes the above board.
  • electronic equipment or devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound scanners and/or electrocardiographs.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection of devices or units, which may be electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: USB flash drives (U disks), read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program codes.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined”, “in response to the determination”, “once the [described condition or event] is detected”, or “in response to detection of the [described condition or event]”.
  • FIG. 35 shows a method for performing a communication task in an accelerator card system according to an embodiment of the present disclosure, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with each other, and one of the plurality of accelerator cards can communicate with another accelerator card through a communication path. The method includes: in operation S3510, establishing a communication task queue, where the communication task queue includes a communication task and a status identifier for monitoring the execution state of the communication task; in operation S3520, establishing a communication task execution queue for executing the communication task between the accelerator cards through a communication path; and, in operation S3530, in response to the execution of the communication task, changing the status identifier to monitor the execution status of the communication task.
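The queue-and-status mechanism of operations S3510 through S3530 can be sketched as follows. This is a minimal illustration only; the class names, status values, and queue structures are assumptions, not details taken from the disclosure.

```python
from collections import deque
from enum import Enum

class Status(Enum):  # hypothetical status identifier values (names assumed)
    PENDING = 0
    RUNNING = 1
    DONE = 2

class CommTask:
    """A communication task carrying its own status identifier (operation S3510)."""
    def __init__(self, name):
        self.name = name
        self.status = Status.PENDING

# S3510: the communication task queue holds tasks together with status identifiers
tasks = [CommTask(f"send_chunk_{i}") for i in range(3)]
task_queue = deque(tasks)

# S3520: the execution queue models tasks being executed over a communication path
exec_queue = deque()

while task_queue:
    task = task_queue.popleft()
    exec_queue.append(task)
    task.status = Status.RUNNING  # S3530: status identifier changes as execution proceeds
    # ... here the accelerator card would actually move data over the path ...
    task.status = Status.DONE
    exec_queue.popleft()

print([t.status.name for t in tasks])  # ['DONE', 'DONE', 'DONE']
```

The host can poll each task's status identifier at any time to monitor progress, which is the monitoring role operation S3530 describes.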
  • the accelerator card system herein consists of multiple accelerator cards that can communicate with each other. These accelerator cards can be communicatively connected through different communication paths, so that one accelerator card can reach another accelerator card via different communication paths, thus forming different communication topologies. It should be understood that “connection” in the following refers to a communicative connection, that is, the accelerator cards can communicate with each other and transmit data.
  • acceleration card system described above may be formed as an acceleration unit, an acceleration assembly, an acceleration device, or the like. It should be understood that although different terms are used in the context depending on the specific scenario, they are essentially systems that include multiple accelerator cards.
  • Fig. 8a is a schematic diagram showing the structure of an acceleration unit in an embodiment disclosed.
  • the accelerator card system may include an acceleration unit, and the acceleration unit may include M local-unit accelerator cards, each local-unit accelerator card including an internal port and being connected to the other accelerator cards of the unit through the internal port, wherein the M local-unit accelerator cards logically form an accelerator card matrix of L*N scale, and L and N are integers not less than 2.
  • an accelerator card matrix can be formed by a plurality of accelerator cards, and the accelerator cards are connected to each other, so that data or instructions can be transmitted and communicated.
  • the accelerator cards MC00 to MC0N form the 0th row of the accelerator card matrix
  • the accelerator cards MC10 to MC1N form the 1st row of the accelerator card matrix
  • the accelerator cards MCL0 to MCLN form the Lth row of the accelerator card matrix.
  • accelerator cards in the same acceleration unit are referred to as “local unit accelerator cards”, and the accelerator cards in other acceleration units are referred to as “external unit accelerator cards”.
  • such terms are only for convenience of description, and do not limit the technical solutions of the present disclosure.
  • Each accelerator card can have multiple ports, and these ports can be connected to the accelerator card of this unit or to the accelerator card of an external unit.
  • the connection ports between the accelerator cards of this unit may be referred to as internal ports
  • the connection ports between the accelerator cards of this unit and the external unit accelerator cards may be referred to as external ports.
  • the external port and the internal port are only for the convenience of description, and the same port may be used for both. This will be described below.
  • M can be any integer
  • the M accelerator cards can be formed into a 1*M or M*1 matrix, or into other types of matrices.
  • the acceleration unit of the present disclosure is not limited to a specific matrix size and form.
  • a single or multiple communication paths may be used to connect. This will be described in detail later.
  • the formed matrix is not necessarily a matrix in terms of physical arrangement; the accelerator cards can be at any positions, for example, multiple accelerator cards can form a straight line or be arranged irregularly.
  • the above matrix is only in terms of logic, as long as the connection between the accelerator cards forms a matrix relationship.
  • M may be 4, so that the 4 local-unit accelerator cards may logically form a 2*2 accelerator card matrix; M may be 9, so that the 9 local-unit accelerator cards may logically form a 3*3 accelerator card matrix; M may be 16, so that the 16 local-unit accelerator cards may logically form a 4*4 accelerator card matrix; M may also be 6, so that the 6 local-unit accelerator cards may logically form a 2*3 or 3*2 accelerator card matrix; M may also be 8, so that the 8 local-unit accelerator cards may logically form a 2*4 or 4*2 accelerator card matrix.
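The logical L*N matrix formation described above can be illustrated with a small helper that picks a factorization for M cards. The function name and the choice of the most balanced factorization are assumptions made purely for illustration, not part of the disclosure.

```python
def matrix_shape(m):
    """Pick a logical L*N shape for m accelerator cards (prefers L, N >= 2)."""
    best = (1, m)  # fallback: a 1*M line when m is prime
    for l in range(2, int(m ** 0.5) + 1):
        if m % l == 0:
            best = (l, m // l)  # keep the most balanced factorization found
    return best

# The examples from the text: 4 -> 2*2, 9 -> 3*3, 16 -> 4*4, 6 -> 2*3, 8 -> 2*4
for m in (4, 9, 16, 6, 8):
    print(m, matrix_shape(m))
```

Note that the matrix is purely logical: the helper only decides how cards are indexed, not where they sit physically.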
  • each local-unit accelerator card is connected to at least one other local-unit accelerator card through two paths.
  • two local accelerator cards may be connected through a single communication path, or may be connected through multiple (eg, two) paths, as long as the number of ports is sufficient. Connecting through multiple communication paths is beneficial to ensure the reliability of communication between the acceleration cards, and is helpful to form different topological structures. This will be explained and described in more detail in the examples below.
  • diagonally opposite local-unit accelerator cards located at the four corners of the accelerator card matrix are connected through two paths.
  • the connection of accelerator cards on the diagonal helps to form two complete communication loops. This will be explained and described in more detail in the examples below.
  • At least one of the local-unit accelerator cards may include an external port.
  • each acceleration unit may include four local-unit accelerator cards, and each local-unit accelerator card may include six ports; four ports of each local-unit accelerator card are internal ports used to connect with the other three local-unit accelerator cards, and the remaining two ports of at least one local-unit accelerator card are external ports used to connect with external-unit accelerator cards.
  • for each accelerator card of this unit, four ports can be used to connect the other accelerator cards of the unit, and the remaining two ports can be used to connect accelerator cards in other acceleration units.
  • These vacant ports can also be idle ports, not connected to any external device, or directly or indirectly connected to other devices or ports.
  • the following takes each acceleration unit including four accelerator cards as an example. It should be understood that each acceleration unit may include a greater or smaller number of accelerator cards.
  • the acceleration unit may include four accelerator cards, namely a first accelerator card, a second accelerator card, a third accelerator card and a fourth accelerator card, each of which is provided with an internal port and an external port; each accelerator card is connected to the other three accelerator cards through the internal port.
  • FIG. 8b is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • the acceleration unit 800 includes four accelerator cards, which are an accelerator card MC0, an accelerator card MC1, an accelerator card MC2, and an accelerator card MC3.
  • each accelerator card can include an external port and an internal port.
  • the internal port of the accelerator card MC0 is connected to the internal ports of the accelerator cards MC1, MC2 and MC3.
  • the internal port of the accelerator card MC1 is connected to the internal ports of the accelerator cards MC0, MC2 and MC3.
  • the internal port of the accelerator card MC2 is connected to the internal port of the accelerator card MC3; that is, the internal port of each accelerator card is connected to the internal ports of the other three accelerator cards.
  • the information exchange among the four accelerator cards can be realized through the interconnection of the internal ports of the four accelerator cards.
  • the embodiment of the present disclosure utilizes the interconnection among the four accelerator cards in the acceleration unit, which can improve the computing capability of the acceleration unit, realize high-speed processing of massive data, and make the path between each accelerator card and the other accelerator cards the shortest, with the lowest communication latency.
  • the number of accelerator cards in the present disclosure may not be limited to four, but may be other numbers.
  • the number N of accelerator cards is equal to 3; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other two accelerator cards through the internal port, so as to realize interconnection among the three accelerator cards.
  • the number N of accelerator cards is equal to 5; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other four accelerator cards through the internal port, so that the five accelerator cards are interconnected, which increases the computing power of the acceleration unit and realizes high-speed processing of massive data.
  • the number N of accelerator cards is greater than 5; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to all other accelerator cards through the internal port, so that the N accelerator cards are interconnected to achieve high-speed processing of massive data.
  • each acceleration card and at least one other acceleration card may be connected through two paths.
  • the first connection method is that each accelerator card can be connected to one of the other three accelerator cards through two paths
  • the second method is that each accelerator card can be connected to two of the other three accelerator cards through two paths
  • the third way is that each accelerator card can be connected with the other three accelerator cards through two paths; in this case, each accelerator card may need more ports.
  • FIG. 9 is a schematic structural diagram of an acceleration unit in another embodiment of the present disclosure.
  • each accelerator card and at least one other accelerator card can be connected by two paths, for example, the accelerator card MC0 and the accelerator card MC2 in the figure can be connected by two paths , and the accelerator card MC1 and the accelerator card MC3 can be connected by two paths.
  • the acceleration unit has been exemplarily described above with reference to FIG. 8 and FIG. 9 .
  • the arrangement of the accelerator cards in the acceleration unit may not be limited to the form shown in FIG. 8 and FIG. 9.
  • the four accelerator cards of the acceleration unit may be logically arranged in a quadrilateral arrangement. The description below will be made in conjunction with FIG. 10.
  • FIG. 10 is a schematic structural diagram of an acceleration unit in yet another embodiment of the present disclosure.
  • four accelerator cards MC0 , MC1 , MC2 and MC3 can be logically arranged in a quadrilateral arrangement, and the four accelerator cards can occupy four vertex positions of the quadrilateral.
  • the lines between the accelerator cards MC0, MC1, MC2 and MC3 form a quadrilateral, which makes the arrangement of the lines clearer and facilitates laying out the lines.
  • the four accelerator cards shown in Figure 10 are arranged in a rectangle or a 2*2 matrix, but this is a logical interconnection diagram. It is drawn in the form of a rectangle for the convenience of description.
  • the specific quadrilateral can be freely set, such as a parallelogram, trapezoid, or square.
  • the four accelerator cards can also be arranged arbitrarily.
  • the four accelerator cards are arranged side by side in a line shape, and the order can be MC0, MC1, MC2, MC3.
  • the logical quadrilateral described in this embodiment is exemplary, and in fact, the arrangement shape of multiple accelerator cards can be ever-changing, and the quadrilateral is only one of them. For example, when the number of accelerator cards is five, they can be logically arranged in a pentagon.
  • FIG. 11 is a schematic structural diagram of an acceleration unit in yet another embodiment of the present disclosure.
  • the four accelerator cards MC0 , MC1 , MC2 and MC3 can be logically arranged in a quadrilateral arrangement, and the four accelerator cards occupy four vertex positions of the quadrilateral respectively.
  • two paths can be used for the connection between the internal port of accelerator card MC1 and the internal port of accelerator card MC3, and two paths can be used between the internal port of accelerator card MC0 and the internal port of accelerator card MC2. In this way, for the acceleration unit 1100, not only is the wiring convenient, but reliability is also improved.
  • FIG. 12a is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • the numerals on each accelerator card represent ports; each accelerator card may include six ports, namely port 0, port 1, port 2, port 3, port 4 and port 5.
  • port 1, port 2, port 4 and port 5 are internal ports
  • port 0 and port 3 are external ports.
  • the 2 external ports of each accelerator card can be connected to other accelerator units for interconnection between multiple accelerator units.
  • the 4 internal ports of each accelerator card can be used to interconnect with the other three accelerator cards in this accelerator unit.
  • the four accelerator cards may be logically arranged in, for example, a quadrilateral; accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship, port 2 of MC0 is connected to port 2 of MC2, and port 5 of MC0 is connected to port 5 of MC2; that is, there can be two links for communication between accelerator card MC0 and accelerator card MC2.
  • Accelerator card MC1 and accelerator card MC3 can be in a diagonal relationship.
  • port 2 of MC1 is connected to port 2 of MC3, and port 5 of MC1 is connected to port 5 of MC3; that is, there can be two links for communication between accelerator card MC1 and accelerator card MC3.
  • since each accelerator card has two external ports and four internal ports, and the two pairs of accelerator cards are in diagonal relationships, the two accelerator cards of each diagonal pair can be connected through two internal ports to form two links, which effectively improves the security and stability of the acceleration unit.
  • the quadrilateral arrangement logically formed by the four accelerator cards makes the circuit layout of the entire acceleration unit reasonable and clear, and facilitates the wiring within each acceleration unit. It should further be noted the following about the interconnection lines between the four accelerator cards as shown in FIG. 12a.
  • the connection line between port 1 of accelerator card MC1 and port 1 of MC0, the connection line between port 2 of accelerator card MC0 and port 2 of MC2, the connection line between port 1 of accelerator card MC2 and port 1 of MC3, and the connection line between port 2 of accelerator card MC3 and port 2 of MC1: these four lines constitute a vertical figure-8 network, as shown in FIG. 12b.
  • the connection line between port 4 of accelerator card MC1 and port 4 of MC2, the connection line between port 5 of accelerator card MC2 and port 5 of MC0, the connection line between port 4 of accelerator card MC0 and port 4 of MC3, and the connection line between port 5 of accelerator card MC3 and port 5 of MC1 form a horizontal figure-8 network, as shown in FIG. 12c.
  • these two fully connected networks can form a double-ring structure, which provides redundant backup and enhances system reliability.
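Each of the two figure-8 networks described above visits all four cards, which is what yields the double-ring redundancy. A small sketch follows; the card and port labels mirror FIG. 12b/12c, while the ring check itself is an illustrative assumption, not part of the disclosure.

```python
# Each link is (card_a, port_a, card_b, port_b), following FIG. 12b and 12c.
vertical = [("MC1", 1, "MC0", 1), ("MC0", 2, "MC2", 2),
            ("MC2", 1, "MC3", 1), ("MC3", 2, "MC1", 2)]
horizontal = [("MC1", 4, "MC2", 4), ("MC2", 5, "MC0", 5),
              ("MC0", 4, "MC3", 4), ("MC3", 5, "MC1", 5)]

def is_ring(links):
    """Check the ring property: n links over n cards, every card with degree 2.

    This is sufficient for the four-card case here; a fully general check
    would also verify that the links form a single connected cycle.
    """
    degree = {}
    for a, _, b, _ in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return len(links) == len(degree) and all(d == 2 for d in degree.values())

print(is_ring(vertical), is_ring(horizontal))  # True True: a double-ring structure
```

Because each ring independently connects all four cards, either ring can carry traffic on its own, which is the redundancy-backup behavior the text describes.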
  • the accelerator card described in the present disclosure may be a Mezzanine Card (MC card for short), which may be a separate circuit board.
  • the MC card can be equipped with ASIC chips and some necessary peripheral control circuits.
  • the MC card can be connected to the base board through the pin board connector.
  • the power and control signals on the base board can be transmitted to the MC card through the daughter board connector.
  • the internal port and/or the external port described in the present disclosure may be a SerDes port.
  • each MC card can provide 6 bidirectional SerDes ports; each SerDes port has 8 channels with a data transmission rate of 56 Gbps per channel, so the total bandwidth of each port can be as high as 400 Gbps, which supports massive data exchange between accelerator cards and helps the acceleration unit process massive data at high speed.
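As a rough arithmetic check on the figures above: 8 channels at 56 Gbps gives 448 Gbps of raw line rate per port. One plausible reading, which is an assumption rather than something the source states, is that roughly 50 Gbps per channel remains as payload after encoding and FEC overhead (as in 400G Ethernet lane rates), which matches the quoted "up to 400 Gbps" per port.

```python
channels = 8
raw_per_channel_gbps = 56       # stated SerDes line rate per channel
payload_per_channel_gbps = 50   # assumed effective rate after overhead

raw_total = channels * raw_per_channel_gbps          # 448 Gbps raw per port
payload_total = channels * payload_per_channel_gbps  # 400 Gbps, matching the text
print(raw_total, payload_total)
```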
  • the SerDes mentioned above is a compound of the English words Serializer and De-Serializer, and is referred to as a serializer/deserializer.
  • the SerDes interface can be used to build clusters of high-performance processors.
  • the main function of SerDes is to convert multi-channel low-speed parallel signals into serial signals at the sending end, transmit them through the transmission medium, and finally convert the high-speed serial signals back into low-speed parallel signals at the receiving end, so it is very suitable for end-to-end long-distance high-speed transmission requirements.
  • the external port of the accelerator card can be connected to the QSFP-DD interface of other acceleration units, wherein the QSFP-DD interface is an optical module interface commonly used with SerDes technology and can be used in conjunction with cables for interconnection with other external devices.
  • one acceleration unit may be equipped with 4 accelerator cards, and the interconnection of the 4 accelerator cards may be completed using printed circuit board (PCB) wiring.
  • each accelerator card is connected to the other three accelerator cards through the internal port of the accelerator card, and each accelerator card can directly communicate with the other three accelerator cards.
  • This communication architecture is a fully connected quad network topology.
  • the advantage of this fully connected network architecture is that the path between each accelerator card and the other accelerator cards is the shortest, the total number of hops is the smallest, and the communication latency is the lowest.
  • the present disclosure uses Hop to describe the latency of the system; Hop represents the number of hops in communication, that is, the number of communication steps. Hop specifically refers to the length of the shortest path that starts from a node, traverses all nodes in the network, and returns to the initial node.
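The hop advantage of the fully connected four-card topology can be checked with a small breadth-first search. This is illustrative only; the adjacency dictionaries are assumptions modeling the fully connected unit of FIG. 8b and, for comparison, a plain four-card ring.

```python
from collections import deque

def hops(adj, src, dst):
    """Breadth-first search: minimum number of hops from src to dst."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # unreachable

full_mesh = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

# Fully connected: every pair is 1 hop apart; plain ring: opposite cards are 2 hops.
print(max(hops(full_mesh, a, b) for a in range(4) for b in range(4) if a != b))  # 1
print(max(hops(ring, a, b) for a in range(4) for b in range(4) if a != b))       # 2
```

In the fully connected topology every pairwise path is a single hop, which is the "shortest path, lowest latency" property claimed for the architecture.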
  • each ring in the dual-ring structure can separately complete a part of the operation, thereby improving the overall operation efficiency and maximizing the utilization of the topology bandwidth.
  • the present disclosure also discloses an acceleration assembly that may include a plurality of the above-mentioned acceleration units. Various embodiments of the acceleration assembly are exemplarily described below.
  • FIG. 13 is a schematic structural diagram of an acceleration component in an embodiment of the present disclosure.
  • the acceleration assembly 1300 may include n of the above-mentioned acceleration units; in other words, the accelerator card system may be embodied as an acceleration assembly that includes a plurality of acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An, wherein acceleration unit A1 and acceleration unit A2 are connected through external ports, and acceleration unit A2 and acceleration unit A3 are connected through external ports; that is, the acceleration units are connected through their external ports.
  • the external port of accelerator card MC0 in acceleration unit A1 can be connected with the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 can be connected with the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected through the external ports of their accelerator cards MC0.
  • the connection between the acceleration units in the present disclosure is not limited to connecting the external ports of accelerator card MC0, and may also include, for example, connecting the external ports of accelerator card MC1 and connecting the external ports of accelerator card MC2.
  • the connection mode of acceleration unit A2 and acceleration unit A3 may include: the external port of MC0 in A2 is connected to the external port of MC0 in A3, the external port of MC1 in A2 is connected with the external port of MC1 in A3, the external port of MC2 in A2 is connected with the external port of MC2 in A3, and the external port of MC3 in A2 is connected with the external port of MC3 in A3.
  • and so on, up to the connection between acceleration unit An-1 and acceleration unit An. It should be noted that the above description is exemplary; for example, the connection between different acceleration units is not limited to connecting accelerator cards with corresponding labels, and connecting accelerator cards with the same label can be set as required.
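The pattern of chaining units by connecting like-numbered cards through their external ports can be sketched as follows. The data structures and function name are assumptions for illustration only.

```python
def connect_units(unit_a, unit_b, num_paths):
    """Connect the first num_paths corresponding MC cards of two units via external ports."""
    return [(f"{unit_a}.MC{i}", f"{unit_b}.MC{i}") for i in range(num_paths)]

# A chain of units A1..A4, each joined to the next through all four MC cards
units = [f"A{k}" for k in range(1, 5)]
links = []
for a, b in zip(units, units[1:]):
    links += connect_units(a, b, 4)

print(len(links))  # 3 adjacent unit pairs * 4 paths = 12 external links
```

Passing 1, 2, or 3 as `num_paths` yields the sparser inter-unit wirings described later in conjunction with FIGS. 15 through 18.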
  • FIG. 13 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 as in the figure; it can also be set to, for example, 2 or 3. The connection relationship between two acceleration units is the same as or similar to that between the above-mentioned acceleration units A1 and A2, and the connection relationship among three acceleration units is the same as or similar to that among the above-mentioned acceleration units A1, A2 and A3.
  • the structures of the multiple acceleration units in the acceleration assembly may be the same or different.
  • the structures of the multiple acceleration units shown are the same, but in practice the structures of the multiple acceleration units may be different.
  • for example, the layout of the multiple accelerator cards in some acceleration units is a polygon while in other acceleration units it is a line, and in some acceleration units multiple accelerator cards are connected through two links, etc.
  • some acceleration units include four accelerator cards, and some acceleration units include three or five accelerator cards, etc.; that is, the structure of each acceleration unit can be set separately, and the structures of different acceleration units may be the same or different.
  • while processing data, each accelerator card can also share data through the interconnection between acceleration units. Since data sharing allows data to be obtained directly, the data propagation path and time are reduced, which plays a significant role in improving data processing efficiency.
  • FIG. 14 is a schematic structural diagram of an acceleration component in another embodiment of the present disclosure.
  • the acceleration component 1400 may include n of the aforementioned acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The acceleration units in the acceleration component 1400 may logically have a multi-layer structure (shown by dotted lines in the figure); each layer may include an acceleration unit, and the accelerator cards of each acceleration unit are connected to accelerator cards in another acceleration unit through external ports.
  • such a progressive configuration combination enables each accelerator card to share data through a high-speed serial link while processing data at high speed, realizing flexible configuration of the hardware computing power of the processor cluster.
  • the acceleration unit of each layer may include four acceleration cards, the acceleration units may be logically arranged in a quadrilateral arrangement, and the four acceleration cards are respectively arranged at four vertex positions of the quadrilateral.
  • the acceleration components described above in conjunction with FIG. 14 are exemplary and not limiting.
  • the structures of the multiple acceleration units may be the same or different.
  • the number of layers of the acceleration component can be 2 layers, 3 layers, 4 layers or more than 4 layers, and the number of layers can be freely set as required.
  • the number of connection paths between two acceleration units can be 1, 2, 3 or 4.
  • an exemplary description will be made below with reference to FIGS. 15-19 .
  • FIG. 15 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
  • the number of acceleration units in the acceleration component 1401 can be 2, and the two acceleration units are connected through one path; for example, the external port of accelerator card MC0 in acceleration unit A1 can be connected to the external port of accelerator card MC0 in acceleration unit A2 to realize information exchange between acceleration unit A1 and acceleration unit A2.
  • the number of acceleration units in the acceleration component 1402 can be two, and the two acceleration units are connected through two paths.
  • for example, the external port of accelerator card MC0 in acceleration unit A1 is connected with the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected with the external port of accelerator card MC1 in acceleration unit A2. In this way, when one of the paths fails, the other line still supports communication between the acceleration units, further improving the reliability of the acceleration component.
  • FIG. 17 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
  • the number of acceleration units can be 2, and the two acceleration units are connected through three paths.
  • for example, the external port of accelerator card MC0 in acceleration unit A1 is connected with the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected with the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected with the external port of accelerator card MC2 in acceleration unit A2.
  • FIG. 18 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
• the number of acceleration units can be 2, and the two acceleration units can be connected through four paths. For example, the external port of the accelerator card MC0 in the acceleration unit A1 is connected to the external port of the accelerator card MC0 in the acceleration unit A2, the external port of the accelerator card MC1 in the acceleration unit A1 is connected to the external port of the accelerator card MC1 in the acceleration unit A2, the external port of the accelerator card MC2 in the acceleration unit A1 is connected to the external port of the accelerator card MC2 in the acceleration unit A2, and the external port of the accelerator card MC3 in the acceleration unit A1 is connected to the external port of the accelerator card MC3 in the acceleration unit A2. In this way, even when three of the paths fail, there is still one path to support communication between the acceleration units, further improving the reliability of the acceleration component.
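The redundancy described above can be illustrated with a small sketch (not from the disclosure; the path labels are hypothetical names for the four card-to-card links between the acceleration units):

```python
# Illustrative sketch: selecting a healthy inter-unit path when some of
# the redundant links between acceleration units A1 and A2 fail.

def pick_path(paths, failed):
    """Return the first path that has not failed, or None if all failed."""
    for p in paths:
        if p not in failed:
            return p
    return None

# Hypothetical labels for the four redundant MC0-MC3 links.
paths = ["MC0-MC0", "MC1-MC1", "MC2-MC2", "MC3-MC3"]

# Even with three of the four paths down, communication can continue.
assert pick_path(paths, failed={"MC0-MC0", "MC1-MC1", "MC2-MC2"}) == "MC3-MC3"
# Only when all four paths fail is inter-unit communication lost.
assert pick_path(paths, failed=set(paths)) is None
```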
  • Figure 19a is a schematic diagram of the acceleration components represented as a network topology.
• the acceleration component 1405 may include two acceleration units, each acceleration unit may include four accelerator cards, and in each acceleration unit there may be two links between the accelerator card MC1 and the accelerator card MC3, and two links between the accelerator card MC0 and the accelerator card MC2.
• the acceleration component 1405 in the left figure of FIG. 19a can form the stereoscopic representation shown in the right figure.
  • the circles in the right figure of Figure 19a represent accelerator cards, and the lines represent link connections.
  • the number 0 in the circle represents the accelerator card MC0
  • the number 1 represents the accelerator card MC1
  • the number 2 represents the accelerator card MC2
  • the number 3 represents the accelerator card MC3.
• the figure on the right still shows the acceleration component 1405, merely in another form of expression, namely as a network topology.
  • the numbers embedded in the vertical lines in the right figure represent the connected port numbers.
• port 0 is used for the connection between the MC0 cards of the two acceleration units, port 0 is used for the connection between the MC1 cards, and port 3 is used for the connection between the MC2 cards.
  • one acceleration unit is regarded as a node, and two nodes have 8 accelerator cards, that is, two nodes constitute a so-called 8-card interconnection.
• the one-machine-four-card interconnection relationship inside each node is fixed.
• MC0 and MC1 in the upper node (i.e., acceleration unit A1) are connected to MC0 and MC1 of the lower node through port 0, respectively, and MC2 and MC3 of the upper node are connected to MC2 and MC3 of the lower node through port 3, respectively.
• This node topology is called a hybrid cube mesh topology; that is, the acceleration component 1405 forms a hybrid cube mesh network topology.
• the accelerator cards MC1 and MC3 in the acceleration unit A1 are connected via their respective internal ports 5, the accelerator cards MC0 and MC2 are connected via their respective internal ports 5, and the accelerator cards MC2 and MC3 are connected via their respective internal ports 1; meanwhile, the accelerator card MC1 in the acceleration unit A1 and the accelerator card MC1 in the acceleration unit A2 are connected through their respective external ports 0, and the accelerator card MC0 in the acceleration unit A1 and the accelerator card MC0 in the acceleration unit A2 are connected through their respective external ports 0.
• In this way, an independent ring is formed among the 8 cards in FIG. 19.
• the accelerator cards MC1 and MC3 in the acceleration unit A1 are connected via their respective internal ports 2, the accelerator cards MC0 and MC2 are connected via their respective internal ports 2, and the accelerator cards MC0 and MC1 are connected via their respective internal ports 1; the accelerator card MC2 in the acceleration unit A1 and the accelerator card MC2 in the acceleration unit A2 are connected through their respective external ports 3, and the accelerator card MC3 in the acceleration unit A1 and the accelerator card MC3 in the acceleration unit A2 are connected through their respective external ports 3.
• In this way, another independent ring is formed among the 8 cards in FIG. 19.
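The two independent rings described above can be sketched as follows. The traversal orders below are one possible way of walking the listed port connections and are illustrative only:

```python
# Sketch of the two independent 8-card rings across acceleration units
# A1 and A2 of the hybrid cube mesh. Card names use "unit.card" labels.

ring1 = ["A1.MC0", "A1.MC2", "A1.MC3", "A1.MC1",
         "A2.MC1", "A2.MC3", "A2.MC2", "A2.MC0"]
ring2 = ["A1.MC0", "A1.MC1", "A1.MC3", "A2.MC3",
         "A2.MC1", "A2.MC0", "A2.MC2", "A1.MC2"]

all_cards = {f"{u}.MC{i}" for u in ("A1", "A2") for i in range(4)}

for ring in (ring1, ring2):
    # Each ring visits every one of the 8 cards exactly once before
    # closing back to its start, so both rings can carry traffic
    # (e.g., half of a reduction each) independently.
    assert set(ring) == all_cards and len(ring) == 8
```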
  • FIG. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
• the acceleration device 2000 may include n of the above-described acceleration units, namely the acceleration unit A1, the acceleration unit A2, the acceleration unit A3, ..., and the acceleration unit An. The acceleration units in the acceleration device 2000 logically form a multi-layer structure (shown in dotted lines in the figure), where the number of layers may be odd or even, each layer may include one acceleration unit, and the accelerator card of each acceleration unit is connected through an external port to an accelerator card in another acceleration unit. Specifically, the acceleration unit A1 and the acceleration unit A2 are connected through an external port, the acceleration unit A2 and the acceleration unit A3 are connected through an external port, and so on, through to the acceleration unit An.
• the last acceleration unit can be connected with the first acceleration unit, so that the multiple acceleration units are connected end to end to form a ring structure; for example, the external port of the accelerator card MC0 of the acceleration unit An in the figure is connected to the external port of the accelerator card MC0 of the acceleration unit A1.
• Such a progressive configuration combination enables each accelerator card to share data through a high-speed serial link while processing data at high speed, realizing flexible configuration of the hardware computing power of the processor cluster.
• The connection relationship of the acceleration units in the acceleration device of the present disclosure has various cases, which have been described in detail above; reference may be made to the foregoing description of the connection relationships of the acceleration units, which will not be repeated here.
• the last acceleration unit is connected to the first acceleration unit, which may specifically include one or more of the following connection modes: the external port of MC0 in the acceleration unit A1 is connected to the external port of MC0 in An; the external port of MC1 in the acceleration unit A1 is connected to the external port of MC1 in An; the external port of MC2 in the acceleration unit A1 is connected to the external port of MC2 in An; and the external port of MC3 in the acceleration unit A1 is connected to the external port of MC3 in An.
• FIG. 21 and FIG. 22 are various embodied forms of the acceleration device 2000 shown in FIG. 20. Therefore, the relevant description of the acceleration device 2000 shown in FIG. 20 can also be applied to the acceleration devices in FIG. 21 and FIG. 22.
  • FIG. 21 is a schematic diagram of a network topology corresponding to an acceleration device in an embodiment.
  • the acceleration device 2001 shown in FIG. 21 can be composed of four acceleration units.
  • the circles represent accelerator cards, and the lines represent link connections.
• the number 0 in the circle represents the accelerator card MC0, the number 1 represents the accelerator card MC1, the number 2 represents the accelerator card MC2, and the number 3 represents the accelerator card MC3; the numbers embedded in the vertical lines in the figure represent the numbers of the connected ports.
• the last acceleration unit is connected to the first acceleration unit, and the total number of hops is 5.
  • Each acceleration unit is a node. Through the interconnection between nodes, 4 nodes and 16 cards can be interconnected.
• the four acceleration units form a small, internally interconnected cluster, which is called a supercomputing cluster (super pod).
• This topology is the primary recommended form for ultra-large-scale clusters; using high-speed SerDes ports, the total number of hops is 5 and the latency is the lowest.
• the cluster is thus more manageable and more robust.
  • FIG. 22 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment.
• the acceleration device 2002 shown in FIG. 22 has more acceleration units. It can be seen from the illustration that the last acceleration unit of the acceleration device 2002 is connected to the first acceleration unit. For an acceleration device configured in this way, the total number of hops is the number of nodes plus one, that is, the number of acceleration units plus one.
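The hop-count rule stated above can be checked against the ring examples given in this disclosure:

```python
# For a ring of acceleration units, the total number of hops is the
# number of units (nodes) plus one, as stated for FIG. 22.

def total_hops(num_units: int) -> int:
    return num_units + 1

assert total_hops(4) == 5    # FIG. 21: 4 acceleration units, 5 hops
assert total_hops(12) == 13  # FIG. 31: 12 nodes, 13 hops
assert total_hops(16) == 17  # FIG. 32: 16 nodes, 17 hops
```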
  • the acceleration device including a plurality of acceleration units is exemplarily described above with reference to FIGS. 20-22 .
• the present disclosure also provides an acceleration device that can include a plurality of the aforementioned acceleration components, which is described in detail below by way of example.
  • FIG. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
  • the acceleration system of the present disclosure may be implemented as an acceleration device.
• the acceleration device 3000 may include m of the aforementioned acceleration components.
• In each acceleration component, in addition to the external ports that need to be connected between the acceleration units within the component, there are also idle external ports, and the acceleration components are connected to each other through these idle external ports. For example, the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B1 can be connected with the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B2, the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B2 can be connected with the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B3, and so on, so that the multiple acceleration components are connected to each other.
  • the acceleration device shown in FIG. 23 is exemplary and not limiting, for example, the structures of the multiple acceleration components may be the same or different.
  • the manner of connecting different acceleration components through idle external ports may not be limited to the manner shown in FIG. 23 , and may also include other manners. For ease of understanding, an exemplary description will be made below with reference to FIGS. 24-32 .
• FIG. 24 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment.
• the acceleration device 3001 may include two acceleration components, the acceleration component B1 may include four acceleration units, and the acceleration component B2 may include four acceleration units; the first acceleration unit in the acceleration component B1 is connected with the first acceleration unit in the acceleration component B2, and the last acceleration unit in the acceleration component B1 is connected with the last acceleration unit in the acceleration component B2.
  • the total number of hops in this network topology is 9.
  • the network structure composed of multiple acceleration units in each acceleration component in FIG. 24 is logical, and the arrangement positions of multiple acceleration units can be adjusted as required in practical applications.
  • the number of acceleration units in each acceleration assembly may not be limited to the four shown in the figure, and may be set more or less as required, for example, six, eight, etc. may be set.
  • the acceleration device 3002 may include four acceleration components, namely, acceleration components B1 , B2 , B3 and B4 .
• each acceleration component may include two acceleration units A1 and A2, and each acceleration component may be interconnected with other acceleration components through one of its acceleration units A1 and A2.
  • the acceleration unit A1 in the acceleration component B1 is connected to the acceleration unit A1 in the acceleration component B2
  • the acceleration unit A1 in the acceleration component B2 is connected with the acceleration unit A1 in the acceleration component B3
• the acceleration unit A1 in the acceleration component B3 is connected with the acceleration unit A1 in the acceleration component B4; the connections here are all made through the external ports of the acceleration units.
• the connection between the acceleration components may specifically include: the acceleration unit A1 or A2 in the acceleration component B1 is connected with the acceleration unit A1 or A2 in the acceleration component B2, the acceleration unit A1 or A2 in the acceleration component B2 is connected with the acceleration unit A1 or A2 in the acceleration component B3, and the acceleration unit A1 or A2 in the acceleration component B3 is connected with the acceleration unit A1 or A2 in the acceleration component B4.
• each acceleration component can use two paths, through one of its first acceleration unit and second acceleration unit, to connect with one of the first acceleration unit and the second acceleration unit of another acceleration component.
• the first acceleration unit (e.g., acceleration unit A1) in the acceleration component B1 in the figure can be connected with the first acceleration unit (e.g., acceleration unit A1) in the acceleration component B2 through two paths, the acceleration unit A1 in the acceleration component B2 is connected with the acceleration unit A1 in the acceleration component B3 through two paths, and the acceleration unit A1 in the acceleration component B3 is connected with the acceleration unit A1 in the acceleration component B4 through two paths.
• The connection between the acceleration components can also be made in other ways. For example, the acceleration unit A1 or A2 in the acceleration component B1 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B2, the acceleration unit A1 or A2 in the acceleration component B2 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B3, and the acceleration unit A1 or A2 in the acceleration component B3 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B4.
  • FIG. 27 is a schematic diagram of the acceleration device in another embodiment of the present disclosure.
• the acceleration device 3004 includes four acceleration components, namely the acceleration component B1, the acceleration component B2, the acceleration component B3, and the acceleration component B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards.
  • MC0 and MC1 are the first pair of accelerator cards
  • MC2 and MC3 are the second pair of accelerator cards.
• the second pair of accelerator cards of the acceleration unit A1 of the acceleration component B1 is connected with the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B2; the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B2 is connected with the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B3; the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B3 is connected with the second pair of accelerator cards of the acceleration unit A1 of the acceleration component B4; and the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B4 is connected with the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B1.
  • FIG. 28 is a schematic diagram of a network topology of another acceleration device.
  • the acceleration device 3005 shown in FIG. 28 is a specific form of the acceleration device 3004 shown in FIG. 27 , so the above related descriptions about the acceleration device 3004 can also be applied to the acceleration device 3005 in FIG. 28 .
• each acceleration component of the acceleration device 3005 can form a hybrid cube mesh unit, and the interconnection relationship within each hybrid cube mesh unit can be as shown in the figure, to realize the 8-node, 32-card interconnection of the acceleration device 3005.
  • the four acceleration components can be interconnected with multiple cards and multiple nodes through, for example, QSFP-DD interfaces and cables, forming a matrix network topology.
• ports 0 of the accelerator cards MC2 and MC3 of the upper node of the acceleration component B1 in this embodiment may be respectively connected to the accelerator cards MC2 and MC3 of the lower node of the acceleration component B2; ports 3 of MC0 and MC1 of the lower node of the acceleration component B2 may be respectively connected to MC0 and MC1 of the upper node of the acceleration component B3; ports 0 of MC2 and MC3 of the lower node of the acceleration component B3 may be respectively connected to MC2 and MC3 of the upper node of the acceleration component B4; and ports 3 of MC0 and MC1 of the upper node of the acceleration component B4 may be respectively connected to MC0 and MC1 of the lower node of the acceleration component B1.
• the interconnection between the hybrid cube mesh units set in this way can form two bidirectional ring structures (as described above in conjunction with Fig. 12b, Fig. 12c, Fig. 19b and Fig. 19c), which has the advantages of better reliability and security; it is also suitable for deep learning training and has high computing efficiency.
• In the matrix network topology consisting of 8 nodes in the acceleration device 3005, the total number of hops is 11.
  • the first pair of accelerator cards and the second pair of accelerator cards in different acceleration units in the same acceleration assembly may be indirectly connected.
  • the accelerator cards MC0 and MC1 of the upper-layer acceleration unit in the acceleration component B1 are indirectly connected with the accelerator cards MC2 and MC3 of the lower-layer acceleration unit.
• FIG. 29 is a schematic diagram of the matrix network topology based on the unlimited expansion of the acceleration device.
• the acceleration device 3006 may include multiple acceleration components, each acceleration component (shown as a block in the figure) may include multiple acceleration units (a perspective view is not shown; refer to the structure of the acceleration component in FIG. 28), and each acceleration unit may include, for example, four interconnected accelerator cards as shown in the illustration, so the matrix network topology can theoretically expand infinitely.
  • FIG. 30 is a schematic diagram of the acceleration device in another embodiment of the present disclosure.
• the acceleration device 3008 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components can be connected in a ring.
  • the acceleration unit An of the acceleration component B1 can be connected with the acceleration unit A1 of the acceleration component B2
  • the acceleration unit An of the acceleration component B2 can be connected with the acceleration unit A1 of the acceleration component B3
• and so on, until the acceleration unit An of the acceleration component Bm is connected to the acceleration unit A1 of the acceleration component B1, so that the m acceleration components are connected end to end in a ring.
  • FIG. 31 is a schematic diagram of the network topology of another acceleration device.
• the acceleration device 3009 may include 6 acceleration components, each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, forming an interconnection of 12 nodes and 48 cards and thus a larger matrix network topology.
• the total number of hops under this network topology is 13.
  • FIG. 32 is a schematic diagram of a network topology of another acceleration device.
• the acceleration device 3010 includes 8 acceleration components, each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, forming an interconnection of 16 nodes and 64 cards and thus a larger matrix network topology. The total number of hops under this network topology is 17.
• On the basis of FIG. 32, the topology can be extended vertically to form super-large-scale matrix networks such as 20 nodes with 80 cards and 24 nodes with 96 cards. In theory, it can be extended infinitely, and the total number of hops is the number of nodes plus one. By optimizing the interconnection between nodes, the latency of the entire system can be minimized, and the real-time requirements of the system can be met to the greatest extent while processing massive data.
  • the acceleration device including a plurality of acceleration components has been exemplarily described above with reference to FIGS. 23 to 32.
• Those skilled in the art can understand that the above description is exemplary rather than limiting; for example, the number and structure of the acceleration components and the connection relationships between the acceleration components can be adjusted as needed.
  • Those skilled in the art can also combine the above multiple embodiments to form an acceleration device as required, which is also within the protection scope of the present disclosure.
• The accelerator-card fully connected square network (topology), hybrid cube mesh (topology), matrix network (topology), etc. described in this disclosure are all logical, and the specific layout can be adjusted as required.
  • the topology disclosed in the present disclosure can also perform data reduction operations.
• the reduction operation can be performed across the accelerator cards of each acceleration unit, within each acceleration component, and within the acceleration device.
  • the specific operation steps can be as follows.
• the reduction operation process performed in one acceleration unit may include: transferring the data stored in the first accelerator card to the second accelerator card, and adding the data originally stored in the second accelerator card to the data received from the first accelerator card; then, transferring the result of the addition operation in the second accelerator card to the third accelerator card, where the addition operation is performed again; and so on, until all the data stored in the accelerator cards have been added and each accelerator card has received the final operation result.
  • the accelerator card MC0 stores data (0, 0)
  • the accelerator card MC1 stores data (1, 2)
  • the accelerator card MC2 stores data (3, 1).
  • data (2,4) is stored in the accelerator card MC3.
• the data (0,0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1,2) is obtained after the addition operation; then, the result (1,2) is transferred to the accelerator card MC2, and the next result (4,3) is obtained; then, the next result (4,3) is transferred to the accelerator card MC3 to obtain the final result (6,7).
• the final result (6,7) then continues to be transmitted to each of the accelerator cards MC0, MC1, MC2 and MC3, so that the data (6,7) is stored in all the accelerator cards, thus completing the reduction operation in one acceleration unit.
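The sequential walk described above can be sketched as follows, a minimal illustration using the example data from the text:

```python
# Sequential reduction within one acceleration unit: a running
# element-wise sum is passed card-to-card, and the final result is
# then broadcast back to every card.

def ring_reduce(cards):
    total = cards[0]
    for data in cards[1:]:
        # Each hop adds the next card's data to the running result.
        total = tuple(a + b for a, b in zip(total, data))
    # Every card receives the final result.
    return [total] * len(cards)

cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3
assert ring_reduce(cards) == [(6, 7)] * 4
```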
  • the acceleration unit shown in FIG. 11 can form two independent rings, and each ring can complete the reduction operation of half of the data, thereby speeding up the operation speed and improving the operation efficiency.
• when the above-mentioned acceleration unit performs the reduction operation, it can also realize concurrent computation across multiple accelerator cards, further speeding up the operation.
  • the accelerator card MC0 stores data (0,0)
  • the accelerator card MC1 stores data (1,2)
  • the accelerator card MC2 stores data (3,1)
  • the accelerator card MC3 stores data ( 2,4).
• Part of the data (the element 0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1) is obtained after the addition operation; synchronously, part of the data (the element 2) in the accelerator card MC1 can be transferred to the accelerator card MC2, and the result (3) is obtained, thereby realizing the concurrent operation of the accelerator cards MC1 and MC2; and so on, until the entire reduction operation is completed.
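This pipelined variant can be sketched as follows: each data element travels its own chain of additions starting at a different card, so different cards add different elements in the same time step.

```python
# Pipelined reduction sketch using the example data from the text.
cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3

# Step 1 runs concurrently on two cards: MC0 sends element 0 to MC1,
# while MC1 sends element 1 to MC2, matching the partial results above.
step1_mc1 = cards[0][0] + cards[1][0]  # 0 + 1 = 1
step1_mc2 = cards[1][1] + cards[2][1]  # 2 + 1 = 3
assert (step1_mc1, step1_mc2) == (1, 3)

# Continuing each chain over all cards yields the full reduction.
elem0 = sum(c[0] for c in cards)  # 0 + 1 + 3 + 2
elem1 = sum(c[1] for c in cards)  # 0 + 2 + 1 + 4
assert (elem0, elem1) == (6, 7)
```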
• the above-mentioned concurrent computation may further include: a group of accelerator cards performs an addition operation first, and then a reduction operation is performed on the operation result of this group of accelerator cards and the operation result of another group of accelerator cards.
  • the accelerator card MC0 stores data (0,0)
  • the accelerator card MC1 stores data (1,2)
  • the accelerator card MC2 stores data (3,1)
  • the accelerator card MC3 stores data ( 2,4)
• the data in the accelerator card MC0 can be transferred to the accelerator card MC1 for operation to obtain the first set of results (1,2); synchronously or asynchronously, the data in the accelerator card MC2 can be transferred to the accelerator card MC3 for operation to obtain the second set of results (5,5).
• Then, an operation is performed on the first set of results and the second set of results to obtain the final reduction result (6,7).
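The grouped variant described above can be sketched as follows: two pairs reduce concurrently, then the two partial results are combined.

```python
# Grouped (tree-style) reduction sketch using the example data.

def vadd(a, b):
    """Element-wise addition of two tuples."""
    return tuple(x + y for x, y in zip(a, b))

mc0, mc1, mc2, mc3 = (0, 0), (1, 2), (3, 1), (2, 4)

group1 = vadd(mc0, mc1)  # MC0 -> MC1 gives (1, 2)
group2 = vadd(mc2, mc3)  # MC2 -> MC3 gives (5, 5), possibly in parallel
assert group1 == (1, 2) and group2 == (5, 5)

# Combining the two group results yields the final reduction result.
assert vadd(group1, group2) == (6, 7)
```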
• reduction operations may also be performed in acceleration components or acceleration devices. It should be understood that the acceleration device can also be regarded as acceleration components connected end to end.
• When a reduction operation is performed in an acceleration component or an acceleration device, it may include: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the acceleration units to obtain a second reduction result.
  • the first step above has been described above.
  • a local reduction operation can be performed in each acceleration unit first. After the reduction operation in the acceleration unit is completed, the accelerator card in the same acceleration unit will obtain the result of the local reduction operation, which is referred to as the first reduction result here.
• the first reduction results in all acceleration units may then be transferred to adjacent acceleration units and added. Similar to the reduction operation performed in one acceleration unit, the first acceleration unit transmits its first reduction result to the second acceleration unit; after the addition operation is performed in the accelerator cards of the second acceleration unit, the result continues to be passed on and added. After the last addition, the final result is passed to each acceleration unit.
• Since the acceleration units in the acceleration components above are not necessarily connected end to end, when transmitting the final result to each acceleration unit, the result can be propagated in reverse instead of being transmitted cyclically as when the acceleration units are connected end to end.
• the technical solution of the present disclosure does not specifically limit how the final result is propagated.
• the acceleration device may also be configured to perform a reduction operation including: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results in the multiple acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results in the multiple acceleration components to obtain a second reduction result.
  • the reduction operation may be performed first in the same acceleration unit, which has been described above, and will not be repeated here.
• Next, a reduction operation can be performed within each acceleration component, so that each accelerator card in each acceleration component obtains the local reduction result of that acceleration component;
• finally, the reduction operation is performed across the acceleration device, so that each accelerator card obtains the global reduction result of the acceleration device.
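The three-level procedure above (unit, then component, then device) can be sketched as follows. The concrete per-card values are hypothetical, chosen only to make the levels visible:

```python
# Hierarchical reduction sketch: first within each acceleration unit,
# then across the units of each component, finally across components.

def reduce_groups(groups):
    """Element-wise sum within each group, one result tuple per group."""
    return [tuple(sum(col) for col in zip(*g)) for g in groups]

# device -> components -> units -> per-card data (hypothetical values)
device = [
    [[(0, 0), (1, 2)], [(3, 1), (2, 4)]],   # component B1: units A1, A2
    [[(1, 1), (0, 3)], [(2, 2), (1, 0)]],   # component B2: units A1, A2
]

# First reduction: within each unit of each component.
first = [reduce_groups(units) for units in device]
# Intermediate reduction: across the units of each component.
intermediate = reduce_groups(first)
# Second reduction: across the components of the device.
second = tuple(sum(col) for col in zip(*intermediate))

assert intermediate == [(6, 7), (4, 6)]  # local results per component
assert second == (10, 13)                # global result for the device
```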
• In the present disclosure, the communication task queue and the communication task execution queue are separated; distinguishing the two enables operations such as fault tolerance or retransmission to be performed without the user's perception.
  • a communication task may be delivered to any accelerator card in the accelerator card system as an asynchronous task, and a communication task queue may be formed, and the communication task queue and the communication task execution queue may be located on different accelerator cards. Communication tasks in the same communication queue will be executed sequentially. The execution of these communication tasks can be completed by another accelerator card, so that the communication task queue and the communication task execution queue are located on different accelerator cards.
• a communication task can also be divided into multiple sub-communication tasks for execution.
  • the tasks can be executed concurrently. Therefore, after a total communication task is divided into multiple sub-communication tasks and parallelized, the execution efficiency of the communication task will be greatly improved.
• For communication tasks such as Allreduce, data can be transmitted from one accelerator card to another through different communication paths. Therefore, when a total communication task is divided into multiple sub-communication tasks executed in parallel, different communication paths can be used to perform these sub-communication tasks.
• data can be transmitted from port 1 of the accelerator card MC1 to port 1 of the accelerator card MC0, then from port 2 of the accelerator card MC0 to port 2 of the accelerator card MC2, and finally from port 1 of the accelerator card MC2 to port 1 of the accelerator card MC3;
• the data can also be directly transmitted from port 2 of the accelerator card MC1 to port 2 of the accelerator card MC3;
• data can be transmitted from port 4 of the accelerator card MC1 to port 4 of the accelerator card MC2, then from port 5 of the accelerator card MC2 to port 5 of the accelerator card MC0, and finally from port 4 of the accelerator card MC0 to port 4 of the accelerator card MC3;
• data can also be directly transmitted from port 5 of the accelerator card MC1 to port 5 of the accelerator card MC3.
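Splitting one transfer across two of the paths above can be sketched with threads. This is an illustrative sketch with hypothetical helper names, not the disclosure's implementation:

```python
# Split a total MC1 -> MC3 transfer into two sub-transfers carried over
# two disjoint paths, executed concurrently.

import threading

PATHS = [
    ["MC1", "MC0", "MC2", "MC3"],  # multi-hop path via MC0 and MC2
    ["MC1", "MC3"],                # direct path
]

received = [None] * len(PATHS)

def send_chunk(path_idx, chunk):
    # Stand-in for hop-by-hop transmission along PATHS[path_idx];
    # here we only record what arrives at MC3.
    received[path_idx] = chunk

data = list(range(8))
half = len(data) // 2
threads = [
    threading.Thread(target=send_chunk, args=(0, data[:half])),
    threading.Thread(target=send_chunk, args=(1, data[half:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# MC3 reassembles the chunks; the total communication task is complete.
assert received[0] + received[1] == data
```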
  • multiple status flags can be set, and these status flags can monitor the execution of communication tasks and also control the execution of other communication tasks.
  • the execution of the communication task will change the state flag, and the change of the state flag will correspondingly change the execution of other communication tasks.
  • the communication task queue can be loaded on any one of the accelerator cards; preferably, the communication task queue can be loaded on an accelerator card with a low load. It should be understood that when multiple accelerator cards participate in computing and communication, the load of each accelerator card may differ, and an accelerator card with a low load may preferably be selected to carry the communication task queue.
  • the accelerator card can receive communication tasks from the host or other accelerator cards, form a queue, and control the execution of each communication task in the queue. This method helps to make full use of accelerator card resources and improve the overall operating efficiency of the system.
  • Fig. 36a shows a flowchart of a method for executing a communication task according to an embodiment of the present disclosure
  • Fig. 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure.
  • the method of the present disclosure further includes: in operation S3610, dividing a total communication task in the communication task queue into a plurality of sub-communication tasks, each sub-communication task being placed in a different communication task execution queue; in operation S3620, executing the plurality of sub-communication tasks in parallel through different communication paths; and in operation S3630, in response to the completion of the sub-communication tasks, causing the total communication task to be completed.
  • the communication task queue can receive multiple total communication tasks, such as total communication tasks A, B and C, etc.
  • when communication tasks A, B, and C enter the communication task queue LQ, they are serialized, and the execution order is A, B, C. That is, while communication task A is being executed, communication tasks B and C need to wait.
  • the communication task B can only be executed after the execution of the communication task A is completed; and the communication task C can be executed only after the execution of the communication task B is completed.
  • such a task execution method cannot make full use of the parallel operation resources of the system; especially when the execution time of a certain communication task is particularly long, or its amount of communication data is particularly large, the execution of other communication tasks will be noticeably blocked and system performance will be affected.
  • the communication task in the communication task queue LQ can be regarded as a total communication task, and the total communication task is divided into multiple sub-communication tasks executed in parallel, and placed in the communication task execution queue PQ for execution.
  • the execution efficiency of the task can be significantly improved.
  • the total communication task B can be divided into a plurality of sub-communication tasks b1, b2, etc., and two sub-communication tasks b1 and b2 are taken as an example for description here.
  • the number of sub-communication tasks may be another number, which may be determined according to the topology of the accelerator card system. For example, if there are more communication paths from one accelerator card to another, the total communication task can be divided into more sub-communication tasks; conversely, it can be divided into fewer sub-communication tasks. Alternatively, the larger the amount of data involved in the total communication task, the more sub-communication tasks it can be divided into.
  • the communication tasks can be executed in parallel in the communication task execution queues PQ1 and PQ2.
  • the execution of the total communication task B and its sub-communication tasks must satisfy the following rules: 1. when the total communication task B has not started executing, the sub-communication tasks b1 and b2 must also be in the not-started state; 2. when the total communication task B starts executing, the sub-communication tasks b1 and b2 must also start executing; 3. other tasks after task B in the communication task queue LQ (such as C) must wait until task B has finished executing before they can execute; 4. when all the sub-communication tasks b1 and b2 have finished executing, the total communication task B is also considered to have finished executing.
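The four rules above amount to a small synchronization protocol. As an illustrative sketch only (the class and method names are hypothetical, not from the disclosure), they can be modeled with a start event shared by all sub-tasks and a completion counter:

```python
import threading

class TotalTask:
    """Sketch of total task B whose sub-tasks b1, b2 run in separate execution queues."""

    def __init__(self, num_subtasks: int):
        self.start_flag = threading.Event()   # rules 1-2: gates the start of sub-tasks
        self.done_flag = threading.Event()    # rules 3-4: gates tasks queued after B
        self._remaining = num_subtasks
        self._lock = threading.Lock()

    def start(self):
        # Releases every waiting sub-task at once (rule 2).
        self.start_flag.set()

    def run_subtask(self, work):
        self.start_flag.wait()                # rule 1: do not start before B starts
        work()                                # the actual communication work
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:          # rule 4: the last sub-task completes B,
                self.done_flag.set()          # letting successors (rule 3) proceed
```

A task C queued behind B would call `done_flag.wait()` before executing, which is rule 3.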
  • FIG. 37a shows a flowchart of dividing a total communication task in the communication task queue into a plurality of sub-communication tasks according to an embodiment of the present disclosure.
  • dividing a total communication task in the communication task queue into a plurality of sub-communication tasks (S3610) includes: in operation S36110, setting a first write flag that allows the total communication task to start executing; in operation S36120, setting a first waiting flag that prohibits the sub-communication tasks from starting to execute; and, in operation S36130, when the first write flag has not been executed, executing the first waiting flag to prohibit the sub-communication tasks from starting to execute.
  • Figure 37b shows a schematic diagram of inserting a marker in a queue according to one embodiment of the present disclosure. The specific implementation of FIG. 37a will be described in detail below in conjunction with FIG. 37b.
  • a write flag needs to be set; in other words, a write flag needs to be inserted before the communication task to be executed, exemplarily denoted F0. Only when the write flag F0 is executed, or when the write flag F0 is changed to allow the execution of the next task, does the subsequent communication task B start to execute.
  • the write flag can be inserted through an Atomic Operation.
  • the so-called atomic operation refers to an operation that is not interrupted by the thread scheduling mechanism; once this operation starts, it runs until the end without any context switching in between.
  • a waiting flag f0 may be inserted before each sub-communication task, and the waiting flag indicates that the execution of the sub-communication task after the flag is prohibited.
  • although the first write flag F0 and the waiting flag f0 in FIG. 37b are given different names, the write flag F0 and the waiting flag f0 point to the same flag, so as to detect whether that same flag has changed.
  • the insertion positions of the first write flag F0 and the waiting flag f0 shown in FIG. 37b are only for convenience of understanding, and the flags are not necessarily inserted among the sub-communication tasks exactly as shown in FIG. 37b.
  • executing the plurality of sub-communication tasks in parallel includes: in response to the first write flag being executed, turning off the first waiting flag, thereby executing the plurality of sub-communication tasks in parallel communication tasks.
  • the write flag F0 before the total communication task and the waiting flag f0 are associated with each other. Only when the write flag F0 allows the execution of the subsequent total communication task does the waiting flag f0 end and the corresponding sub-communication tasks start to run; while the write flag F0 does not allow the execution of the subsequent total communication task, the waiting flag f0 likewise keeps the execution of the sub-communication tasks in a waiting state.
  • Figure 38 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • a second waiting flag may be set to prohibit execution of other communication tasks after the general communication task.
  • a second waiting mark can be inserted after the first writing mark.
  • when the second waiting flag is executed, it indicates that the other total communication tasks after the current total communication task need to be in a waiting state; before the current total communication task has finished executing, the other total communication tasks cannot start to execute.
  • when the first write flag F0 is executed, the total communication task B corresponding to it starts to execute, that is, the sub-communication tasks b1 and b2 of the total communication task B end their waiting state and start executing; after that, when the second waiting flag F1 is executed, the other tasks after the total communication task B enter a waiting state and are not executed while the total communication task B is executing.
  • FIG. 39 shows a schematic diagram of modifying the second waiting flag according to an embodiment of the present disclosure.
  • the second waiting flag F1 is modified each time a sub-communication task finishes, until all sub-communication tasks have been executed; and in response to all sub-communication tasks having been executed, the second waiting flag F1 is modified into a waiting-end flag, so that the execution of the total communication task is completed.
  • each sub-communication task b1 and b2 is executed in the execution queues PQ, and whenever a sub-communication task b1 or b2 finishes executing, the second waiting flag F1 can be modified accordingly, for example by incrementing it by one. The number of times the second waiting flag F1 is modified equals the number of sub-communication tasks that have finished executing.
  • the second waiting flag F1 can initially be set with a target value; as the sub-communication tasks b1 and b2 finish executing, the second waiting flag F1 gradually approaches the target value, and when the second waiting flag F1 reaches the preset target value, all sub-communication tasks b1 and b2 have been executed. It should be understood that there can be many ways to modify the second waiting flag F1; it is not limited to "adding one" as described above. For example, one can be subtracted on each modification until the second waiting flag F1 is less than a predetermined threshold. The present disclosure does not place any limitation on how the second waiting flag is modified.
  • the second waiting flag F1 reaching the target value can also be understood as a waiting-end flag, which means that the current total communication task B has finished executing and other tasks can start to be executed.
  • the total communication task can be divided randomly into a plurality of sub-communication tasks; it can be divided into a fixed number of sub-communication tasks; it can be divided into a number of sub-communication tasks corresponding to the number of processors; or it can be divided according to the number of communication paths, and so on.
  • a total communication task in the task queue may be divided into a plurality of sub-communication tasks with equivalent execution time.
  • the above equivalence of execution time does not mean that each sub-communication task itself has the same size. For example, if the communication speed of each port is 40Gbps, then for 160G of data, transmitting it over a single communication path theoretically takes 4 seconds. The 160G of data can therefore be split into multiple sub-communication tasks, for example 2, 3 or 4. When it is divided into 4 sub-communication tasks, 4 communication paths can transmit the data in parallel; in theory, the 160G transfer then takes only 1 second, i.e. 25% of the original communication time. Obviously, this helps to shorten the execution time of the total communication task.
  • the communication paths among the multiple communication paths do not necessarily have the same transmission speed. Therefore, when dividing a communication task, the speed of each communication path can be taken into account to adjust the amount of data assigned to each sub-communication task.
  • if the transmission speeds of the four communication paths are 16Gbps, 18Gbps, 22Gbps and 24Gbps respectively, then 160G of data can be divided into four sub-communication tasks of 32G, 36G, 44G and 48G, so that each communication path completes its data transmission in 2 seconds, thereby ensuring that all communication paths finish their communication tasks at the same time or substantially at the same time.
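The arithmetic above generalizes to any set of path speeds: to make all paths finish simultaneously, each chunk is sized proportionally to its path's bandwidth. A minimal sketch (the function name is hypothetical, not from the disclosure):

```python
def split_by_speed(total_gb: float, speeds_gbps: list) -> list:
    """Size each sub-communication task proportionally to its path's speed,
    so that every path finishes its transfer at the same time."""
    total_speed = sum(speeds_gbps)
    return [total_gb * s / total_speed for s in speeds_gbps]

# The example above: 160G of data over paths of 16, 18, 22 and 24 Gbps.
chunks = split_by_speed(160, [16, 18, 22, 24])
# chunks == [32.0, 36.0, 44.0, 48.0]; each path takes 160 / (16+18+22+24) = 2 seconds
```

Every chunk then satisfies chunk/speed = total/sum(speeds), which is exactly the equal-finish-time condition.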
  • the total communication task may be divided into a plurality of sub-communication tasks. It should be understood that dividing the total communication task into multiple sub-communication tasks also requires considering the total amount of data involved in each task: if the total amount of data involved in a total communication task is small, the total communication task need not be divided.
  • the method of the present disclosure further includes: in response to an error in one or more sub-communication tasks, re-running the sub-communication task in which the error occurred.
  • the sub-communication task in which the error occurred is further divided into a plurality of sub-tasks for parallel execution.
  • the faulty sub-communication task can be added to the communication task queue LQ as a new total communication task, further divided into multiple sub-tasks, and re-executed in multiple parallel execution queues PQ. Dividing the faulty sub-communication task into multiple sub-tasks for re-execution further improves the operating efficiency of the system: even if an error occurs in the execution of one sub-communication task, the time and processing resources needed to correct the error are greatly reduced.
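This fault-tolerance idea — treat a failed sub-task as a new total task, split it again, and retry only the failed part — can be sketched as a recursive retry. All names are illustrative, and the sequential loop stands in for the disclosure's parallel execution queues:

```python
def execute_total_task(chunks, run_chunk, resplit, depth=0, max_depth=3):
    """Execute each chunk of a total task; a chunk that fails is re-enqueued as a
    new total task, split into smaller chunks, and retried. Only the failed
    chunk is redone, never the whole total task."""
    for chunk in chunks:          # the disclosure runs these in parallel queues;
        try:                      # a sequential loop keeps the sketch short
            run_chunk(chunk)
        except RuntimeError:
            if depth >= max_depth:
                raise             # give up after a few nested re-splits
            execute_total_task(resplit(chunk), run_chunk, resplit,
                               depth + 1, max_depth)
```

For example, with a link that can only move 10 units at a time, a 40-unit task fails, is re-split into two 20-unit tasks, each of which is re-split into two 10-unit tasks that succeed — all 40 units arrive without restarting the original transfer.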
  • when the amount of communication data is large, a communication task can be divided into any number of sub-tasks and delivered to multiple different communication task execution queues for concurrent execution, thereby increasing bandwidth utilization.
  • a communication logical topology can be flexibly constructed for data communication, and the communication efficiency can be further improved.
  • the accelerator card that delivers the task can be any accelerator card in the accelerator card system.
  • since the communication task queue only performs waiting and write operations, while the real communication tasks are executed in the communication task execution queues, the communication task queue can correspond to any accelerator card in the accelerator card system. This helps reduce the probability of programming errors by developers, and an accelerator card with a small task load can be selected to perform the waiting and write control of the communication task queue.
  • the present disclosure also provides an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein, when the computer-executable instructions are executed by the one or more processors , so that the electronic device executes the method as described above.
  • the present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the technical solutions of the present disclosure can be applied to the field of artificial intelligence, and are implemented as or in an artificial intelligence chip.
  • the chip can exist alone or can be included in a computing device.
  • FIG. 33 is a schematic structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 33 shows a combined processing device 3300 , which includes the aforementioned acceleration unit 3301 , an interconnection interface 3302 , other processing devices 3303 and a storage device 3304 .
  • the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses neural networks to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • other processing devices serve as the interface between the machine learning computing device and external data and control, performing basic control such as data transfer and starting or stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the interconnect interface is used to transfer data and control instructions between computing devices (including, for example, machine learning computing devices) and other processing devices.
  • the computing device obtains the required input data from other processing devices and writes it into the on-chip storage device of the computing device; it can obtain control instructions from other processing devices and write them into an on-chip control cache; it can also read data from the storage module of the computing device and transmit it to other processing devices.
  • the structure may further include a storage device 2608, and the storage device is respectively connected to the computing device and the other processing device.
  • the storage device is used to save the data in the computing device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, improving the processing speed and reducing the overall power consumption.
  • the interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above chip package structure.
  • a board 3400 is provided.
  • the above board 3400 may also include other supporting components, including but not limited to: a storage device 3401, an interface device 3407, Control device 3405 and acceleration unit 3406.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the memory device may include groups of memory cells 3402. Each group of the memory cells is connected to the chip through a bus. It can be understood that each group of the storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • DDR doubles the speed of SDRAM without needing to increase the clock frequency: DDR allows data to be read on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM.
  • the storage device may include four groups of the storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit controller are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 3408 (eg, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface, and the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or device, which includes the above board.
  • electronic equipment or devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection of devices or units, which may be electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A method for executing an asynchronous task and a device. The method can be implemented in a computing apparatus. The computing apparatus can be comprised in a combined processing apparatus. The combined processing apparatus can also comprise a universal interconnection interface and another processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete a computing operation specified by a user. The combined processing apparatus can also comprise a storage apparatus, which is separately connected to the computing apparatus and the other processing apparatus and is used for storing data of the computing apparatus and the other processing apparatus.

Description

A Method, Device and Computer Program Product for Executing Asynchronous Tasks

Cross-Reference to Related Applications

This application claims priority to the Chinese patent application filed on December 30, 2020 with application number 2020116106707, entitled "A Method, Device and Computer Program Product for Executing Asynchronous Tasks", and to the Chinese patent application filed on January 15, 2021 with application number 2021100550976, entitled "A Method and Device for Executing Communication Tasks in an Accelerator Card System".
Technical Field

The present disclosure relates to the field of computers, and more particularly, to serial and parallel execution of tasks.

Background

In the current deep network training process, in order to accelerate the convergence of network training, some or even all of the training tasks (including computing tasks, communication tasks, control logic tasks, etc.) are usually delivered to dedicated acceleration chips (such as GPU, MLU, TPU, etc.) for execution.

The network training tasks are delivered asynchronously by the CPU to an accelerator card for execution. The accelerator card has the concept of a task queue: tasks in the same queue are executed in the order in which they are issued, so tasks in the same queue have dependencies, while tasks in different queues can execute concurrently depending on the availability of hardware resources. However, current training tasks are usually delivered to only one queue for execution, which inevitably affects execution efficiency.

Current mainstream frameworks (such as Tensorflow and Pytorch) use only one dedicated communication queue (comm_queue) to execute communication tasks. When the communication library responsible for communication tasks obtains a task, it usually delivers the task directly to the framework's comm_queue or to the communication library's internal task queue (internal_queue) for execution; an example is NCCL, the communication library responsible for communication between GPUs. At present, communication tasks are all executed in one queue, and when an error occurs in a communication task, the task must be re-executed from the beginning, which reduces overall communication efficiency. In addition, the prior art cannot perform fault tolerance or retransmission of communication tasks without the user being aware of it.
Summary of the Invention

An object of the present disclosure (application No. 2020116106707) is to overcome the defects in the prior art that communication or computing resources cannot be fully utilized and that fault tolerance is low.

According to a first aspect of the present disclosure, there is provided a method for executing an asynchronous task, comprising: dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; executing the plurality of sub-tasks in parallel; and, in response to the completion of the sub-tasks, causing the total task to be completed.

According to a second aspect of the present disclosure, there is provided an apparatus for executing an asynchronous task, comprising: a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; a sub-task execution unit configured to execute the plurality of sub-tasks in parallel; and an ending unit configured to, in response to the completion of the sub-tasks, cause the total task to be completed.

According to a third aspect of the present disclosure, there is provided a chip comprising the apparatus as described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the chip as described above.

According to a fifth aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a memory storing computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.

According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.

The technical solution of the present disclosure can distribute a total task into different sub-task queues, thereby accelerating the execution of the total task. In addition, even if an error occurs in the execution of one sub-task queue, there is no need to re-execute all the sub-tasks, thereby reducing the cost of fault tolerance or retransmission, lightening the burden of task execution, and enabling fault tolerance or retransmission to be handled without the user being aware of it.
2021100550976本公开的一个目的是克服现有技术中不能将一个任务下发到多个队列中并行执行的缺陷,不能充分利用通信或运算资源,并且容错能力较低的缺陷。2021100550976 One purpose of the present disclosure is to overcome the defects in the prior art that a task cannot be issued to multiple queues for parallel execution, communication or computing resources cannot be fully utilized, and fault tolerance is low.
According to a first aspect of the present disclosure, there is provided a method for executing a communication task in an accelerator card system, wherein the accelerator card system comprises a plurality of accelerator cards capable of communicating with one another, one of the plurality of accelerator cards being able to communicate with another accelerator card through a communication path. The method comprises: establishing a communication task queue, the communication task queue comprising a communication task and a state identifier for monitoring the execution state of the communication task; establishing a communication task execution queue for executing communication tasks between accelerator cards through the communication path; and, in response to the execution of the communication task, changing the state identifier so as to monitor the execution state of the communication task.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory storing computer-executable instructions which, when executed by the one or more processors, cause the electronic device to perform the method described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method described above.
At least one beneficial effect of the technical solution of the present disclosure is that distinguishing the communication task queue from the communication task execution queue allows fault-tolerance or retransmission operations to be performed without the user being aware of them. The technical solution of the present disclosure can also distribute a total communication task across different sub-communication-task queues, thereby accelerating the execution of the total communication task. Moreover, even if an error occurs in the execution of one sub-communication-task queue, there is no need to re-execute all the sub-communication tasks, which lightens the burden of task execution.
Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts, in which:
Fig. 1a is a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure;
Fig. 1b is a schematic diagram of a task dispatch queue and a task execution queue according to an embodiment of the present disclosure;
Fig. 2a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure;
Fig. 2b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of queues according to another embodiment of the present disclosure;
Fig. 4 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure;
Fig. 6 shows a combined processing apparatus;
Fig. 7 shows an exemplary board card;
Fig. 8a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure;
Figs. 8b, 9, 10, 11, and 12a-12c are schematic structural diagrams of acceleration units according to embodiments of the present disclosure;
Figs. 13-18 are schematic structural diagrams of acceleration assemblies according to embodiments of the present disclosure;
Figs. 19a-19c are schematic diagrams of acceleration assemblies represented as network topologies;
Fig. 20 is a schematic diagram of an acceleration apparatus including a plurality of acceleration units according to an embodiment of the present disclosure;
Fig. 21 is a schematic diagram of the network topology corresponding to an acceleration apparatus in one embodiment;
Fig. 22 is a schematic diagram of the network topology corresponding to an acceleration apparatus in another embodiment;
Figs. 23-27 are schematic diagrams of an acceleration apparatus including a plurality of acceleration assemblies according to embodiments of the present disclosure;
Fig. 28 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 29 is a schematic diagram of a matrix network topology based on wireless extension of the acceleration apparatus;
Fig. 30 is a schematic diagram of an acceleration apparatus in yet another embodiment of the present disclosure;
Fig. 31 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 32 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 33 is a schematic structural diagram of a combined apparatus in an embodiment of the present disclosure;
Fig. 34 is a schematic structural diagram of a board card in an embodiment of the present disclosure;
Fig. 35 is a flowchart of a method for executing a communication task in an accelerator card system according to an embodiment of the present disclosure;
Fig. 36a is a flowchart of a method for executing a communication task according to an embodiment of the present disclosure;
Fig. 36b is a schematic diagram of a task dispatch queue and a communication task execution queue according to an embodiment of the present disclosure;
Fig. 37a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure;
Fig. 37b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure;
Fig. 38 is a schematic diagram of queues according to another embodiment of the present disclosure; and
Fig. 39 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, the description, and the drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiments of the present disclosure have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present disclosure; the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. Meanwhile, changes or modifications made by those skilled in the art, based on the idea of the present disclosure, to the specific implementations and the scope of application all belong to the protection scope of the present disclosure. In summary, the content of this description should not be construed as limiting the present disclosure.
Current mainstream frameworks (such as TensorFlow and PyTorch) use only a single dedicated communication queue (comm_queue) to execute communication tasks. When the communication library responsible for communication tasks obtains a task, it usually issues the task directly into the framework's comm_queue or into the library's internal task queue (internal_queue) for execution; an example is NCCL, the communication library responsible for communication between GPUs. At present, communication tasks are all executed in a single queue, and when an error occurs in a communication task, the task has to be re-executed from the beginning, which reduces overall communication efficiency.
In the present disclosure, communication and computation tasks are usually issued as asynchronous tasks to different task queues on an acceleration chip (such as a GPU or MLU) for execution. Asynchronous tasks in the same queue are executed serially in the order in which they were issued, while tasks in different queues can be executed concurrently.
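The queue semantics just described, serial within one queue and concurrent across queues, can be sketched on the host side with ordinary threads. The following is a minimal illustration only: the names TaskQueue, issue, and sync are hypothetical, and a real accelerator runtime would dispatch work to hardware queues rather than host threads.

```python
import queue
import threading

class TaskQueue:
    """One asynchronous task queue: tasks run serially, in issue order.
    Separate TaskQueue instances run concurrently with each other."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn = self.q.get()
            fn()                 # serial: the next task starts only after this returns
            self.q.task_done()

    def issue(self, fn):
        self.q.put(fn)           # asynchronous issue: returns immediately

    def sync(self):
        self.q.join()            # block until every issued task has finished

log = []
q1, q2 = TaskQueue(), TaskQueue()
q1.issue(lambda: log.append("q1-a"))
q1.issue(lambda: log.append("q1-b"))
q2.issue(lambda: log.append("q2-a"))
q1.sync()
q2.sync()
# Within q1, "q1-a" always precedes "q1-b"; "q2-a" may interleave freely.
```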
It should be understood that although a communication task is used as an example above, the tasks herein are not limited to communication tasks and also involve various tasks such as the computation or training of a neural network.
Fig. 1a is a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure; Fig. 1b is a schematic diagram of a task dispatch queue and a task execution queue according to an embodiment of the present disclosure.
As shown in Fig. 1a, the method of the present disclosure comprises: in operation S110, dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue; in operation S120, executing the plurality of sub-tasks in parallel; and in operation S130, in response to completion of the sub-tasks, completing the execution of the total task.
The above method is described in detail below with reference to Fig. 1b.
Fig. 1b includes two types of queues, namely a task allocation queue LQ and a task execution queue PQ. The task allocation queue can receive multiple tasks, for example tasks A, B, and C. When tasks A, B, and C enter the task allocation queue LQ, they are chained together serially, with the execution order A, B, C. That is, while task A is executing, tasks B and C must wait; task B can execute only after task A has finished; and task C must wait until task B has finished before it can execute. Such an execution scheme cannot make full use of the system's parallel resources: in particular, when one task takes especially long to execute or involves an especially large amount of communication data, the execution of the other tasks is clearly blocked, and system performance suffers.
A task in the task allocation queue LQ can be regarded as a total task, which is divided into multiple sub-tasks executed in parallel and placed in the task execution queues PQ for execution. When one total task is divided into multiple sub-tasks executed in parallel, the execution efficiency of the task can be improved significantly.
In the present disclosure, taking total task B as an example, total task B may be divided into multiple sub-tasks b1, b2, and so on; two sub-tasks b1 and b2 are used here for illustration. It should be noted that the number of sub-tasks may differ, depending on the capability to execute the sub-tasks and/or the size of the total task. For example, if the capability to execute each sub-task is strong, the total task may be divided into fewer sub-tasks; and, for the same execution capability, a larger total task may be divided into more sub-tasks.
After total task B is divided into sub-tasks b1 and b2, and these sub-tasks are placed in different execution queues PQ1 and PQ2 respectively, the two sub-tasks b1 and b2 can be executed in parallel in the execution queues PQ1 and PQ2.
The execution of total task B and its sub-tasks must satisfy the following rules: 1. When total task B has not yet started executing, sub-tasks b1 and b2 must also be in the not-started state; 2. When total task B starts executing, sub-tasks b1 and b2 must also start executing; 3. The other tasks in the task allocation queue LQ that follow task B (for example task C) must wait until task B has finished before they can execute; 4. When sub-tasks b1 and b2 have all finished executing, total task B must also be regarded as finished.
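The four rules above can be sketched with host-side synchronization primitives. This is a simplified model under the assumption that sub-tasks run as threads; the class and method names are hypothetical, and in the disclosure the gating is done with identifiers in accelerator queues rather than thread events.

```python
import threading

class TotalTask:
    """A total task B whose sub-tasks obey the four rules above (host-side sketch)."""
    def __init__(self, sub_task_fns):
        self.start_gate = threading.Event()   # rules 1 and 2: gate for b1, b2, ...
        self.done = threading.Event()         # rule 3: later tasks wait on this
        self.remaining = len(sub_task_fns)
        self.lock = threading.Lock()
        for fn in sub_task_fns:
            threading.Thread(target=self._run, args=(fn,)).start()

    def _run(self, fn):
        self.start_gate.wait()                # rule 1: idle while B has not started
        fn()
        with self.lock:                       # rule 4: B ends when all sub-tasks end
            self.remaining -= 1
            if self.remaining == 0:
                self.done.set()

    def start(self):
        self.start_gate.set()                 # rule 2: starting B starts every sub-task

    def wait(self):
        self.done.wait()                      # rule 3: a following task C blocks here

results = []
task_b = TotalTask([lambda: results.append("b1"), lambda: results.append("b2")])
task_b.start()
task_b.wait()                                 # returns once b1 and b2 both finished
```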
Fig. 2a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure.
Thus, according to an embodiment of the present disclosure, dividing a total task in the task queue into a plurality of sub-tasks (S110) comprises: in operation S1110, inserting into the queue a first write identifier that allows the total task to start executing; in operation S1120, inserting into the sub-task queues a first waiting identifier that prevents the sub-tasks from starting execution; and, in operation S1130, when the first write identifier has not been executed, executing the first waiting identifier so as to prevent the sub-tasks from starting execution.
Fig. 2b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure. The specific implementation of Fig. 2a is described in detail below with reference to Fig. 2b.
First, in order to control the execution of task B, a write identifier, exemplarily denoted F0, needs to be inserted before the task to be executed. Only when execution reaches the write identifier F0, or when the write identifier F0 is changed to allow the following task to execute, does the subsequent task B start executing. If execution has not reached the write identifier F0, the corresponding task does not start. The write identifier can be inserted by an atomic operation. An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once started, it runs to completion without any context switch in between.
Correspondingly, a waiting identifier f0 may be inserted before each sub-task; the waiting identifier prevents the sub-tasks that follow it from executing. It should be understood that although the first write identifier F0 and the waiting identifier f0 in Fig. 2b have different names, they point to the same identifier, so that it can be detected whether that same identifier has changed.
According to an embodiment of the present disclosure, executing the plurality of sub-tasks in parallel comprises: in response to the first write identifier being executed, releasing the first waiting identifier, so that the plurality of sub-tasks are executed in parallel.
The write identifier F0 before the total task and the waiting identifier f0 before each sub-task are associated with each other. Only when the write identifier F0 allows the subsequent total task to execute does the waiting identifier f0 terminate, allowing the corresponding sub-tasks to start running; if the write identifier F0 does not allow the subsequent total task to execute, the waiting identifier f0 keeps the execution of the sub-tasks in a waiting state.
Fig. 3 is a schematic diagram of queues according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, a second waiting identifier may be inserted into the total task queue so as to prevent the execution of other tasks after the total task.
As shown in Fig. 3, in the total task queue, a second waiting identifier may be inserted after the first write identifier. When execution reaches this second waiting identifier, the other total tasks after the current total task must be in a waiting state; before the current total task has finished executing, the other total tasks cannot start.
From the above description it can be seen that when execution in the allocation queue reaches the first write identifier F0, the total task B corresponding to the first write identifier F0 starts executing, i.e., the sub-tasks b1 and b2 of total task B leave the waiting state and start executing. Thereafter, when execution reaches the second waiting identifier F1 in the allocation queue, the other tasks after total task B in the allocation queue enter the waiting state and are not executed while total task B is being executed.
Fig. 4 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, each time one sub-task finishes executing, the second waiting identifier F1 is modified, until all sub-tasks have finished executing; and, in response to all sub-tasks having finished executing, the second waiting identifier F1 is changed into a wait-end identifier, so that the execution of the total task is completed.
Next, as shown in Fig. 4, the sub-tasks b1 and b2 start executing in the execution queues PQ. Each time a sub-task b1 or b2 finishes, the second waiting identifier F1 can be modified accordingly, for example by incrementing it by one. The number of times the second waiting identifier F1 is modified equals the number of sub-tasks that have finished executing. Therefore, the second waiting identifier F1 can initially be assigned a target value; as sub-tasks b1 and b2 finish executing, the second waiting identifier F1 gradually approaches this target value, and when it reaches the preset target value, all sub-tasks b1 and b2 have finished executing. It should be understood that there are many possible ways to modify the second waiting identifier F1, not limited to "incrementing by one" as described above; for example, it could be decremented by one on each modification until it falls below a predetermined threshold. The present disclosure places no limitation on how the second waiting identifier is modified.
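The counting scheme for the second waiting identifier F1 can be sketched as follows. This is a hypothetical host-side model: the WaitIdentifier class is illustrative only, and in the disclosure F1 lives in the task allocation queue and is modified as accelerator-side sub-tasks complete.

```python
import threading

class WaitIdentifier:
    """Second waiting identifier F1: modified once per finished sub-task and
    released when it reaches a preset target value."""
    def __init__(self, target):
        self.target = target
        self.value = 0
        self.cond = threading.Condition()

    def modify(self):
        with self.cond:
            self.value += 1                  # "increment by one"; a decrement-to-zero
            if self.value >= self.target:    # scheme would work just as well
                self.cond.notify_all()

    def wait_end(self):
        with self.cond:
            while self.value < self.target:
                self.cond.wait()             # tasks after B stay blocked here

f1 = WaitIdentifier(target=2)                # total task B was split into 2 sub-tasks
done = []

def sub_task(name):
    done.append(name)                        # the sub-task's actual work
    f1.modify()                              # finished: modify F1

for name in ("b1", "b2"):
    threading.Thread(target=sub_task, args=(name,)).start()

f1.wait_end()                                # returns only after both sub-tasks finish
```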
"The second waiting identifier F1 reaching the target value" described above can also be understood as a wait-end identifier: it means that the current total task B has finished executing and other tasks can start.
When dividing the total task into multiple sub-tasks, many division schemes are possible: the total task can be divided into multiple sub-tasks at random; the total task can be divided into a fixed number of sub-tasks; the total task can be divided into a number of sub-tasks corresponding to the number of processors serving the execution queues PQ; and so on.
According to a preferred embodiment of the present disclosure, a total task in the task queue may be divided into multiple sub-tasks whose execution times are equivalent.
The execution-time equivalence described above does not mean that each sub-task itself is of the same size. For example, for 100M of computation data with 4 processing cores participating in the computation, in theory each processing core can take on 25M of the computation, so that all 4 processing cores finish in the same time, minimizing the total computation time. However, if a certain processing core also participates in other computation work so that its processing capability is lower than that of the other cores, then the respective processing capabilities of the 4 cores should be taken into account when assigning the corresponding sub-tasks, so that each core finishes its computation in the same or roughly the same time; this helps shorten the overall running time of the total task. Therefore, the principle for dividing the total task into multiple sub-tasks is to divide according to the capability of the resources executing the tasks, so that the multiple resources are equivalent in processing time.
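The capability-proportional division described above can be sketched as a simple allocation function. The numbers follow the 100M/4-core example in the text; the function name and the rounding policy (give any remainder to the last core) are assumptions made for illustration.

```python
def split_by_capability(total, capabilities):
    """Split `total` units of work in proportion to each core's capability,
    so that every core takes approximately the same time:
    share_i = total * c_i / sum(c)."""
    s = sum(capabilities)
    shares = [total * c // s for c in capabilities]
    shares[-1] += total - sum(shares)   # hand any rounding remainder to the last core
    return shares

# Equal cores: 100M is split evenly, 25M each.
print(split_by_capability(100, [1, 1, 1, 1]))   # [25, 25, 25, 25]
# One core is busy with other work (half the capability of the others):
print(split_by_capability(100, [2, 2, 2, 1]))   # [28, 28, 28, 16]
```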
According to an embodiment of the present disclosure, the total task is divided into multiple sub-tasks in response to the data volume of the total task exceeding a certain threshold. It should be understood that dividing the total task into multiple sub-tasks also requires considering the total amount of data involved in each task. If the total amount of data involved in a task is small, and the processing time of the total task is already less than the time needed to transfer out the data produced by executing it, there is no need to divide the total task. Similarly, if the time needed to read the data required by the total task constitutes the bottleneck, i.e., the time to read the data is greater than the time to execute the total task, there is likewise no need to further divide the total task.
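A possible decision rule following this paragraph can be sketched as below. The function name plan and the concrete thresholds are purely hypothetical; the sketch only restates the two criteria from the text, namely that the data volume exceeds a threshold and that computation, not I/O, is the bottleneck.

```python
def plan(total_bytes, threshold_bytes, compute_time, read_time, write_time, n_queues):
    """Decide how many execution queues to use (hypothetical policy):
    divide only if the task is large AND compute-bound."""
    if total_bytes <= threshold_bytes:
        return 1              # small task: a single queue, no division
    if read_time >= compute_time or write_time >= compute_time:
        return 1              # I/O-bound: dividing the compute buys nothing
    return n_queues           # large and compute-bound: divide across the queues

# 200M compute-bound task, 64M threshold, 4 queues available: divide.
print(plan(200 << 20, 64 << 20, compute_time=8.0, read_time=1.0, write_time=1.5, n_queues=4))  # 4
# 16M task below the threshold: keep it whole.
print(plan(16 << 20, 64 << 20, compute_time=8.0, read_time=1.0, write_time=1.5, n_queues=4))   # 1
```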
According to an embodiment of the present disclosure, the method of the present disclosure further comprises: in response to an error occurring in one or more sub-tasks, re-running the sub-task in which the error occurred.
When multiple sub-tasks are executed in the execution queues PQ, errors may occur, for example errors in computation results, errors in data throughput, errors in data transmission, and so on. In the traditional scheme, where the total task is not divided into multiple sub-tasks, once an error occurs during execution the entire total task has to be re-executed, which seriously wastes processing capacity and degrades the overall performance of the system.
In the solution of the present disclosure, since the multiple sub-tasks are placed in different execution queues that run independently and do not interfere with one another, an error in one sub-task during execution does not affect the execution of the other sub-tasks. Therefore, if an error occurs in the execution of one sub-task, only that sub-task needs to be re-run, without re-running all of the sub-tasks or the total task as a whole. While the failed sub-task is re-running, the other queues may be idle or may execute other sub-tasks at the same time. Therefore, dividing a total task into multiple parallel sub-tasks in the present disclosure improves the utilization of the system's processing resources and increases processing efficiency.
According to an embodiment of the present disclosure, in response to an error occurring in one or more sub-tasks, the sub-task in which the error occurred is further split into multiple child tasks for parallel execution.
When an error occurs in a sub-task and it needs to be re-executed, the failed sub-task can be added to the task allocation queue LQ as a new total task, further divided into multiple child tasks, and re-executed once in multiple parallel execution queues PQ. Further dividing a failed sub-task into multiple child tasks for re-execution further improves the running efficiency of the system, so that even if an error occurs in the execution of a sub-task, the time and processing resources spent correcting the error are greatly reduced.
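The retry behavior described here, re-splitting only the failed sub-task, can be sketched as follows. The workload, the transient-failure model, and all names are hypothetical, and the sub-tasks run serially here for clarity rather than in parallel queues.

```python
def execute(total_task, run_sub_task, split):
    """Run a total task as sub-tasks. If one sub-task fails, only that sub-task
    is treated as a new total task and re-executed as further-split child tasks;
    sub-tasks that already succeeded are not re-executed."""
    results = []
    for sub in split(total_task):
        try:
            results.append(run_sub_task(sub))
        except RuntimeError:
            results.extend(run_sub_task(child) for child in split(sub))
    return results

def split(task):
    mid = len(task) // 2
    return [task[:mid], task[mid:]] if mid else [task]

# Hypothetical workload: a task is a list of numbers, "executing" it sums them;
# the first chunk containing 13 fails once with a transient error.
state = {"failed": False}
def run_sub_task(chunk):
    if 13 in chunk and not state["failed"]:
        state["failed"] = True
        raise RuntimeError("transient error")
    return sum(chunk)

result = execute([1, 2, 13, 4], run_sub_task, split)
print(result)  # [3, 13, 4]: [13, 4] failed and was re-run as child tasks [13] and [4]
```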
The above tasks can be of many kinds, for example computation tasks, multiplication tasks, convolution computation tasks, weight computation tasks, communication tasks, and so on. Therefore, depending on the task, the allocation queue may be a communication queue used in a deep learning framework, for example the dedicated communication queue (Comm_queue) used in TensorFlow or PyTorch, and the execution queue may be an execution queue in a communication library, for example the internal execution queue (Internal_queue) in the NCCL communication library.
Fig. 5 is a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure. The apparatus comprises: a dividing unit M510 configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue; a sub-task execution unit M520 configured to execute the plurality of sub-tasks in parallel; and an ending unit M530 configured to, in response to completion of the sub-tasks, complete the execution of the total task.
The present disclosure also provides a chip including the apparatus shown in FIG. 5.
The present disclosure also provides an electronic device including the chip described above.
The present disclosure also provides an electronic device including: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method described above.
The technical solutions of the present disclosure can be applied in the field of artificial intelligence and implemented as, or in, an artificial intelligence chip. The chip may exist on its own or be included in a computing device.
FIG. 6 shows a combined processing device 600, which includes the above-described computing device 602, a general interconnection interface 604, and other processing devices 606. The computing device according to the present disclosure interacts with the other processing devices to jointly complete an operation specified by the user. FIG. 6 is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfers and basic control such as starting and stopping the machine learning computing device; the other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
The general interconnection interface is used to transfer data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it to the on-chip storage of the computing device; it can obtain control instructions from the other processing devices and write them to an on-chip control cache; it can also read data from a storage module of the computing device and transmit it to the other processing devices.
Optionally, the structure may further include a storage device 608 connected to both the computing device and the other processing devices. The storage device is used to hold data of the computing device and the other processing devices, and is especially suitable for data that cannot be fully held in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as an SoC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
In some embodiments, the present disclosure also discloses a chip package structure including the above chip.
In some embodiments, the present disclosure also discloses a board card including the above chip package structure. Referring to FIG. 7, an exemplary board card is provided. In addition to the above chip 702, the board card may include other supporting components, including but not limited to: a storage device 704, an interface device 706, and a control device 708.
The storage device is connected via a bus to the chip in the chip package structure and is used to store data. The storage device may include multiple groups of storage units 710, each group connected to the chip via a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR doubles the speed of SDRAM without raising the clock frequency: it allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, each group including multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used to transfer data and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transfer and data storage of each storage unit.
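The arithmetic implied above can be checked with a short helper: DDR transfers on both clock edges, so the effective transfer rate is twice the clock, and of each 72-bit DDR4 word only the 64 data bits count toward usable bandwidth (the 8 ECC bits do not). This is generic DDR arithmetic, not a figure taken from the disclosure:

```python
def ddr_bandwidth_gbs(clock_mhz, data_bits=64):
    """Effective DDR data bandwidth in GB/s for one controller:
    two transfers per clock cycle, counting only the data bits of
    the 72-bit word (64 data + 8 ECC). Illustrative arithmetic."""
    transfers_per_s = clock_mhz * 1e6 * 2   # double data rate
    return transfers_per_s * data_bits / 8 / 1e9
```

For example, a 1600 MHz clock with a 64-bit data path yields 25.6 GB/s per controller.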
The interface device is electrically connected to the chip in the chip package structure and is used to implement data transfer between the chip and an external device 712 (for example, a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface, with data to be processed transferred from the server to the chip through the standard PCIe interface. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can perform the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface device.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, it can be in different working states such as heavy-load and light-load. The control device can regulate the working states of the multiple processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus that includes the above board card.
The electronic device or apparatus includes a data processing device, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, dash camera, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headset, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an airplane, ship, and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, or range hood; the medical device includes a nuclear magnetic resonance scanner, B-mode ultrasound scanner, and/or electrocardiograph.
It should be noted that, for brevity, the foregoing method embodiments are all described as series of combined actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or concurrently. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a logical functional division, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software program modules.
If implemented in the form of a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product: the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure. The description of the above embodiments is intended only to help understand the methods of the present disclosure and their core ideas. Meanwhile, persons of ordinary skill in the art, based on the ideas of the present disclosure, may make changes in the specific implementations and scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.
The technical solutions in the embodiments of the present disclosure will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 35 shows a method for executing a communication task in an accelerator card system according to one embodiment of the present disclosure, where the accelerator card system includes multiple accelerator cards capable of communicating with each other, and one of the multiple accelerator cards can communicate with another accelerator card through a communication path. The method includes: in operation S3510, establishing a communication task queue that includes a communication task and a status identifier used to monitor the execution state of the communication task; in operation S3520, establishing a communication task execution queue used to execute communication tasks between accelerator cards through a communication path; and in operation S3530, in response to the execution of the communication task, changing the status identifier so as to monitor the execution state of the communication task.
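Operations S3510 to S3530 can be sketched as follows. The `Status` enumeration and the `run` callable (standing in for the actual inter-card transfer over a communication path) are illustrative assumptions of this sketch:

```python
from enum import Enum

class Status(Enum):
    PENDING = 0
    RUNNING = 1
    DONE = 2

def monitor_comm_task(task_queue, exec_queue, run):
    """Sketch of S3510-S3530: the communication task queue holds
    (task, status) entries; tasks are placed on an execution queue, and
    the status identifier is changed as execution progresses so the
    task's state can be monitored. `run` is a hypothetical stand-in
    for the inter-card transfer itself."""
    history = []
    for entry in task_queue:
        entry['status'] = Status.RUNNING   # S3530: change on start
        exec_queue.append(entry['task'])   # S3520: hand off for execution
        run(entry['task'])                 # execute over a communication path
        entry['status'] = Status.DONE      # S3530: change on completion
        history.append(entry['status'])
    return history
```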
First, various embodiments of the accelerator card system are described in detail below with reference to the accompanying drawings. The accelerator card system herein is composed of multiple accelerator cards capable of communicating with each other. These accelerator cards can be communicably connected through different communication paths, so that traffic from one accelerator card to another can travel along different communication paths, forming different communication topologies. It should be understood that "connection" below always means a communicable connection, i.e., the accelerator cards can communicate and transfer data with each other.
In addition, the accelerator card system described above may be formed as an acceleration unit, an acceleration assembly, an acceleration apparatus, or the like. It should be understood that although different terms are used depending on the specific scenario, they are all essentially systems that include multiple accelerator cards.
FIG. 8a is a schematic structural diagram of an acceleration unit in one embodiment of the present disclosure. According to one embodiment, the accelerator card system may include an acceleration unit, which may include M local-unit accelerator cards, each including internal ports through which it is connected to the other local-unit accelerator cards, where the M local-unit accelerator cards are logically formed into an accelerator card matrix of L*N scale, L and N being integers not less than 2.
As shown in FIG. 8a, an accelerator card matrix can be formed from multiple accelerator cards, which are interconnected so that data or instructions can be transferred and communicated. For example, accelerator cards MC00 to MC0N form row 0 of the accelerator card matrix, accelerator cards MC10 to MC1N form row 1, and so on, with accelerator cards MCL0 to MCLN forming row L.
It should be understood that, for ease of understanding in this context, accelerator cards in the same acceleration unit are called "local-unit accelerator cards", while accelerator cards in other acceleration units are called "external-unit accelerator cards". These names are merely for convenience of description and do not limit the technical solutions of the present disclosure.
Each accelerator card may have multiple ports, which can be connected either to local-unit accelerator cards or to external-unit accelerator cards. In the present disclosure, the ports connecting local-unit accelerator cards to each other may be called internal ports, and the ports connecting a local-unit accelerator card to an external-unit accelerator card may be called external ports. It should be understood that "external port" and "internal port" are merely convenient labels, and the same physical port can serve as either. This is described below.
It should be understood that M may be any integer; the M accelerator cards may be formed into a 1*M or M*1 matrix, or into matrices of other shapes. The acceleration unit of the present disclosure is not limited to a specific matrix size or form.
Furthermore, accelerator cards (for example, local-unit accelerator cards among themselves, or a local-unit accelerator card and an external-unit accelerator card) can be connected through a single communication path or through multiple communication paths. This is described in detail later.
It should also be understood that, in the context of the present disclosure, although the positions of multiple accelerator cards are described in terms of a rectangular network, the matrix so formed is not necessarily a matrix in its physical arrangement; the cards can be in any physical position. For example, multiple accelerator cards may form a straight line or may be arranged irregularly. The matrix above is purely logical, as long as the connections between the accelerator cards form a matrix relationship.
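Since the matrix is purely logical, the row/column arrangement can be expressed as a simple index mapping, independent of where each card sits physically. This is a sketch; the naming convention follows the MC&lt;row&gt;&lt;col&gt; labels used above:

```python
def card_position(index, n_cols):
    """Map a flat accelerator-card index to its logical (row, col)
    position in an L*N matrix. Only the connection relationships, not
    the physical placement, need to follow this layout."""
    return divmod(index, n_cols)
```

For a 2*4 matrix (n_cols=4), card 5 occupies logical position (1, 1), i.e. it plays the role of MC11 regardless of its physical location.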
According to one embodiment of the present disclosure, M may be 4, so that four local-unit accelerator cards logically form a 2*2 accelerator card matrix; M may be 9, so that nine local-unit accelerator cards logically form a 3*3 matrix; M may be 16, so that sixteen local-unit accelerator cards logically form a 4*4 matrix. M may also be 6, so that six local-unit accelerator cards logically form a 2*3 or 3*2 matrix; M may also be 8, so that eight local-unit accelerator cards logically form a 2*4 or 4*2 matrix.
According to one embodiment of the present disclosure, each local-unit accelerator card is connected to at least one other local-unit accelerator card through two paths.
In the topologies described in the present disclosure, two local-unit accelerator cards may be connected through a single communication path or through multiple (for example, two) paths, as long as the number of ports is sufficient. Connecting through multiple communication paths helps ensure the reliability of communication between the accelerator cards and helps form different topologies. This is explained and described in more detail in the examples below.
According to one embodiment of the present disclosure, the diagonal local-unit accelerator cards at the four corners of the accelerator card matrix are connected through two paths. For a matrix, the two pairs of accelerator cards on the matrix diagonals are preferably connected; for certain topologies, connecting the diagonally positioned accelerator cards helps form two complete communication loops. This is explained and described in more detail in the examples below.
More specifically, according to one embodiment of the present disclosure, at least one of the local-unit accelerator cards may include an external port. For example, each acceleration unit may include four local-unit accelerator cards, each with six ports, of which four are internal ports used to connect to the other three local-unit accelerator cards; the remaining two ports of at least one local-unit accelerator card are external ports used to connect to external-unit accelerator cards.
It should be understood that, of the six ports of each local-unit accelerator card, four can be used to connect to the other local-unit accelerator cards, while the two spare ports can be used to connect to accelerator cards in other acceleration units. These spare ports may also remain idle, connected to no external device, or may be directly or indirectly connected to other devices or ports.
For purposes of example and simplification, the acceleration units, acceleration assemblies, acceleration apparatuses, and electronic devices below are all described taking an acceleration unit that includes four accelerator cards as an example. It should be understood that each acceleration unit may include a greater or smaller number of accelerator cards.
For ease of description, the acceleration unit may include four accelerator cards, namely a first accelerator card, a second accelerator card, a third accelerator card, and a fourth accelerator card, each provided with internal ports and external ports, and each connected to the other three accelerator cards through its internal ports.
FIG. 8b is a schematic structural diagram of an acceleration unit in one embodiment of the present disclosure. The acceleration unit 800 includes four accelerator cards: MC0, MC1, MC2, and MC3. Each of the four accelerator cards may include external ports and internal ports. The internal ports of accelerator card MC0 are connected to the internal ports of accelerator cards MC1, MC2, and MC3; the internal ports of MC1 are connected to those of MC2 and MC3; and the internal ports of MC2 are connected to those of MC3. That is, the internal ports of each accelerator card are connected to the internal ports of the other three accelerator cards. Interconnecting the internal ports of the four accelerator cards enables information exchange among them. By using the interconnection among the four accelerator cards in the acceleration unit, the embodiments of the present disclosure can improve the computing capability of the acceleration unit, achieve high-speed processing of massive data, and make the path between each accelerator card and every other accelerator card the shortest, with the lowest communication latency.
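The fully connected interconnect of MC0 through MC3 can be modeled as a complete graph, in which every pair of cards shares a direct link and is therefore one hop apart. This is an illustrative model of the connectivity, not the hardware wiring:

```python
import itertools

def build_full_mesh(n):
    """Fully connected topology of n accelerator cards: every card's
    internal ports link it to every other card, so a complete graph on
    n nodes has n*(n-1)/2 links and every pair is one hop apart."""
    return {frozenset(pair) for pair in itertools.combinations(range(n), 2)}

links = build_full_mesh(4)  # the MC0..MC3 unit: 6 direct links
```

With four cards this yields six links, matching the pairwise connections enumerated above, which is why the path between any two cards is the shortest possible (a single hop).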
如上文所述,本公开中的加速卡的数量可以不限于四个,而是可以为其他数量。例如,在一个实施例中,加速卡的数目N等于3,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他两个加速卡相连接,实现三个加速卡间的互连。在另一个实施例中,加速卡的数目N等于5,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他四个加速卡相连接,实现五个加速卡间的互连,从而提高加速单元的计算能力,并且实现高速的处理海量数据。在又一个实施例中,加速卡的数目N大于5,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他所有加速卡均相连接,实现N个加速卡间的互连,实现高速的处理海量数据。As described above, the number of accelerator cards in the present disclosure may not be limited to four, but may be other numbers. For example, in one embodiment, the number N of accelerator cards is equal to 3, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other two accelerator cards through the internal port, so as to realize three interconnection between accelerator cards. In another embodiment, the number N of accelerator cards is equal to 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other four accelerator cards through the internal port, so that five The interconnection between acceleration cards increases the computing power of the acceleration unit and realizes high-speed processing of massive data. In yet another embodiment, the number N of accelerator cards is greater than 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to all other accelerator cards through the internal port, so that N Accelerate the interconnection between cards to achieve high-speed processing of massive data.
Based on the acceleration unit 800 provided in FIG. 8b, each accelerator card may further be connected to at least one other accelerator card through two paths. Specifically, there may be, for example, three connection schemes: in the first scheme, each accelerator card is connected through two paths to one of the other three accelerator cards; in the second scheme, each accelerator card is connected through two paths to two of the other three accelerator cards; in the third scheme, each accelerator card is connected through two paths to all of the other three accelerator cards, in which case each accelerator card may have additional ports. To facilitate understanding of these dual-path connection schemes, the first scheme is taken as an example and described below with reference to FIG. 9.
FIG. 9 is a schematic structural diagram of an acceleration unit according to another embodiment of the present disclosure. In the acceleration unit 900 shown in FIG. 9, each accelerator card may be connected to at least one other accelerator card through two paths; for example, accelerator card MC0 and accelerator card MC2 in the figure may be connected through two paths, and accelerator card MC1 and accelerator card MC3 may be connected through two paths. With this arrangement, there are two links (or paths) for information exchange between the two accelerator cards, so that when one link fails, the other link still connects the two accelerator cards, which effectively improves the reliability of the acceleration unit.
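The failover benefit of the duplicated diagonal paths can be sketched by modeling every physical link individually and checking that connectivity survives a single-link failure. This is a toy model only; real fault handling in the accelerator firmware is not described in this disclosure:

```python
# Each entry is one physical link; the diagonal pairs are duplicated.
links = [
    ("MC0", "MC1"), ("MC0", "MC3"), ("MC1", "MC2"), ("MC2", "MC3"),
    ("MC0", "MC2"), ("MC0", "MC2"),   # two paths between diagonal pair MC0-MC2
    ("MC1", "MC3"), ("MC1", "MC3"),   # two paths between diagonal pair MC1-MC3
]

def connected(links, cards=("MC0", "MC1", "MC2", "MC3")):
    """Breadth-first check that every card is reachable from MC0."""
    seen, frontier = {"MC0"}, ["MC0"]
    while frontier:
        node = frontier.pop()
        for a, b in links:
            for nxt in ((b,) if a == node else (a,) if b == node else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen == set(cards)

broken = list(links)
broken.remove(("MC0", "MC2"))          # one diagonal link fails
print(connected(links), connected(broken))  # True True
```

With one of the two MC0-MC2 links removed, the remaining duplicate keeps all four cards mutually reachable, which is the redundancy the embodiment describes.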
The connections between the acceleration unit and its multiple accelerator cards according to the present disclosure have been described above by way of example with reference to FIG. 8 and FIG. 9. Those skilled in the art will understand that the above description is illustrative rather than limiting; for example, the arrangement of the accelerator cards in the acceleration unit is not limited to the forms shown in FIG. 8 and FIG. 9. In one embodiment, the four accelerator cards of the acceleration unit may be logically laid out in a quadrilateral arrangement, which is described below with reference to FIG. 10.
FIG. 10 is a schematic structural diagram of an acceleration unit according to yet another embodiment of the present disclosure. In the acceleration unit 1000 shown in FIG. 10, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically laid out in a quadrilateral arrangement, with the four accelerator cards occupying the four vertices of the quadrilateral. The wiring among accelerator cards MC0, MC1, MC2, and MC3 then forms a quadrilateral, which makes the wiring layout clearer and easier to route. It should be noted that although the four accelerator cards in FIG. 10 are drawn as a rectangle, or a 2*2 matrix, this is a logical interconnection diagram drawn in rectangular form merely for convenience of description; the specific quadrilateral may be chosen freely, for example a parallelogram, trapezoid, or square. In the actual layout and routing, the four accelerator cards may be arranged arbitrarily; for example, in an actual machine, the four accelerator cards may be placed side by side in a row, in the order MC0, MC1, MC2, MC3. It should also be understood that the logical quadrilateral described in this embodiment is exemplary; in practice, multiple accelerator cards may be arranged in many different shapes, of which the quadrilateral is only one. For example, when the number of accelerator cards is five, they may be logically arranged in a pentagon.
Based on the connection relationship of the acceleration unit 900 provided in FIG. 9, reference is further made to FIG. 11, which is a schematic structural diagram of an acceleration unit according to yet another embodiment of the present disclosure. In the acceleration unit 1100 shown in FIG. 11, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically laid out in a quadrilateral arrangement, each occupying one of the four vertices of the quadrilateral. As further shown in the figure, the internal ports of accelerator card MC1 and the internal ports of accelerator card MC3 may be connected through two paths, and the internal ports of accelerator card MC0 and the internal ports of accelerator card MC2 may be connected through two paths. For the acceleration unit 1100, this not only simplifies the wiring but also improves reliability.
FIG. 12a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure. In the acceleration unit 1200 shown in FIG. 12a, the numeric labels on each accelerator card denote ports; each accelerator card may include six ports, namely port 0, port 1, port 2, port 3, port 4, and port 5. Among them, port 1, port 2, port 4, and port 5 are internal ports, while port 0 and port 3 are external ports. For the four accelerator cards MC0, MC1, MC2, and MC3, the two external ports of each accelerator card can be connected to other acceleration units for interconnection among multiple acceleration units, and the four internal ports of each accelerator card can be used to interconnect with the other three accelerator cards within the same acceleration unit.
As further shown in FIG. 12a, the four accelerator cards may be logically arranged in, for example, a quadrilateral. Accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship: port 2 of MC0 is connected to port 2 of MC2, and port 5 of MC0 is connected to port 5 of MC2, so that there are two links for communication between accelerator card MC0 and accelerator card MC2. Likewise, accelerator card MC1 and accelerator card MC3 may be in a diagonal relationship: port 2 of MC1 is connected to port 2 of MC3, and port 5 of MC1 is connected to port 5 of MC3, so that there are two links for communication between accelerator card MC1 and accelerator card MC3.
With this arrangement, each accelerator card has two external ports and four internal ports, and for each of the two diagonal pairs of accelerator cards, the two cards of the pair can be connected through two internal ports to form two links, which effectively improves the security and stability of the acceleration unit. Moreover, the logical quadrilateral layout of the four accelerator cards makes the wiring of the entire acceleration unit reasonable and clear, facilitating the wiring work within each acceleration unit. It should further be noted that, among the interconnection lines between the four accelerator cards shown in FIG. 12b, the connection line between port 1 of accelerator card MC1 and port 1 of MC0, the connection line between port 2 of accelerator card MC0 and port 2 of MC2, the connection line between port 1 of accelerator card MC2 and port 1 of MC3, and the connection line between port 2 of accelerator card MC3 and port 2 of MC1 together form a vertical figure-8 network, as shown in FIG. 12b. Similarly, the connection line between port 4 of accelerator card MC1 and port 4 of MC2, the connection line between port 5 of accelerator card MC2 and port 5 of MC0, the connection line between port 4 of accelerator card MC0 and port 4 of MC3, and the connection line between port 5 of accelerator card MC3 and port 5 of MC1 together form a horizontal figure-8 network, as shown in FIG. 12c. These two fully connected square networks can form a double-ring structure, which provides redundant backup and enhances system reliability.
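The two figure-8 networks just described can be checked programmatically: each four-link set forms a cycle that touches all four cards, and the two sets share no ports, so together they behave as the double ring described. A sketch, where the (card, port) tuples simply transcribe the connections of FIGS. 12b and 12c:

```python
# Each link is ((card, port), (card, port)), transcribed from FIGS. 12b/12c.
vertical = [(("MC1", 1), ("MC0", 1)), (("MC0", 2), ("MC2", 2)),
            (("MC2", 1), ("MC3", 1)), (("MC3", 2), ("MC1", 2))]
horizontal = [(("MC1", 4), ("MC2", 4)), (("MC2", 5), ("MC0", 5)),
              (("MC0", 4), ("MC3", 4)), (("MC3", 5), ("MC1", 5))]

def is_ring(links):
    """Four links form a ring when each of the four cards sits on exactly
    two link ends (connectivity is assumed for this small sketch)."""
    degree = {}
    for (a, _), (b, _) in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return len(degree) == 4 and all(d == 2 for d in degree.values())

ports_used = lambda links: {end for link in links for end in link}
print(is_ring(vertical), is_ring(horizontal))        # True True
print(ports_used(vertical) & ports_used(horizontal))  # set(): no shared ports
```

Because the two rings occupy disjoint ports, either ring can keep carrying traffic if a line of the other fails, which is the redundancy claim made for the double-ring structure.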
According to an embodiment of the present disclosure, the accelerator card described herein may be a Mezzanine Card (MC card for short), which may be a separate circuit board. The MC card may carry an ASIC chip and some necessary peripheral control circuits, and may be connected to the baseboard through a mezzanine connector, through which the power supply and control signals on the baseboard are delivered to the MC card. According to another embodiment of the present disclosure, the internal ports and/or external ports described herein may be SerDes ports. For example, in one embodiment, each MC card may provide six bidirectional SerDes ports, each SerDes port having eight lanes at a data rate of 56 Gbps, so that the total bandwidth of each port can reach up to 400 Gbps. This supports massive data exchange between accelerator cards and helps the acceleration unit process massive data at high speed.
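The per-port figure can be reproduced with simple arithmetic. Eight lanes at a raw 56 Gbps each give 448 Gbps; the cited "up to 400 Gbps" implies an effective lane rate of about 50 Gbps, so the line-coding/protocol overhead assumed below is an illustrative guess, not a figure stated in this disclosure:

```python
lanes_per_port = 8
raw_lane_gbps = 56            # raw SerDes signalling rate per lane

raw_port_gbps = lanes_per_port * raw_lane_gbps
print(raw_port_gbps)          # 448 Gbps raw aggregate per port

# Assumed effective lane rate of ~50 Gbps after line-coding and protocol
# overhead, which yields roughly the 400 Gbps cited in the text.
effective_lane_gbps = 50
print(lanes_per_port * effective_lane_gbps)  # 400 Gbps usable
```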
The term SerDes mentioned above is a portmanteau of the English words Serializer and De-Serializer. SerDes interfaces can be used to build high-performance processor clusters. The main function of a SerDes is to convert multiple low-speed parallel signals into a serial signal at the transmitting end, transmit it over the transmission medium, and finally convert the high-speed serial signal back into low-speed parallel signals at the receiving end; it is therefore well suited to end-to-end long-distance, high-speed transmission. In another embodiment, the external ports of the accelerator card may be connected to the QSFP-DD interfaces of other acceleration units, where the QSFP-DD interface is an optical module interface commonly used in SerDes technology that, in combination with cables, can be used to interconnect with other external devices.
Further, according to yet another embodiment of the present disclosure, one acceleration unit may carry four accelerator cards, and the interconnection of the four accelerator cards may be implemented with printed circuit board (PCB) traces. On a high-speed board material with a low dielectric constant, reasonable layout and routing can preserve signal integrity to the greatest extent, so that the communication bandwidth between the four accelerator cards approaches the theoretical value.
In the acceleration unit disclosed herein, each of the four accelerator cards is connected to the other three accelerator cards through its internal ports, so that every accelerator card can communicate directly with the other three. Such a communication architecture is a fully connected quad network topology. The advantage of this fully connected architecture is that the path between each accelerator card and every other accelerator card is the shortest, the total Hop count is minimal, and the latency is the lowest. The present disclosure uses Hop to describe system latency; in communication, Hop denotes the number of hops, that is, the number of communication steps. Specifically, Hop here refers to the shortest path that starts from one node, traverses all nodes in the network, and returns to the initial node. The fully connected square network topology formed by interconnecting the four accelerator cards has the shortest such delay, and the double-ring structure formed by interconnecting the two diagonal pairs of accelerator cards improves the robustness of the system, so that services can continue to run normally even when a single accelerator card fails. When performing various arithmetic and logic operations, each ring of the double-ring structure can complete part of the computation, thereby improving overall computational efficiency and maximizing the use of the topology bandwidth.
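The Hop metric defined above (shortest closed tour from one node, through all nodes, back to the start) can be computed by brute force for small topologies. The sketch below compares the fully connected quad with a simple four-node chain; the node numbering and the chain topology are illustrative choices for comparison, not part of the disclosure:

```python
from itertools import permutations

def dist(adj, a, b):
    """Shortest hop distance between nodes a and b (breadth-first search)."""
    seen, frontier, d = {a}, [a], 0
    while frontier:
        if b in frontier:
            return d
        frontier = [n for f in frontier for n in adj[f] if n not in seen]
        seen |= set(frontier)
        d += 1
    return None

def total_hop(adj):
    """Shortest closed tour from node 0 visiting every node: the Hop metric."""
    nodes = sorted(adj)
    best = None
    for order in permutations(nodes[1:]):
        tour = [nodes[0], *order, nodes[0]]
        cost = sum(dist(adj, x, y) for x, y in zip(tour, tour[1:]))
        best = cost if best is None else min(best, cost)
    return best

quad = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(total_hop(quad), total_hop(chain))  # 4 6
```

For the fully connected quad, every leg of the tour is a single hop, so the Hop total is 4, the minimum possible for four nodes; the chain needs 6 hops, illustrating why the fully connected arrangement gives the lowest latency.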
Multiple embodiments of the acceleration unit according to the present disclosure have been described above with reference to FIGS. 8a-12c. Based on the above acceleration unit, the present disclosure further discloses an acceleration assembly that may include multiple such acceleration units; several embodiments of the acceleration assembly are described below by way of example.
FIG. 13 is a schematic structural diagram of an acceleration assembly according to an embodiment of the present disclosure. As shown in FIG. 13, the acceleration assembly 1300 may include n of the above acceleration units; in other words, the accelerator card system may be embodied as an acceleration assembly that includes multiple acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An, where acceleration unit A1 and acceleration unit A2 are connected through external ports, and acceleration unit A2 and acceleration unit A3 are connected through external ports; that is, the acceleration units are connected to one another through their external ports. In one embodiment, the external port of accelerator card MC0 in acceleration unit A1 may be connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 may be connected to the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected through the external ports of their MC0 accelerator cards.
Those skilled in the art will understand that the connections between acceleration units in the present disclosure are not limited to connections through the external ports of accelerator card MC0; they may also include, for example, one or more of connections through the external ports of accelerator card MC1, the external ports of accelerator card MC2, and the external ports of accelerator card MC3. That is, in the present disclosure, the connection between acceleration unit A1 and acceleration unit A2 may include one or more of the following: the external port of MC0 in A1 connected to the external port of MC0 in A2; the external port of MC1 in A1 connected to the external port of MC1 in A2; the external port of MC2 in A1 connected to the external port of MC2 in A2; and the external port of MC3 in A1 connected to the external port of MC3 in A2. Similarly, the connection between acceleration unit A2 and acceleration unit A3 may include one or more of the following: the external port of MC0 in A2 connected to the external port of MC0 in A3; the external port of MC1 in A2 connected to the external port of MC1 in A3; the external port of MC2 in A2 connected to the external port of MC2 in A3; and the external port of MC3 in A2 connected to the external port of MC3 in A3. The same applies, by analogy, up to the connection between acceleration unit An-1 and acceleration unit An. It should be noted that the above description is exemplary; for example, the connections between different acceleration units need not be limited to connections between identically numbered accelerator cards, and may be configured as needed.
It should be noted that FIG. 13 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 as in the figure; it may also be set to, for example, 2 or 3. The connection relationship between two acceleration units is the same as or similar to that between acceleration units A1 and A2 described above, and the connection relationship among three acceleration units is the same as or similar to that among acceleration units A1, A2, and A3 described above, so the details are not repeated here.
In addition, the structures of the multiple acceleration units in the acceleration assembly may be the same or different. For convenience of illustration, FIG. 13 shows multiple acceleration units with identical structures, but in practice the structures of the multiple acceleration units may differ. For example, in some acceleration units the accelerator cards are laid out as a polygon, while in others they are laid out in a line; in some acceleration units the accelerator cards are connected by a single line, while in others the accelerator cards are connected by two links; some acceleration units include four accelerator cards, while others include three or five, and so on. That is, the structure of each acceleration unit can be configured individually, and the structures of different acceleration units may be the same or different.
With the acceleration assembly disclosed herein, not only can the accelerator cards within an acceleration unit be interconnected, but the accelerator cards of different acceleration units can also be interconnected, so that a hybrid three-dimensional network can be constructed. With this arrangement, each accelerator card can share data through the interconnection between acceleration units while processing data; since shared data can be obtained directly, the data propagation path and time are reduced, which significantly improves data processing efficiency.
FIG. 14 is a schematic structural diagram of an acceleration assembly according to another embodiment of the present disclosure. As shown in FIG. 14, the acceleration assembly 1400 may include n of the aforementioned acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The multiple acceleration units in the acceleration assembly 1400 may logically form a multi-layer structure (shown by dashed lines in the figure); each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected through external ports to the accelerator cards of another acceleration unit. This layer-by-layer configuration allows each accelerator card to share data over high-speed serial links while performing high-speed computation, enabling virtually unlimited interconnection of accelerator cards to meet customizable computing-power requirements and to flexibly configure the hardware computing power of a processor cluster. As further shown in the figure, the acceleration unit of each layer may include four accelerator cards; each acceleration unit may be logically laid out as a quadrilateral, with the four accelerator cards arranged at its four vertices.
It should be understood by those skilled in the art that the acceleration assembly described above in conjunction with FIG. 14 is exemplary rather than limiting. For example, the structures of the multiple acceleration units may be the same or different. The number of layers of the acceleration assembly may be 2, 3, 4, or more, and may be set freely as required. For every two connected acceleration units, the number of connection paths between them may be 1, 2, 3, or 4. For ease of understanding, exemplary descriptions follow with reference to FIGS. 15-19.
FIG. 15 is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. As shown in FIG. 15, the acceleration assembly 1401 may include 2 acceleration units, connected through a single path; specifically, the external port of accelerator card MC0 in acceleration unit A1 may, for example, be connected to the external port of accelerator card MC0 in acceleration unit A2, thereby enabling information exchange between acceleration unit A1 and acceleration unit A2.
As shown in FIG. 16, the acceleration assembly 1402 may include 2 acceleration units, connected through two paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2. In this way, when one of the paths fails, the other line still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
Reference is now made to FIG. 17, which is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. In the acceleration assembly 1403 shown in FIG. 17, there may be 2 acceleration units, connected through three paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2. Thus, even when two of the paths fail, a remaining path still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
Reference is now made to FIG. 18, which is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. In the acceleration assembly 1404 shown in FIG. 18, there may be 2 acceleration units, connected through four paths: for example, the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2; the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2; the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2; and the external port of accelerator card MC3 in acceleration unit A1 is connected to the external port of accelerator card MC3 in acceleration unit A2. Thus, even when three of the paths fail, a remaining path still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
FIG. 19a is a schematic diagram of an acceleration assembly represented as a network topology. As shown in FIG. 19a, the acceleration assembly 1405 may include two acceleration units, each of which may include four accelerator cards; within each acceleration unit, there may be two links between accelerator card MC1 and accelerator card MC3, and two links between accelerator card MC0 and accelerator card MC2. The acceleration assembly 1405 in the left diagram of FIG. 19a can be drawn in the three-dimensional form shown in the right diagram. In the right diagram of FIG. 19a, the circles represent accelerator cards and the lines represent link connections; the number 0 in a circle denotes accelerator card MC0, 1 denotes accelerator card MC1, 2 denotes accelerator card MC2, and 3 denotes accelerator card MC3. The right diagram still depicts the acceleration assembly 1405, merely in another form of representation, namely as a network topology. The numbers embedded in the vertical lines of the right diagram indicate the connected port numbers; for example, the MC0 cards of the two acceleration units are connected through port 0, the MC1 cards through port 0, the MC2 cards through port 3, and the MC3 cards through port 3.
In the right diagram of FIG. 19a, each acceleration unit is regarded as a node; the two nodes together hold 8 accelerator cards, so the two nodes constitute a so-called 8-card interconnection. The one-machine-four-card interconnection inside each node is fixed. When the two nodes are interconnected, MC0 and MC1 in the upper node (acceleration unit A1) are connected through port 0 to MC0 and MC1 of the lower node (acceleration unit A2), respectively; MC2 and MC3 of the upper node are connected through port 3 to MC2 and MC3 of the lower node, respectively. This node topology is called a Hybrid Cube Mesh; that is, the acceleration assembly 1405 forms a hybrid cube mesh topology.
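The hybrid cube mesh formed by the two interconnected nodes can be modeled as an 8-vertex graph: each unit is a fully connected quad, and each card also links to the identically numbered card of the other unit. A sketch verifying that any two of the 8 cards are at most two hops apart (the unit labels A1/A2 follow the figure; the graph is a simplified model that ignores the duplicated diagonal links):

```python
def hybrid_cube_mesh():
    """8-card adjacency: two fully connected quads plus same-number peer links."""
    adj = {(u, c): set() for u in ("A1", "A2") for c in range(4)}
    for u in ("A1", "A2"):                      # fully connected quad per unit
        for a in range(4):
            for b in range(4):
                if a != b:
                    adj[(u, a)].add((u, b))
    for c in range(4):                          # inter-unit peer links
        adj[("A1", c)].add(("A2", c))
        adj[("A2", c)].add(("A1", c))
    return adj

def diameter(adj):
    """Longest shortest-path distance over all card pairs."""
    def dist(a, b):
        seen, frontier, d = {a}, {a}, 0
        while b not in frontier:
            frontier = {n for f in frontier for n in adj[f]} - seen
            seen |= frontier
            d += 1
        return d
    return max(dist(a, b) for a in adj for b in adj)

print(diameter(hybrid_cube_mesh()))  # 2: any card reaches any other in <= 2 hops
```

A card reaches its three unit-mates and its peer in one hop, and every remaining card of the other unit in two, which is what makes the 8-card interconnection low-latency.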
In the 8-card topology shown in FIG. 19a, two independent rings can also be formed, as shown in FIG. 19b and FIG. 19c, which maximizes the use of the topology bandwidth for reduction operations.
In FIG. 19b, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 5, accelerator cards MC0 and MC2 are connected through their respective internal ports 5, and accelerator cards MC2 and MC3 are connected through their respective internal ports 1; meanwhile, accelerator card MC1 in acceleration unit A1 and accelerator card MC1 in acceleration unit A2 are connected through their respective external ports 0, and accelerator card MC0 in acceleration unit A1 and accelerator card MC0 in acceleration unit A2 are connected through their respective external ports 0. Thus, one independent ring is formed among the eight cards in FIG. 19b.
In FIG. 19c, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 2, accelerator cards MC0 and MC2 are connected through their respective internal ports 2, and accelerator cards MC0 and MC1 are connected through their respective internal ports 1; meanwhile, accelerator card MC2 in acceleration unit A1 and accelerator card MC2 in acceleration unit A2 are connected through their respective external ports 3, and accelerator card MC3 in acceleration unit A1 and accelerator card MC3 in acceleration unit A2 are connected through their respective external ports 3. Thus, another independent ring is formed among the eight cards in FIG. 19c.
Only two exemplary connection schemes are shown above. In fact, the four connection paths between the two acceleration units are equivalent; therefore, any one to three of these four paths may be used to connect the two acceleration units and to form a ring connection with the accelerator cards inside each acceleration unit. Details are not repeated here.
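The ring of FIG. 19b can be checked mechanically: walking the listed links from any card should visit all eight cards before returning to the start. This is a minimal sketch; A2's internal wiring is assumed to mirror A1's, which the paragraph above does not state explicitly.

```python
# Links of the FIG. 19b ring; A2's internal links (first three entries
# of the second row) are assumed to mirror A1's.
ring_b_links = [
    ("A1.MC1", "A1.MC3"), ("A1.MC0", "A1.MC2"), ("A1.MC2", "A1.MC3"),
    ("A2.MC1", "A2.MC3"), ("A2.MC0", "A2.MC2"), ("A2.MC2", "A2.MC3"),
    ("A1.MC1", "A2.MC1"), ("A1.MC0", "A2.MC0"),  # external ports 0
]

def traverse(links):
    """Walk the ring from an arbitrary card; return the cycle length."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    start = prev = "A1.MC0"
    cur = adj[start][0]
    steps = 1
    while cur != start:
        # Each card has exactly two neighbours; do not walk backwards.
        nxt = [n for n in adj[cur] if n != prev]
        prev, cur = cur, nxt[0]
        steps += 1
    return steps

print(traverse(ring_b_links))  # 8: one ring through all eight cards
```

The same check applies to the FIG. 19c ring by substituting its link list.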
FIG. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. As shown in FIG. 20, the acceleration device 2000 may include n of the above-described acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The acceleration units in the acceleration device 2000 are logically arranged in a multi-layer structure (shown with dotted lines in the figure); the number of layers may be odd or even, each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected through external ports to the accelerator cards of another acceleration unit. Specifically, acceleration unit A1 and acceleration unit A2 are connected through external ports, acceleration unit A2 and acceleration unit A3 are connected through external ports, and so on until acceleration unit An-1 and acceleration unit An are connected through external ports. Moreover, the last acceleration unit may be connected to the first acceleration unit, so that the multiple acceleration units are connected end to end to form a ring structure; for example, in the figure, the external port of accelerator card MC0 of acceleration unit An is connected to the external port of accelerator card MC0 of acceleration unit A1. This layer-by-layer configuration enables each accelerator card to share data over high-speed serial links while processing data at high speed, realizing unlimited interconnection of accelerator cards to satisfy customizable computing-power demands and enabling flexible configuration of the hardware computing power of a processor cluster.
It should be noted that there are various possibilities for the connection relationship among the acceleration units in the acceleration device of the present disclosure, which have been described in detail above; for details, reference may be made, for example, to the description of the connection relationship of the acceleration units in FIG. 13, which is not repeated here. In addition, there are various ways to connect the last acceleration unit to the first acceleration unit, which may specifically include one or more of the following: the external port of MC0 in acceleration unit A1 is connected to the external port of MC0 in An; the external port of MC1 in acceleration unit A1 is connected to the external port of MC1 in An; the external port of MC2 in acceleration unit A1 is connected to the external port of MC2 in An; and the external port of MC3 in acceleration unit A1 is connected to the external port of MC3 in An. For ease of understanding, exemplary descriptions will be given below with reference to FIG. 21 and FIG. 22. In the following description, those skilled in the art can understand that the acceleration devices shown in FIG. 21 and FIG. 22 are specific embodiments of the acceleration device 2000 shown in FIG. 20; therefore, the relevant description of the acceleration device 2000 of FIG. 20 is also applicable to the acceleration devices in FIG. 21 and FIG. 22.
Referring to FIG. 21, FIG. 21 is a schematic diagram of a network topology corresponding to the acceleration device in an embodiment. The acceleration device 2001 shown in FIG. 21 may be composed of four acceleration units; the circles represent accelerator cards and the lines represent link connections, with the number 0 in a circle denoting accelerator card MC0, the number 1 denoting MC1, the number 2 denoting MC2, and the number 3 denoting MC3; the numbers embedded in the vertical lines indicate the port numbers of the connections. The last acceleration unit is connected to the first acceleration unit, and the total number of hops is 5. Each acceleration unit is a node; through the interconnection of the nodes, a 4-node, 16-card interconnection can be realized. The four acceleration units form a small, internally interconnected cluster called a supercomputing cluster, or super pod. This topology is the primary form for ultra-large-scale clusters; it uses high-speed SerDes ports, the total number of hops is 5, and the latency is minimal. The manageability of the cluster is good, and so is its robustness.
Referring to FIG. 22, FIG. 22 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment. The difference between FIG. 22 and FIG. 21 is that the acceleration device 2002 shown in FIG. 22 has more acceleration units. It can be seen from the figure that the last acceleration unit of the acceleration device 2002 is connected to the first acceleration unit. With an acceleration device arranged in this way, the total number of hops is the number of nodes plus one, i.e., the number of acceleration units plus one.
An acceleration device including multiple acceleration units has been exemplarily described above with reference to FIG. 20 to FIG. 22. According to the technical solution of the present disclosure, an acceleration device that may include multiple of the aforementioned acceleration components is also provided, which will be described in detail below with reference to several embodiments.
FIG. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure; the acceleration system of the present disclosure may be implemented as one acceleration device. The acceleration device 3000 may include m of the aforementioned acceleration components. In each acceleration component, in addition to the external ports used for the connections between the acceleration units inside the component, there are idle external ports, and the acceleration components are connected to one another through these idle external ports. For example, the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B1 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B2; the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B2 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B3; and so on, so that the multiple acceleration components are connected to one another. It can be understood that the acceleration device shown in FIG. 23 is exemplary rather than limiting; for example, the structures of the multiple acceleration components may be the same or different. As another example, the manner in which different acceleration components are connected through idle external ports is not limited to that shown in FIG. 23 and may include other manners. For ease of understanding, exemplary descriptions will be given below with reference to FIG. 24 to FIG. 32.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 24, FIG. 24 is a schematic diagram of a network topology corresponding to the acceleration device in yet another embodiment. The acceleration device 3001 may include two acceleration components; acceleration component B1 may include four acceleration units, and acceleration component B2 may include four acceleration units. The first acceleration unit in acceleration component B1 is connected to the first acceleration unit in acceleration component B2, and the last acceleration unit in acceleration component B1 is connected to the last acceleration unit in acceleration component B2. The total number of hops under this network topology is 9. Those skilled in the art can understand that the network structure composed of multiple acceleration units in each acceleration component in FIG. 24 is logical, and in practical applications the arrangement positions of the acceleration units can be adjusted as needed. The number of acceleration units in each acceleration component is not limited to the four shown in the figure and may be set larger or smaller as needed, for example, six, eight, and so on.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 25, FIG. 25 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3002 may include four acceleration components, namely acceleration components B1, B2, B3, and B4. Each of the four acceleration components may include two acceleration units A1 and A2, and each acceleration component may be connected, through one of its acceleration units A1 and A2, to one of the acceleration units A1 and A2 of another acceleration component. For example, acceleration unit A1 in acceleration component B1 is connected to acceleration unit A1 in acceleration component B2; acceleration unit A1 in acceleration component B2 is connected to acceleration unit A1 in acceleration component B3; and acceleration unit A1 in acceleration component B3 is connected to acceleration unit A1 in acceleration component B4. All of these connections are made through the external ports of the acceleration units.
It should be noted that, besides the connection scheme shown in FIG. 25, there are many other possible connection schemes among the acceleration components. For example, the connections among the acceleration components may specifically include: acceleration unit A1 or A2 in acceleration component B1 is connected to acceleration unit A1 or A2 in acceleration component B2; acceleration unit A1 or A2 in acceleration component B2 is connected to acceleration unit A1 or A2 in acceleration component B3; and acceleration unit A1 or A2 in acceleration component B3 is connected to acceleration unit A1 or A2 in acceleration component B4.
Based on the acceleration device provided in FIG. 25, and further referring to FIG. 26, FIG. 26 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. In the acceleration device 3003 shown in FIG. 26, each acceleration component may be connected, through one of its first and second acceleration units and over two paths, to one of the first and second acceleration units of another acceleration component. For example, in the figure, the first acceleration unit (e.g., acceleration unit A1) in acceleration component B1 and the first acceleration unit (e.g., acceleration unit A1) in acceleration component B2 may be connected through two paths; acceleration unit A1 in acceleration component B2 and acceleration unit A1 in acceleration component B3 are connected through two paths; and acceleration unit A1 in acceleration component B3 and acceleration unit A1 in acceleration component B4 are connected through two paths.
It should be noted that FIG. 26 illustrates connections over two paths, but connections over more than two paths are also possible. Besides the connection scheme shown in FIG. 26, other schemes for connecting the acceleration components may be used; for example, acceleration unit A1 or A2 in acceleration component B1 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B2, acceleration unit A1 or A2 in acceleration component B2 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B3, and acceleration unit A1 or A2 in acceleration component B3 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B4.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 27, FIG. 27 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3004 includes four acceleration components, namely acceleration component B1, acceleration component B2, acceleration component B3, and acceleration component B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards. In each acceleration unit, MC0 and MC1 form the first pair of accelerator cards, and MC2 and MC3 form the second pair. Specifically, the second pair of accelerator cards of acceleration unit A1 of acceleration component B1 is connected to the second pair of accelerator cards of acceleration unit A2 of acceleration component B2; the first pair of accelerator cards of acceleration unit A2 of acceleration component B2 is connected to the first pair of accelerator cards of acceleration unit A1 of acceleration component B3; the second pair of accelerator cards of acceleration unit A2 of acceleration component B3 is connected to the second pair of accelerator cards of acceleration unit A1 of acceleration component B4; and the first pair of accelerator cards of acceleration unit A1 of acceleration component B4 is connected to the first pair of accelerator cards of acceleration unit A2 of acceleration component B1.
Referring to FIG. 28, FIG. 28 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3005 shown in FIG. 28 is a specific form of the acceleration device 3004 shown in FIG. 27, so the above description of the acceleration device 3004 is also applicable to the acceleration device 3005 in FIG. 28. As shown in FIG. 28, each acceleration component of the acceleration device 3005 may form a hybrid cube mesh unit, and the interconnection inside each hybrid cube mesh unit may be as shown in the figure, realizing the 8-node, 32-card interconnection of the acceleration device 3005. The four acceleration components may be interconnected across multiple cards and multiple nodes through, for example, QSFP-DD interfaces and cables, forming a matrix network topology.
Specifically, in this embodiment, port 0 of accelerator cards MC2 and MC3 of the upper node of acceleration component B1 may be connected to accelerator cards MC2 and MC3, respectively, of the lower node of acceleration component B2; port 3 of MC0 and MC1 of the lower node of acceleration component B2 may be connected to MC0 and MC1, respectively, of the upper node of acceleration component B3; port 0 of MC2 and MC3 of the lower node of acceleration component B3 may be connected to MC2 and MC3, respectively, of the upper node of acceleration component B4; and port 3 of MC0 and MC1 of the upper node of acceleration component B4 may be connected to MC0 and MC1, respectively, of the lower node of acceleration component B1. The interconnection among the hybrid cube mesh units arranged in this way can form two bidirectional ring structures (as described above with reference to FIG. 12b, FIG. 12c, FIG. 19b, and FIG. 19c), which offers advantages such as good reliability and security, is suitable for deep-learning training, and provides high computing efficiency. For the matrix network topology composed of 8 nodes in the acceleration device 3005, the total number of hops is 11.
Further, as shown in FIG. 28, the first pair and the second pair of accelerator cards in different acceleration units of the same acceleration component may be connected indirectly. For example, in acceleration component B1, accelerator cards MC0 and MC1 of the upper-layer acceleration unit are indirectly connected to accelerator cards MC2 and MC3 of the lower-layer acceleration unit.
On the basis of the network topology of FIG. 28, taking the matrix network topology as a basic unit, the topology can be further expanded into a larger network topology. FIG. 29 is a schematic diagram of a matrix network topology based on unlimited expansion of the acceleration device. As shown in FIG. 29, the acceleration device 3006 may include multiple acceleration components; each acceleration component (shown as a block in the figure) may include multiple acceleration units (the three-dimensional view is not shown; reference may be made to the acceleration component structure of FIG. 28), and each acceleration unit may include, for example, the interconnection of four accelerator cards as illustrated. Therefore, this matrix network topology can, in theory, be expanded without limit.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 30, FIG. 30 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3008 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components may be connected in a ring. Specifically, acceleration unit An of acceleration component B1 may be connected to acceleration unit A1 of acceleration component B2; acceleration unit An of acceleration component B2 may be connected to acceleration unit A1 of acceleration component B3; and so on up to acceleration component Bm, whose acceleration unit An may be connected to acceleration unit A1 of acceleration component B1, so that the m acceleration components are connected end to end in a ring.
Based on FIG. 30, and referring to FIG. 31, FIG. 31 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3009 may include 6 acceleration components; each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, forming a 12-node, 48-card interconnection and thus a larger matrix network topology. The total number of hops under this network topology is 13.
Based on FIG. 31, and referring to FIG. 32, FIG. 32 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3010 includes 8 acceleration components; each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, forming a 16-node, 64-card interconnection and thus an even larger matrix network topology. The total number of hops under this network topology is 17.
On the basis of FIG. 32, the topology can be extended vertically without limit, forming ultra-large-scale matrix networks such as 20 nodes with 80 cards, 24 nodes with 96 cards, and so on. In theory, the expansion can continue indefinitely, and the total number of hops is the number of nodes plus one. By optimizing the interconnection among the nodes, the latency of the entire system can be minimized, satisfying the system's real-time requirements to the greatest extent while processing massive amounts of data.
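The hop counts quoted for the ring-connected configurations above (FIG. 21, FIG. 24, FIG. 31, FIG. 32 and the vertical extensions just described) all follow the stated rule "total hops = number of nodes + 1"; note that this rule does not cover the 8-node matrix topology of FIG. 28, whose stated hop count is 11. A small consistency check, assuming four accelerator cards per node as in the figures:

```python
# "Total hops = number of nodes + 1" for the ring-connected topologies.
def total_hops(num_nodes: int) -> int:
    return num_nodes + 1

# (nodes, cards, hops) pairs stated in the text; the 20- and 24-node
# hop counts are derived from the formula, not stated explicitly.
examples = [(4, 16, 5), (8, 32, 9), (12, 48, 13),
            (16, 64, 17), (20, 80, 21), (24, 96, 25)]
for nodes, cards, hops in examples:
    assert total_hops(nodes) == hops
    assert cards == nodes * 4  # four accelerator cards per node
print("ok")
```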
An acceleration device including multiple acceleration components has been exemplarily described above with reference to FIG. 23 to FIG. 32. Those skilled in the art can understand that the above description is exemplary rather than limiting; for example, the number and structure of the acceleration components and the connection relationships among them can all be adjusted as needed. Those skilled in the art may also combine the above embodiments to form an acceleration device as needed, which also falls within the protection scope of the present disclosure.
In addition, it should be noted that the accelerator card matrix, the fully connected square network (topology), the hybrid cube mesh network (topology), the matrix network (topology), and the like described in the present disclosure are all logical, and their specific physical layouts can be adjusted as needed.
The topologies disclosed in the present disclosure can also perform reduction operations on data. A reduction operation can be performed on each accelerator card, in each acceleration unit, and in the acceleration device. The specific operation steps may be as follows.
Taking a reduce-sum operation as an example, the reduction process performed within one acceleration unit may include: transferring the data stored in the first accelerator card to the second accelerator card, and, in the second accelerator card, adding the data originally stored there to the data received from the first accelerator card; next, transferring the addition result from the second accelerator card to the third accelerator card and performing another addition; and so on, until all the data stored in the accelerator cards have been summed and every accelerator card has received the final result.
Taking the acceleration unit shown in FIG. 11 as an example, accelerator card MC0 stores the data (0,0), accelerator card MC1 stores the data (1,2), accelerator card MC2 stores the data (3,1), and accelerator card MC3 stores the data (2,4). The data (0,0) in accelerator card MC0 can be transferred to accelerator card MC1, where an addition yields the result (1,2); next, the result (1,2) is transferred to accelerator card MC2, yielding the next result (4,3); then, that result (4,3) is transferred to accelerator card MC3, yielding the final result (6,7).
Thereafter, in the reduction operation of the present disclosure, the final result (6,7) is further propagated to each of the accelerator cards MC0, MC1, MC2, and MC3, so that all accelerator cards store the data (6,7), thereby completing the reduction operation within one acceleration unit.
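The reduce-then-broadcast steps above can be sketched as a minimal, illustrative simulation (the actual on-card implementation is not specified by this disclosure), using the worked example in which MC0 through MC3 hold (0,0), (1,2), (3,1), and (2,4):

```python
# Minimal sketch of the in-unit reduce-sum followed by a broadcast.
def ring_reduce_sum(cards):
    """Pass a running elementwise sum along the cards, then broadcast."""
    acc = list(cards[0])
    partials = []                      # intermediate sums, card by card
    for data in cards[1:]:
        acc = [a + d for a, d in zip(acc, data)]
        partials.append(tuple(acc))
    # Broadcast: every accelerator card stores the final result.
    return partials, [tuple(acc)] * len(cards)

cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3
partials, final = ring_reduce_sum(cards)
print(partials)   # [(1, 2), (4, 3), (6, 7)]
print(final[0])   # (6, 7), held by every card after the broadcast
```

The intermediate results (1,2), (4,3), (6,7) match the transfers MC0→MC1→MC2→MC3 described in the example.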
The acceleration unit shown in FIG. 11 can form two independent rings, and each ring can complete the reduction of half of the data, thereby accelerating the operation and improving computational efficiency.
In addition, when performing a reduction operation, the above acceleration unit can also have multiple accelerator cards compute concurrently, thereby speeding up the operation. For example, accelerator card MC0 stores the data (0,0), accelerator card MC1 stores (1,2), accelerator card MC2 stores (3,1), and accelerator card MC3 stores (2,4). Part of the data in MC0, namely (0), can be transferred to MC1, where an addition yields the result (1); simultaneously, part of the data in MC1, namely (2), is transferred to MC2, where an addition yields the result (3), thereby realizing concurrent operation of accelerator cards MC1 and MC2; and so on until the entire reduction is completed.
The above concurrent computation may also proceed in groups: a group of accelerator cards first performs its additions, and the result of that group is then reduced with the result of another group. For example, with accelerator card MC0 storing (0,0), MC1 storing (1,2), MC2 storing (3,1), and MC3 storing (2,4), the data in MC0 can be transferred to MC1 and summed to obtain the first group result (1,2); synchronously or asynchronously, the data in MC2 can be transferred to MC3 and summed to obtain the second group result (5,5). Next, the first group result and the second group result are summed to obtain the final reduction result (6,7).
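The grouped variant can be sketched the same way: MC0+MC1 and MC2+MC3 are summed in parallel, then the two group results are combined. The data and the expected intermediate results come directly from the example above; this is an illustrative model, not the on-card implementation.

```python
# Sketch of the grouped (pairwise) reduce-sum over four cards.
def pairwise_reduce_sum(cards):
    # The two groups can run concurrently on real hardware.
    group1 = tuple(a + b for a, b in zip(cards[0], cards[1]))  # MC0+MC1
    group2 = tuple(a + b for a, b in zip(cards[2], cards[3]))  # MC2+MC3
    final = tuple(a + b for a, b in zip(group1, group2))
    return group1, group2, final

g1, g2, final = pairwise_reduce_sum([(0, 0), (1, 2), (3, 1), (2, 4)])
print(g1, g2, final)  # (1, 2) (5, 5) (6, 7)
```

Compared with the purely sequential pass, this halves the depth of the addition chain, which is why grouping speeds up the reduction.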
Similarly, in addition to being performed within one acceleration unit, the reduction operation may also be performed within an acceleration component or an acceleration device. It should be understood that an acceleration device may also be regarded as acceleration components connected end to end.
Performing a reduction operation in an acceleration component or acceleration device may include: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the multiple acceleration units to obtain a second reduction result.
Again taking reduction summation as an example, the first step has been described above. For an acceleration device including multiple acceleration units, a local reduction operation may first be performed within each acceleration unit; after the reduction within each acceleration unit is completed, the accelerator cards in the same acceleration unit hold the result of the local reduction, referred to here as the first reduction result.
Next, the first reduction results of all acceleration units may be passed between adjacent acceleration units and added. Thus, similarly to the reduction performed within one acceleration unit, a first acceleration unit passes its first reduction result to a second acceleration unit; after the addition is performed in the accelerator cards of the second acceleration unit, the result is passed on and added again. After the final addition, the final result is propagated to every acceleration unit.
It should be pointed out that, since the acceleration components above are not necessarily connected end to end, the final result may be propagated to each acceleration unit in the reverse direction, rather than cyclically as when the acceleration units are connected end to end. The technical solution of the present disclosure places no specific limitation on how the final result is propagated.
Further, according to an embodiment of the present disclosure, the acceleration device may also be configured to perform a reduction operation including: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results of the multiple acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results of the multiple acceleration components to obtain a second reduction result.
In this embodiment, the reduction operation may first be performed within the same acceleration unit, which has been described above and will not be repeated here.
Next, a reduction operation may be performed within each acceleration component, so that every accelerator card in each acceleration component obtains the local reduction result of that component; then, taking acceleration components as units, a reduction operation is performed across the multiple acceleration components, so that every accelerator card obtains the global reduction result of the acceleration device.
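The two-level reduction above (a local first reduction inside each acceleration unit, then a second reduction across units) can be sketched as follows. The unit layout and data are hypothetical; element-wise summation stands in for the reduction operator.

```python
# Hypothetical sketch of the hierarchical reduction: a local (first)
# reduction inside each acceleration unit, then a global (second)
# reduction across the units of the device.
def reduce_sum(vectors):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(col) for col in zip(*vectors)]

# Each inner list models one acceleration unit holding the data of its
# accelerator cards (example values, not part of the disclosure).
units = [
    [[0, 0], [1, 2]],   # acceleration unit 0
    [[3, 1], [2, 4]],   # acceleration unit 1
]

# First reduction: every card in a unit obtains the unit-local result.
first_results = [reduce_sum(unit) for unit in units]   # [[1, 2], [5, 5]]

# Second reduction: the unit-local results are combined across units,
# and the global result is then propagated back to every card.
second_result = reduce_sum(first_results)              # [6, 7]
print(second_result)
```

The same pattern extends to the three-level variant in the embodiment above by inserting an intermediate reduction across the units of one acceleration component before the final reduction across components.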
Various implementations of the accelerator card system have been described above; a more specific method of communication based on this accelerator card system is described below.
In the present disclosure, the communication task queue and the communication task execution queue are separate. Distinguishing the two allows operations such as fault tolerance or retransmission of tasks to be performed without the user being aware of them.
In the present disclosure, a communication task may be issued as an asynchronous task to any accelerator card in the accelerator card system, forming a communication task queue. Communication tasks in the same queue are executed sequentially. The execution of these communication tasks may be completed by another accelerator card, so that the communication task queue and the communication task execution queue reside on different accelerator cards.
Preferably, a communication task involving a relatively large amount of data may also be split into multiple communication tasks for execution. Communication tasks in the same queue are executed serially in the order in which they were issued, while tasks in different queues may be executed concurrently; thus, after a total communication task is divided into multiple sub-communication tasks executed in parallel, the execution efficiency of the communication task is greatly improved. For communication tasks such as Allreduce, data can be transmitted from one accelerator card to another over different communication paths; therefore, when a total communication task is divided into multiple sub-communication tasks executed in parallel, these sub-communication tasks may be executed over different communication paths.
Taking the accelerator card connections shown in Figures 12a-12c as an example, when data needs to be transferred from accelerator card MC1 to accelerator card MC3, the data may be transferred over the following communication paths:
1. As shown in Figure 12b, data may be transferred from port 1 of accelerator card MC1 to port 1 of accelerator card MC0, then from port 2 of accelerator card MC0 to port 2 of accelerator card MC2, and finally from port 1 of accelerator card MC2 to port 1 of accelerator card MC3;
2. As shown in Figure 12b, data may also be transferred directly from port 2 of accelerator card MC1 to port 2 of accelerator card MC3;
3. As shown in Figure 12c, data may be transferred from port 4 of accelerator card MC1 to port 4 of accelerator card MC2, then from port 5 of accelerator card MC2 to port 5 of accelerator card MC0, and finally from port 4 of accelerator card MC0 to port 4 of accelerator card MC3;
4. As shown in Figure 12c, data may also be transferred directly from port 5 of accelerator card MC1 to port 5 of accelerator card MC3.
It can thus be seen that, in the disclosed technical solution, communication between two accelerator cards may proceed over multiple communication paths; in other words, communication between two accelerator cards may proceed over different topologies. Accordingly, when a total communication task is divided into multiple sub-communication tasks, each sub-communication task may be executed over a different communication path.
It should be understood that the description above in conjunction with Figures 12a-12c is merely a simple example. When the accelerator card system includes more accelerator cards, the communication paths become more complex and diverse, so that the total communication task may be divided into a greater number of sub-communication tasks.
Multiple status flags may be set in the communication task queue. These status flags can monitor the execution of communication tasks and can also control the execution of other communication tasks. The execution of a communication task changes a status flag, and a change of a status flag correspondingly changes the execution of other communication tasks. These status flags are described in more detail below.
According to an embodiment of the present disclosure, although the communication task queue may be loaded onto any accelerator card, it may preferably be loaded onto a lightly loaded accelerator card. It should be understood that when multiple accelerator cards all participate in computation and communication, the load of each card may differ, and a lightly loaded card may preferably be selected to carry the communication task queue. The accelerator card can receive communication tasks from the host or from other accelerator cards, form a queue, and control the execution of the tasks in the queue. This approach helps make full use of accelerator card resources and improves the overall operating efficiency of the system.
Figure 36a shows a flowchart of a method for executing a communication task according to an embodiment of the present disclosure; Figure 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure.
As shown in Figure 36a, according to an embodiment of the present disclosure, the method of the present disclosure further includes: in operation S3610, dividing a total communication task in the communication task queue into multiple sub-communication tasks, each sub-communication task being placed in a different communication task execution queue; in operation S3620, executing the multiple sub-communication tasks in parallel over different communication paths; and in operation S3630, in response to the sub-communication tasks having finished executing, causing the total communication task to finish executing.
The above method is described in detail below in conjunction with Figure 36b.
Figure 36b includes two types of queues, namely the communication task queue LQ and the communication task execution queues PQ. The communication task queue can receive multiple total communication tasks, for example total communication tasks A, B, and C. When these total communication tasks A, B, and C enter the communication task queue LQ, they are combined serially with execution order A, B, C. That is, while communication task A is executing, communication tasks B and C must wait; communication task B can execute only after communication task A has finished, and communication task C must wait for communication task B to finish before it can execute. Such an execution mode cannot make full use of the parallel resources of the system; in particular, when the execution time of one communication task is especially long or its data volume is especially large, the execution of the other communication tasks is noticeably blocked and system performance suffers.
A communication task in the communication task queue LQ may be regarded as a total communication task and divided into multiple sub-communication tasks executed in parallel, which are placed in the communication task execution queues PQ for execution. When a total communication task is divided into multiple sub-communication tasks executed in parallel, the execution efficiency of the task can be significantly improved.
In the present disclosure, taking total communication task B as an example, total communication task B may be divided into multiple sub-communication tasks b1, b2, and so on; two sub-communication tasks b1 and b2 are used here for illustration. It should be noted that the number of sub-communication tasks may be different, and may be determined according to the topology of the accelerator card system. For example, the more communication paths there are from one accelerator card to another, the more sub-communication tasks the total communication task may be divided into, and conversely, the fewer; likewise, the larger the amount of data involved in the total communication task, the more sub-communication tasks it may be divided into.
After total communication task B is divided into sub-communication tasks b1 and b2, and these sub-communication tasks are placed in different communication task execution queues PQ1 and PQ2 respectively, the two sub-communication tasks b1 and b2 may be executed in parallel in the communication task execution queues PQ1 and PQ2.
The execution of total communication task B and its sub-communication tasks must satisfy the following rules: 1. when total communication task B has not yet started executing, sub-communication tasks b1 and b2 should also be in the not-started state; 2. when total communication task B starts executing, sub-communication tasks b1 and b2 should also start executing; 3. other tasks after task B in the communication task queue LQ (for example, C) must wait until task B has finished before they can execute; and 4. when sub-communication tasks b1 and b2 have all finished executing, total communication task B should also finish executing.
Figure 37a shows a flowchart of dividing a total communication task in the communication task queue into multiple sub-communication tasks according to an embodiment of the present disclosure.
Thus, according to an embodiment of the present disclosure, dividing a total communication task in the communication task queue into multiple sub-communication tasks (S3610) includes: in operation S36110, setting a first write flag that allows the total communication task to start executing; in operation S36120, setting a first wait flag that prohibits the sub-communication tasks from starting to execute; and, in operation S36130, when the first write flag has not been executed, executing the first wait flag to prohibit the sub-communication tasks from starting to execute.
Figure 37b shows a schematic diagram of inserting flags into a queue according to an embodiment of the present disclosure. The specific implementation of Figure 37a is described in detail below in conjunction with Figure 37b.
First, to control the execution of communication task B, a write flag must be set; in other words, a write flag must be inserted before the communication task to be executed, here denoted F0 by way of example. Only when execution reaches the write flag F0, or when the write flag F0 is changed to allow execution of the next task, does the subsequent communication task B start executing. If the write flag F0 has not been executed, the corresponding communication task does not start. The write flag may be inserted by means of an atomic operation. An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once started, it runs to completion without any intervening context switch.
Correspondingly, a wait flag f0 may be inserted before each sub-communication task; the wait flag prohibits execution of the sub-communication task that follows it. It should be understood that, although the first write flag F0 and the wait flag f0 in Figure 37b are given different names, the write flag F0 and the wait flag f0 point to the same flag, so as to detect whether that flag has changed. It should also be understood that the insertion positions of the first write flag F0 and the wait flag f0 shown in Figure 37b are merely for ease of understanding; the flags are not necessarily inserted into the sub-communication tasks as shown in Figure 37b.
According to an embodiment of the present disclosure, executing the multiple sub-communication tasks in parallel includes: in response to the first write flag being executed, clearing the first wait flag, so that the multiple sub-communication tasks are executed in parallel.
The write flag F0 preceding the total communication task and the wait flag f0 are associated: only when the write flag F0 allows execution of the subsequent total communication task does the wait flag f0 end and the corresponding sub-communication tasks start to run; if the write flag F0 does not allow execution of the subsequent total communication task, the wait flag f0 keeps the execution of the sub-communication tasks in a waiting state.
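The relationship between the write flag F0 and the wait flag f0 can be sketched with a shared synchronization object: the sub-tasks block on the flag, and executing the write flag releases them all at once. This is a minimal model of the assumed semantics, not the disclosed driver interface; `threading.Event` stands in for the shared flag that both F0 and f0 point to.

```python
# Minimal sketch of the F0/f0 mechanism: sub-communication tasks b1 and b2
# block on the shared flag (the wait side, f0) until the write side (F0)
# is executed, after which both run in parallel.
import threading

f0 = threading.Event()          # the single shared flag behind F0 and f0
results = []
lock = threading.Lock()

def sub_task(name):
    f0.wait()                   # f0: do not start until the flag changes
    with lock:
        results.append(name)    # stand-in for the actual communication work

workers = [threading.Thread(target=sub_task, args=(n,)) for n in ("b1", "b2")]
for w in workers:
    w.start()

# Executing the write flag F0 changes the shared flag, ending the wait;
# both sub-tasks now proceed concurrently.
f0.set()
for w in workers:
    w.join()
print(sorted(results))
```

Before `f0.set()` is called, both worker threads are parked in `f0.wait()`, which models rule 1 above (sub-tasks not started while the total task has not started).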
Figure 38 shows a schematic diagram of queues according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, a second wait flag may be set to prohibit execution of the other communication tasks that follow the total communication task.
As shown in Figure 38, a second wait flag may be inserted after the first write flag. When execution reaches this second wait flag, the other total communication tasks after the current total communication task must be in a waiting state; before the current total communication task has finished executing, the other total communication tasks cannot start.
From the above description it can be seen that when execution reaches the first write flag F0, the total communication task B corresponding to the first write flag F0 starts to execute, that is, the sub-communication tasks b1 and b2 of total communication task B end their waiting state and begin executing; thereafter, when execution reaches the second wait flag F1, the other tasks after total communication task B enter a waiting state and are not executed while total communication task B is executing.
Figure 39 shows a schematic diagram of the second wait flag being modified according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, each time a sub-communication task finishes executing, the second wait flag F1 is modified, until all sub-communication tasks have finished; and in response to all sub-communication tasks having finished, the second wait flag F1 is modified into a wait-end flag, so that the total communication task finishes executing.
Next, as shown in Figure 39, the sub-communication tasks b1 and b2 begin executing in the execution queues PQ. Each time a sub-communication task b1 or b2 finishes, the second wait flag F1 may be modified accordingly, for example incremented by one. The number of times the second wait flag F1 is modified equals the number of sub-communication tasks that have finished executing. Therefore, the second wait flag F1 may initially be given a target value; as the sub-communication tasks b1 and b2 finish executing, the second wait flag F1 gradually approaches this target value, and when the second wait flag F1 reaches the preset target value, all sub-communication tasks b1 and b2 have finished. It should be understood that the second wait flag F1 may be modified in many ways and is not limited to the increment-by-one described above; for example, it may be decremented by one on each modification until it falls below a predetermined threshold. The present disclosure places no limitation on how the second wait flag is modified.
The condition that the second wait flag F1 reaches the target value may also be understood as a wait-end flag, meaning that the current total communication task B has finished executing and other tasks may begin.
It should be understood that in Figures 37b, 38, and 39, although the flag f0 is shown within the sub-communication tasks, this is merely for ease of understanding; the flag f0 actually resides in the communication task queue, so as to monitor the execution of the individual sub-communication tasks.
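The completion counting described above can be sketched as a counter with a preset target value. The class and method names are hypothetical; the point is only the increment-until-target semantics of the second wait flag F1.

```python
# Sketch of the second wait flag F1 as a completion counter: it is
# incremented once per finished sub-communication task, and the total
# task is deemed finished when the counter reaches its preset target.
class SecondWaitFlag:
    def __init__(self, target):
        self.value = 0           # initial state of F1
        self.target = target     # preset target value

    def sub_task_done(self):
        self.value += 1          # one modification per finished sub-task

    def total_task_done(self):
        # Reaching the target acts as the wait-end flag.
        return self.value >= self.target

f1 = SecondWaitFlag(target=2)    # total task B has sub-tasks b1 and b2
f1.sub_task_done()               # b1 finishes
assert not f1.total_task_done()  # B is still running, so task C must wait
f1.sub_task_done()               # b2 finishes
print(f1.total_task_done())      # B is complete; C may start
```

The decrement variant mentioned above would simply start `value` at the target and count down toward the threshold instead.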
When dividing the total communication task into multiple sub-communication tasks, various division schemes are possible: the total communication task may be divided randomly into multiple sub-communication tasks; it may be divided into a fixed number of sub-communication tasks; or it may be divided, according to the number of communication paths, into a number of sub-communication tasks corresponding to the number of processors, and so on.
According to a preferred embodiment of the present disclosure, a total communication task in the task queue may be divided into multiple sub-communication tasks of equivalent execution time.
The equivalence of execution time described above does not mean that each sub-communication task is itself the same size. For example, if the communication speed of each port is 40 Gbps, then for 160 G of data, transmitting it over a single communication path theoretically takes 4 seconds. The 160 G of data may therefore be split into multiple sub-communication tasks, for example 2, 3, or 4. When split into 4 sub-communication tasks, 4 communication paths can be used to transmit the data in parallel, so that theoretically only 1 second is needed to complete the 160 G transfer, and the communication time is only 25% of the original. Clearly, this helps shorten the execution time of the total communication task.
Furthermore, the communication paths do not necessarily all have the same transmission speed; the division of the communication task may therefore take the speed of each path into account in adjusting the amount of data assigned to each sub-communication task. For example, if the transmission speeds of four communication paths are 16 Gbps, 18 Gbps, 22 Gbps, and 24 Gbps, then 160 G of data may be split into four sub-communication tasks of 32 G, 36 G, 44 G, and 48 G respectively, so that each path completes its transfer in 2 seconds, thereby ensuring that all paths finish their sub-communication tasks at the same time or substantially the same time.
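The bandwidth-proportional split above follows directly from assigning each path a share of the data proportional to its speed, which equalizes the transfer times. A short sketch of that arithmetic, using the example figures from the text:

```python
# Split a transfer proportionally to per-path bandwidth so that every
# path finishes at (approximately) the same time.
def split_by_bandwidth(total_gb, speeds_gbps):
    total_speed = sum(speeds_gbps)
    return [total_gb * s / total_speed for s in speeds_gbps]

speeds = [16, 18, 22, 24]                      # Gbps, example values
chunks = split_by_bandwidth(160, speeds)
print(chunks)                                  # [32.0, 36.0, 44.0, 48.0]

# Each path then needs the same transfer time (in the text's units,
# where a 40 Gbps port moves 160 G in 4 seconds): chunk / speed = 2.
times = [c / s for c, s in zip(chunks, speeds)]
print(times)                                   # [2.0, 2.0, 2.0, 2.0]
```

With equal-sized chunks instead, the 16 Gbps path would take 2.5 seconds while the 24 Gbps path finishes in about 1.67 seconds, so the total task would be gated by the slowest path; the proportional split removes that imbalance.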
Different communication paths may correspond to different topologies, as described in Figures 12a-12c.
According to an embodiment of the present disclosure, the total communication task may be divided into multiple sub-communication tasks in response to the data volume of the total communication task exceeding a certain threshold. It should be understood that dividing a total communication task into multiple sub-communication tasks also requires considering the total amount of data involved in the task; if the total amount of data involved in a given total communication task is small, dividing it is unnecessary.
According to an embodiment of the present disclosure, the method of the present disclosure further includes: in response to an error occurring in one or more sub-communication tasks, re-running the sub-communication task in which the error occurred.
When multiple sub-communication tasks are executed in the communication task execution queues PQ, errors may occur, such as a transmission failure caused by a data line, a data throughput error, or packet loss during data transmission. In a traditional scheme in which the total communication task is not divided into sub-communication tasks, once an error occurs during execution, the entire total communication task must be re-executed, which seriously wastes processing capacity and degrades overall system performance.
In the solution of the present disclosure, since the multiple sub-communication tasks reside in different execution queues that run independently without interfering with one another, an error occurring in one sub-communication task does not affect the execution of the others. Therefore, if an error occurs in the execution of one sub-communication task, only that sub-communication task needs to be re-executed, without re-running all the sub-communication tasks or the total communication task as a whole. While the errored sub-communication task is re-running, the other queues may be idle or may execute other sub-communication tasks concurrently. Dividing a total communication task into multiple parallel sub-communication tasks thus improves the utilization of system processing resources and improves processing efficiency.
According to an embodiment of the present disclosure, in response to an error occurring in one or more sub-communication tasks, the sub-communication task in which the error occurred is further split into multiple subtasks for parallel execution.
When an error occurs in a sub-communication task and it must be re-executed, the errored sub-communication task may be added to the communication task queue LQ as a new total communication task, further divided into multiple subtasks, and re-executed once in multiple parallel execution queues PQ. Further dividing the errored sub-communication task into multiple subtasks for re-execution further improves the operating efficiency of the system, so that even when an error occurs in the execution of a sub-communication task, the time and processing resources spent correcting it are greatly reduced.
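The retry-only-the-failed-part behavior described above can be sketched as follows. Everything here is illustrative: `run_sub_task` is a hypothetical stand-in for a real communication call, and a single injected failure models a transient transmission error.

```python
# Sketch of partial retransmission: sub-tasks run independently, and only
# a failed sub-task is re-issued; successful sub-tasks are never re-run.
_failed_once = {"b2"}   # hypothetical: sub-task b2 fails on its first attempt

def run_sub_task(name):
    # Stand-in for an actual transfer; returns False on a transient error.
    if name in _failed_once:
        _failed_once.discard(name)
        return False
    return True

def execute(sub_tasks):
    done = set()
    pending = list(sub_tasks)
    attempts = 0
    while pending:
        task = pending.pop(0)
        attempts += 1
        if run_sub_task(task):
            done.add(task)
        else:
            pending.append(task)    # re-issue only the failed sub-task
    return done, attempts

done, attempts = execute(["b1", "b2"])
print(done, attempts)               # b1 ran once, b2 ran twice
```

In the traditional scheme the single error would cost a full re-run of the whole task; here b1's completed transfer is preserved, and in the embodiment above the re-issued b2 could itself be split further across queues.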
It can thus be seen that, based on the above solution of the present disclosure, when the amount of communication data is large, a communication task can be split into any number of sub-tasks that are dispatched to multiple different communication-task execution queues for concurrent execution, thereby improving bandwidth utilization.
Further, different communication algorithms can be selected according to the maximum bandwidth of the specific topological connection between chips, further optimizing the communication efficiency for large data volumes. A logical communication topology can thus be flexibly constructed for data communication according to the physical topological connection between the accelerator cards, further improving communication efficiency.
The sub-communication tasks executed in each communication-task execution queue can also be supervised independently: if a sub-communication task fails, only that sub-communication task needs to be re-issued or re-executed, without re-executing the entire communication task. Partial retransmission of a communication task can therefore be performed transparently to the user, reducing the cost of fault tolerance and retransmission and improving overall communication efficiency.
The accelerator card that issues the task may be any accelerator card in the accelerator card system. According to an embodiment of the present disclosure, since the communication task queue only performs wait and write operations while the actual communication tasks run in the communication-task execution queues, the communication task queue can correspond to any accelerator card in the system. This helps reduce the probability of developer programming errors, and an accelerator card with a light workload can be selected to perform the wait and write control of the communication task queue.
The present disclosure also provides an electronic device, comprising: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method described above.
The technical solutions of the present disclosure can be applied in the field of artificial intelligence and implemented as, or in, an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
FIG. 33 is a schematic structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
FIG. 33 shows a combined processing apparatus 3300, which includes the aforementioned acceleration unit 3301, an interconnection interface 3302, another processing apparatus 3303, and a storage apparatus 3304. The computing apparatus according to the present disclosure interacts with the other processing apparatus to jointly complete an operation specified by the user.
The other processing apparatus includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor; the number of processors it includes is not limited. The other processing apparatus serves as the interface between the machine learning computing apparatus and external data and control, performing basic control such as data transfer and starting and stopping the machine learning computing apparatus; the other processing apparatus can also cooperate with the machine learning computing apparatus to complete computing tasks together.
The interconnection interface is used to transfer data and control instructions between the computing apparatus (including, for example, a machine learning computing apparatus) and the other processing apparatus. The computing apparatus obtains the required input data from the other processing apparatus and writes it into an on-chip storage apparatus of the computing apparatus; it can obtain control instructions from the other processing apparatus and write them into an on-chip control cache; and it can also read data from its own storage module and transfer the data to the other processing apparatus.
Optionally, the structure may further include a storage apparatus 3304, which is connected to the computing apparatus and the other processing apparatus, respectively. The storage apparatus is used to save data of the computing apparatus and the other processing apparatus, and is especially suitable for data that cannot be fully held in the internal storage of the computing apparatus or the other processing apparatus.
The combined processing apparatus can serve as an SoC (system-on-chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, the present disclosure also discloses a chip package structure, which includes the above-mentioned chip.
In some embodiments, the present disclosure also discloses a board card, which includes the above chip package structure. Referring to FIG. 34, an exemplary board card 3400 is provided. In addition to the above-mentioned chip, the board card 3400 may further include other supporting components, including but not limited to: a storage device 3401, an interface apparatus 3407, a control device 3405, and an acceleration unit 3406.
The storage device is connected to the chip in the chip package structure through a bus and is used for storing data. The storage device may include multiple groups of storage units 3402, and each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR doubles the speed of SDRAM without requiring a higher clock frequency: it allows data to be read out on both the rising edge and the falling edge of the clock pulse, making DDR twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of storage units includes multiple double-data-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
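As an illustration of the arithmetic implied by such a 72-bit controller (64 data bits plus 8 ECC bits, two transfers per clock), the peak payload bandwidth of one controller can be computed as follows; the 1.6 GHz clock is an assumed example, not a figure from the disclosure.

```python
def ddr_effective_bandwidth(clock_hz, bus_bits=72, ecc_bits=8):
    """Peak payload bandwidth in bytes/s for one DDR controller.

    DDR transfers data on both clock edges, so each cycle moves two beats;
    only the data lanes (bus_bits - ecc_bits) carry payload, the rest is ECC.
    """
    data_bits = bus_bits - ecc_bits          # 64 of the 72 lanes carry data
    return clock_hz * 2 * data_bits // 8     # 2 beats/cycle, 8 bits/byte

# Example with an assumed 1.6 GHz DDR4 clock (3200 MT/s):
bw = ddr_effective_bandwidth(1_600_000_000)
print(bw / 1e9, "GB/s")  # → 25.6 GB/s per controller
```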
The interface apparatus is electrically connected to the chip in the chip package structure and is used to implement data transmission between the chip and an external device 3408 (for example, a server or a computer). For example, in one embodiment, the interface apparatus may be a standard PCIe interface: data to be processed is transferred by the server to the chip through the standard PCIe interface. In another embodiment, the interface apparatus may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface apparatus.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, it may drive multiple loads and can therefore be in different working states such as heavy-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
The electronic device or apparatus includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, a headset, mobile storage, a wearable device, a vehicle, a household appliance, and/or medical equipment.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure. The descriptions of the above embodiments are only intended to help understand the method of the present disclosure and its core ideas. At the same time, persons of ordinary skill in the art, based on the ideas of the present disclosure, may make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (16)

  1. A method for executing an asynchronous task, comprising:
    dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue;
    executing the plurality of sub-tasks in parallel; and
    in response to the sub-tasks being completely executed, causing the total task to be completely executed.
  2. The method according to claim 1, wherein dividing a total task in the task queue into a plurality of sub-tasks comprises:
    inserting, in the queue, a first write flag that allows the total task to start executing;
    inserting, in the sub-task queue, a first wait flag that prohibits the sub-task from starting to execute;
    when the first write flag has not been executed, executing the first wait flag so as to prohibit the sub-task from starting to execute.
  3. The method according to claim 2, wherein executing the plurality of sub-tasks in parallel comprises:
    in response to the first write flag being executed, turning off the first wait flag, thereby executing the plurality of sub-tasks in parallel.
  4. The method according to any one of claims 1-3, further comprising: inserting a second wait flag in the total task queue so as to prohibit execution of other tasks following the total task.
  5. The method according to claim 4, further comprising:
    each time a sub-task is completely executed, modifying the second wait flag, until all sub-tasks are completely executed;
    in response to all sub-tasks being completely executed, modifying the second wait flag into a wait-end flag, so that the total task is completely executed.
  6. The method according to any one of claims 1-5, wherein a total task in the task queue is divided into a plurality of sub-tasks of equivalent execution time.
  7. The method according to any one of claims 1-6, wherein, in response to the data amount of the total task exceeding a specific threshold, the total task is divided into a plurality of sub-tasks.
  8. The method according to any one of claims 1-7, further comprising: in response to one or more sub-tasks failing, re-running the failed sub-task.
  9. The method according to any one of claims 1-8, wherein, in response to one or more sub-tasks failing, the failed sub-task is further split into a plurality of child tasks so as to facilitate parallel execution.
  10. The method according to any one of claims 1-9, wherein the task queue is a communication task queue.
  11. An apparatus for executing an asynchronous task, comprising:
    a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue;
    a sub-task execution unit configured to execute the plurality of sub-tasks in parallel; and
    an ending unit configured to, in response to the sub-tasks being completely executed, cause the total task to be completely executed.
  12. A chip, comprising the apparatus according to claim 11.
  13. An electronic device, comprising the chip according to claim 12.
  14. A method for executing a communication task in an accelerator card system, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with one another, and one accelerator card among the plurality of accelerator cards is capable of communicating with another accelerator card through a communication path, the method comprising:
    establishing a communication task queue, the communication task queue including a communication task and a state identifier for monitoring the execution state of the communication task;
    establishing a communication-task execution queue for executing the communication task between accelerator cards through the communication path;
    in response to the execution of the communication task, changing the state identifier so as to monitor the execution state of the communication task.
  15. An electronic device, comprising:
    one or more processors; and
    a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method according to any one of claims 1-10 and 14.
  16. A computer-readable storage medium, comprising computer-executable instructions which, when run by one or more processors, perform the method according to any one of claims 1-10 and 14.
PCT/CN2021/138702 2020-12-30 2021-12-16 Method for executing asynchronous task, device, and computer program product WO2022143194A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011610670.7A CN114691311A (en) 2020-12-30 2020-12-30 Method, device and computer program product for executing asynchronous task
CN202011610670.7 2020-12-30
CN202110055097.6 2021-01-15
CN202110055097.6A CN114764374A (en) 2021-01-15 2021-01-15 Method and equipment for executing communication task in accelerator card system

Publications (1)

Publication Number Publication Date
WO2022143194A1 true WO2022143194A1 (en) 2022-07-07

Family

ID=82260239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138702 WO2022143194A1 (en) 2020-12-30 2021-12-16 Method for executing asynchronous task, device, and computer program product

Country Status (1)

Country Link
WO (1) WO2022143194A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339944A (en) * 2023-03-14 2023-06-27 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358003A (en) * 2016-08-31 2017-01-25 华中科技大学 Video analysis and accelerating method based on thread level flow line
CN109086138A (en) * 2018-08-07 2018-12-25 北京京东金融科技控股有限公司 Data processing method and system
CN109828833A (en) * 2018-11-02 2019-05-31 上海帆一尚行科技有限公司 A kind of queuing system and its method of neural metwork training task
CN111090511A (en) * 2019-12-24 2020-05-01 北京推想科技有限公司 Task processing method and device and computer readable storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339944A (en) * 2023-03-14 2023-06-27 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium
CN116339944B (en) * 2023-03-14 2024-05-17 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US11809360B2 (en) Network-on-chip data processing method and device
TW201805858A (en) Method for performing neural network computation and apparatus
WO2012130134A1 (en) Computer system
WO2022143194A1 (en) Method for executing asynchronous task, device, and computer program product
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN115994107B (en) Access acceleration system of storage device
CN114764374A (en) Method and equipment for executing communication task in accelerator card system
WO2023040197A1 (en) Cross-node communication method and apparatus, device, and readable storage medium
CN117687956B (en) Multi-acceleration-card heterogeneous server and resource link reconstruction method
CN117493237B (en) Computing device, server, data processing method, and storage medium
CN111767995A (en) Operation method, device and related product
EP4141685A1 (en) Method and device for constructing communication topology structure on basis of multiple processing nodes
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
CN114691311A (en) Method, device and computer program product for executing asynchronous task
CN110413564A (en) AI trains inference service device, system and method
US20210182110A1 (en) System, board card and electronic device for data accelerated processing
WO2022057600A1 (en) Acceleration unit, acceleration assembly, acceleration device, and electronic device
US12050545B2 (en) Method and device for constructing communication topology structure on basis of multiple processing nodes
CN212846786U (en) Accelerating unit and electronic equipment
CN111258732A (en) Data processing method, data processing device and electronic equipment
CN111210011B (en) Data processing device and related product
CN212846785U (en) Acceleration assembly, acceleration device and electronic equipment
CN105207823B (en) A kind of network topology structure
WO2024119869A1 (en) Method for executing inter-chip communication task, and related product
CN111767999A (en) Data processing method and device and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913935

Country of ref document: EP

Kind code of ref document: A1