WO2022143194A1 - Method for executing asynchronous task, device, and computer program product - Google Patents

Method for executing asynchronous task, device, and computer program product

Info

Publication number
WO2022143194A1
WO2022143194A1 (PCT/CN2021/138702)
Authority
WO
WIPO (PCT)
Prior art keywords
task
acceleration
sub
tasks
communication
Prior art date
Application number
PCT/CN2021/138702
Other languages
French (fr)
Chinese (zh)
Inventor
柴安晨
吕尧
梁帆
Original Assignee
安徽寒武纪信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202011610670.7A external-priority patent/CN114691311A/en
Priority claimed from CN202110055097.6A external-priority patent/CN114764374A/en
Application filed by 安徽寒武纪信息科技有限公司 filed Critical 安徽寒武纪信息科技有限公司
Publication of WO2022143194A1 publication Critical patent/WO2022143194A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt

Definitions

  • the present disclosure relates to the field of computers, and more particularly, to serial and parallel execution of tasks.
  • training tasks including computing tasks, communication tasks, control logic tasks, etc.
  • special acceleration chips for execution, such as GPU, MLU, TPU, etc.
  • the network training task will be sent by the CPU to the accelerator card for execution in an asynchronous form.
  • the accelerator card has the concept of a task queue.
  • the tasks on the same queue will be executed in the order in which they are issued; therefore, tasks on the same queue have dependencies, while tasks on different queues can execute concurrently based on the availability of hardware resources. However, current training tasks are usually executed in only one queue, which inevitably affects the execution efficiency of the tasks.
  • the current mainstream frameworks (such as Tensorflow, Pytorch) only use a dedicated communication queue (comm_queue) to perform communication tasks.
  • when the communication library responsible for the communication task acquires the task, it usually sends the task directly to the framework's comm_queue or to the communication library's internal_queue for execution; an example is the NCCL communication library, which is responsible for communication between GPUs.
  • the communication tasks are all executed in one queue, and when an error occurs in a communication task, the communication task needs to be re-executed from the beginning, thereby reducing overall communication efficiency.
  • a method for executing an asynchronous task, comprising: dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; executing the plurality of sub-tasks in parallel; and, in response to the completion of the execution of the sub-tasks, completing the execution of the total task.
  • an apparatus for executing asynchronous tasks includes: a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, and each sub-task is in a different sub-task queue;
  • the sub-task execution unit is configured to execute the plurality of sub-tasks in parallel;
  • the end unit is configured to complete the execution of the total task in response to completion of the sub-task execution.
  • a chip comprising the apparatus as described above.
  • an electronic device including the chip as described above.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when run by the one or more processors, cause the electronic device to execute the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
  • the technical solution of the present disclosure can allocate a total task to different sub-task queues, thereby accelerating the execution of the total task.
  • even if there is an error in the execution of one sub-task queue, there is no need to re-execute all tasks, thereby reducing the cost of task fault tolerance or retransmission, reducing the burden of task execution, and realizing task fault tolerance or retransmission processing without the user's perception.
  • One purpose of the present disclosure is to overcome the defects in the prior art that a task cannot be issued to multiple queues for parallel execution, communication or computing resources cannot be fully utilized, and fault tolerance is low.
  • a method for performing a communication task in an accelerator card system, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with each other, and one accelerator card among the plurality of accelerator cards can communicate with another accelerator card through a communication path; the method includes: establishing a communication task queue, the communication task queue including a communication task and a state identifier for monitoring the execution state of the communication task; establishing a communication task execution queue for executing communication tasks between the accelerator cards through the communication path; and, in response to the execution of the communication task, changing the state identifier to monitor the execution state of the communication task.
  • an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to perform the method as described above.
  • a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • At least one beneficial effect of the technical solutions of the present disclosure is to distinguish the communication task queue from the communication task execution queue, so that tasks such as fault tolerance or retransmission can be performed without the user's perception.
  • the technical solution of the present disclosure can also allocate one general communication task to different sub-communication task queues, thereby accelerating the execution of the general communication task.
  • even if an error occurs in the execution of a certain sub-communication task queue, it is not necessary to re-execute all the sub-communication tasks, thereby reducing the burden of task execution.
  • Fig. 1a shows a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure
  • Figure 1b shows a schematic diagram of a task issuing queue and a task execution queue according to an embodiment of the present disclosure
  • Figure 2a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure
  • Fig. 2b shows a schematic diagram of inserting a flag in a queue according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of a queue according to another embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of the inserted second waiting flag being modified according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure
  • Figure 6 shows a combined processing device
  • Figure 7 shows an exemplary board
  • FIG. 8a is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • FIGS. 8b, 9, 10, 11, and 12a-12c are multiple schematic structural diagrams of acceleration units according to embodiments of the present disclosure.
  • FIGS. 13-18 are schematic structural diagrams of acceleration components according to an embodiment of the present disclosure.
  • FIGS. 19a-19c are schematic diagrams showing acceleration components as network topologies
  • FIG. 20 is a schematic diagram of an acceleration device including a plurality of acceleration units according to an embodiment of the present disclosure
  • FIG. 21 is a schematic diagram of a network topology corresponding to an acceleration device in an embodiment
  • FIG. 22 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment
  • FIGS. 23-27 are multiple schematic diagrams of an acceleration device including multiple acceleration components according to an embodiment of the present disclosure.
  • FIG. 29 is a schematic diagram of a matrix network topology based on the wireless extension of an acceleration device
  • FIG. 30 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
  • FIG. 31 is a schematic diagram of a network topology of another acceleration device
  • FIG. 32 is a schematic diagram of a network topology of another acceleration device
  • FIG. 35 shows a flowchart of a method for performing a communication task in an accelerator card system according to an embodiment of the present disclosure
  • Figure 36a shows a flowchart of a method for performing a communication task according to one embodiment of the present disclosure
  • Figure 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure
  • Figure 37a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure
  • Figure 37b shows a schematic diagram of inserting a flag in a queue according to one embodiment of the present disclosure
  • Figure 38 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • FIG. 39 is a schematic diagram illustrating that the insertion of the second waiting flag is modified according to an embodiment of the present disclosure.
  • communication and computing tasks are usually dispatched as asynchronous tasks to different task queues (queues) on acceleration chips (such as GPU, MLU) for execution; asynchronous tasks on the same queue are executed serially in the order in which they are dispatched, while tasks on different queues can be executed concurrently.
  • although the communication task is used as an example above, the tasks in the present disclosure are not limited to communication tasks, but also involve various tasks such as the operation or training of neural networks.
  • Fig. 1a shows a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure
  • Fig. 1b shows a schematic diagram of a task delivery queue and a task execution queue according to an embodiment of the present disclosure.
  • the method of the present disclosure includes: in operation S110, dividing a total task in the task queue into a plurality of sub-tasks, and each sub-task is in a different sub-task queue; in operation S120, executing in parallel the plurality of sub-tasks; and in operation S130, in response to the execution of the sub-tasks being completed, the execution of the total task is completed.
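  • As a rough Python sketch (illustrative only, not the patent's implementation; the names `split_task` and `run_total_task` are invented here), operations S110-S130 amount to splitting the work, running the pieces in parallel, and treating the total task as complete only when every piece has finished:

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(total_task, n):
    """S110: divide a total task (here, a list of work items) into n
    sub-tasks, each destined for a different sub-task queue."""
    return [total_task[i::n] for i in range(n)]

def run_total_task(total_task, n=2, work=sum):
    sub_tasks = split_task(total_task, n)          # S110
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(work, sub_tasks))  # S120: parallel execution
    return results                                 # S130: all sub-tasks done

print(run_total_task(list(range(10))))  # [20, 25]: two partial sums
```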
  • the task delivery queue can receive multiple tasks, such as tasks A, B and C. When these tasks A, B and C enter the task allocation queue LQ, they are serialized, and the execution order is A, B, and C. That is, while task A is being executed, tasks B and C need to wait; task B can only be executed after task A is executed, and task C needs to wait for task B to be executed before it can be executed.
  • Such a task execution method cannot make full use of the parallel running resources of the system, especially when the execution time of a certain task is particularly long or the amount of communication data is particularly large, the execution of other tasks will be obviously blocked, and the system performance will be affected.
  • the tasks in the task allocation queue LQ can be regarded as a total task, and the total task is divided into multiple sub-tasks executed in parallel, and placed in the task execution queue PQ for execution.
  • the execution efficiency of the task can be significantly improved.
  • the overall task B can be divided into a plurality of sub-tasks b1, b2, etc., and two sub-tasks b1 and b2 are used as an example for description here.
  • the number of divided tasks may be other numbers, which depend on the execution capability for the divided tasks and/or the size of the total task. For example, if the execution capability for each sub-task is strong, the total task can be divided into a smaller number of sub-tasks; and, for the same execution capability, if a certain total task is larger, that total task can be divided into a greater number of sub-tasks.
  • the two sub-tasks b1 and b2 can be executed in parallel in the execution queues PQ1 and PQ2 .
  • the relationship between the total task B and the sub-tasks needs to meet the following rules: 1. when the total task B has not started to be executed, the sub-tasks b1 and b2 should also be in the not-started state; 2. when the total task B starts to be executed, the sub-tasks b1 and b2 should also start to be executed; 3. other tasks (such as C) after task B in the task allocation queue LQ need to wait for the completion of task B before they can be executed; and 4. when the sub-tasks b1 and b2 are all executed, the total task B should also be completed.
  • FIG. 2a shows a flowchart of dividing a total task in the task queue into a plurality of sub-tasks according to an embodiment of the present disclosure.
  • dividing a total task in the task queue into a plurality of sub-tasks (S110) includes: in operation S1110, inserting, in the queue, a first write flag that allows the total task to start executing; in operation S1120, inserting, in the sub-task queues, a first waiting flag that prohibits the sub-tasks from starting execution; and, in operation S1130, when the first write flag has not been executed, the first waiting flag prohibits the sub-tasks from starting execution.
  • FIG. 2b shows a schematic diagram of inserting a marker in a queue according to an embodiment of the present disclosure. The specific implementation of FIG. 2a will be described in detail below in conjunction with FIG. 2b.
  • a write flag, exemplarily denoted F0, is inserted before the task to be executed; only when the write flag F0 is executed, or when the write flag F0 is changed to a state that allows execution, does the subsequent task B start to be executed. While the write flag F0 has not been executed, the corresponding task does not start executing.
  • the write flag can be inserted through an Atomic Operation.
  • the so-called atomic operation refers to an operation that is not interrupted by the thread scheduling mechanism; once this operation starts, it runs to the end without any context switching in between.
  • a waiting flag f0 may be inserted before each sub-task, the waiting flag indicating that execution of the sub-tasks after the flag is prohibited. It needs to be understood that although the first write flag F0 and the waiting flag f0 in FIG. 2b have different names, they point to the same flag, so that a change to that same flag can be detected.
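  • The relationship between F0 and f0 as two names for one shared flag can be sketched with a shared `threading.Event` (an illustrative stand-in; the patent's flags are written via atomic operations on the accelerator, not Python events):

```python
import threading

flag = threading.Event()  # F0 and f0 are two views of this one shared flag
results = []

def sub_task(name):
    flag.wait()           # f0: sub-task is prohibited until the flag changes
    results.append(name)

workers = [threading.Thread(target=sub_task, args=(n,)) for n in ("b1", "b2")]
for w in workers:
    w.start()

# ... tasks ahead of B in the allocation queue would run here ...
flag.set()                # F0 is "executed": b1 and b2 may now start
for w in workers:
    w.join()

print(sorted(results))    # ['b1', 'b2']
```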
  • executing the plurality of sub-tasks in parallel includes: in response to the first write flag being executed, turning off the first waiting flag, thereby executing the plurality of sub-tasks in parallel .
  • the write flag F0 before the total task and the waiting flag f0 before the sub-tasks are related: only when the write flag F0 allows the execution of the subsequent total task is the waiting flag f0 ended and the corresponding sub-task started; if the write flag F0 does not allow the execution of the subsequent total task, the waiting flag f0 keeps the execution of the sub-tasks in a waiting state.
  • FIG. 3 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • a second waiting flag may be inserted into the total task queue to prohibit execution of other tasks after the total task.
  • a second waiting flag can be inserted after the first write flag; when the second waiting flag is executed, it indicates that other total tasks after the current total task need to be in a waiting state, and those other total tasks cannot start to execute until the current total task is executed.
  • the total task B corresponding to the first write flag F0 starts to be executed, that is, the sub-tasks b1 and b2 end the waiting state and start execution; after that, when the second waiting flag F1 in the allocation queue is executed, other tasks after the total task B in the allocation queue enter the waiting state and are not executed while the total task B is being executed.
  • FIG. 4 is a schematic diagram illustrating that the insertion of the second waiting flag is modified according to an embodiment of the present disclosure.
  • each time one sub-task is executed, the second waiting flag F1 is modified, until all sub-tasks are executed; in response to all sub-tasks having been executed, the second waiting flag F1 is changed to a waiting-end flag, so that the execution of the total task is completed.
  • each subtask b1 and b2 is executed in the execution queue PQ.
  • the second waiting flag F1 can be modified accordingly; for example, the second waiting flag F1 can be incremented by one.
  • the number of times that the second waiting flag F1 is modified is the same as the number of times that the sub-tasks are executed.
  • the second waiting flag F1 can initially be set with a target value; as the execution of the sub-tasks b1 and b2 completes, the second waiting flag F1 gradually approaches the target value, and when the second waiting flag F1 reaches the preset target value, it means that all the sub-tasks b1 and b2 have been executed. It should be understood that there can be many ways to modify the second waiting flag F1, and it is not limited to the "adding one" described above; for example, one can be subtracted on each modification until the second waiting flag F1 is less than a predetermined threshold. The present disclosure does not place any limitation on how the second waiting flag is modified.
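  • A minimal sketch of the assumed F1 semantics: a counter with a preset target value, incremented atomically once per completed sub-task, that releases waiting tasks when the target is reached (the class and method names are invented for illustration):

```python
import threading

class WaitFlag:
    """Illustrative model of the second waiting flag F1: a counter with a
    preset target value, incremented once per completed sub-task."""

    def __init__(self, target):
        self.target = target
        self.value = 0
        self._cond = threading.Condition()

    def sub_task_done(self):
        # Modify F1 (here, "add one") under a lock, i.e. atomically.
        with self._cond:
            self.value += 1
            if self.value >= self.target:
                self._cond.notify_all()

    def wait_total_task(self):
        # Tasks after the total task block until F1 reaches its target.
        with self._cond:
            self._cond.wait_for(lambda: self.value >= self.target)

f1 = WaitFlag(target=2)   # two sub-tasks, b1 and b2
f1.sub_task_done()        # b1 finishes
f1.sub_task_done()        # b2 finishes: F1 reaches the target value
f1.wait_total_task()      # returns immediately; total task B is complete
print(f1.value)           # 2
```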
  • the second waiting flag F1 reaches the target value can also be understood as a waiting end flag, which means that the current total task B has been executed, and other tasks can be executed.
  • the total task can be randomly divided into multiple sub-tasks; the total task can be divided into a fixed number of sub-tasks; or the number of processors serving the execution queue PQ can be used to divide the total task into a number of sub-tasks corresponding to the number of processors, and so on.
  • a total task in the task queue may be divided into multiple sub-tasks with equivalent execution time.
  • each subtask itself is the same size.
  • each processing core can undertake 25M operations.
  • the 4 processing cores will complete the operation in the same time, thereby reducing the total operation time as much as possible.
  • if a certain processing core also participates in other computing work, its processing capability is lower than that of the other processing cores, so the respective processing capabilities of the four processing cores should be considered when allocating the corresponding tasks, so that the time for each processing core to complete its operation is the same or about the same, which helps reduce the overall run time of the total task. Therefore, the principle of dividing the total task into multiple sub-tasks is to divide the tasks according to the capabilities of the resources that execute them, so that the multiple resources are equivalent in processing time.
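  • The capability-proportional division described above can be sketched as follows (the helper name and the capability numbers are invented for illustration; only the proportional-split principle comes from the text):

```python
def divide_by_capability(total_ops, capabilities):
    """Split total_ops so that each core's share is proportional to its
    capability, making per-core processing times roughly equal."""
    total_cap = sum(capabilities)
    shares = [total_ops * c // total_cap for c in capabilities]
    shares[-1] += total_ops - sum(shares)  # hand any remainder to the last core
    return shares

# Four equally capable cores: 100M operations -> 25M each, as in the example.
print(divide_by_capability(100, [1, 1, 1, 1]))  # [25, 25, 25, 25]
# If the fourth core is busy with other work (half capability), it gets less.
print(divide_by_capability(100, [2, 2, 2, 1]))
```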
  • in response to the data amount of the total task exceeding a certain threshold, the total task is divided into a plurality of sub-tasks. It should be understood that dividing the total task into multiple sub-tasks also needs to consider the total amount of data involved in each task. If the total amount of data involved in a certain task is small, and the processing time of the total task is less than the time for sending out the data generated by executing the total task, it is not necessary to divide the total task. Similarly, if the time for reading the data required by the total task constitutes a bottleneck, that is, the time for reading the data is greater than the time for executing the total task, there is no need to further divide the total task.
  • the method of the present disclosure further includes: in response to one or more sub-tasks having an error, re-running the faulty sub-task.
  • errors may occur, such as an error in the operation result during the execution process, an error in data throughput, an error in data transmission, and so on.
  • in the traditional scheme, if the total task is not divided into multiple sub-tasks, once an error occurs during the execution of the task, the entire total task needs to be re-executed, which seriously wastes processing power and causes the overall performance of the system to decline.
  • the sub-task in which the error occurred is further divided into a plurality of sub-tasks for parallel execution.
  • the sub-task with the error can be added to the task allocation queue LQ as a new total task, and that sub-task is further divided into multiple sub-tasks, which are then executed in parallel.
  • the sub-task with the error is re-executed in the execution queue PQ.
  • the sub-task with errors is further divided into multiple sub-tasks for re-execution, which further improves the operating efficiency of the system, so that even if an error occurs in the execution of a sub-task, the time and processing resources for correcting the error are greatly reduced.
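  • The fault-tolerance idea of re-dividing and re-executing only the faulty sub-task, while keeping the results of completed sub-tasks, can be sketched like this (a toy model with an invented transient failure; not the patent's implementation):

```python
def run_sub_task(items, poison=None):
    """Process one sub-task; raises if it contains the poisoned item
    (a stand-in for a transient execution error)."""
    if poison is not None and poison in items:
        raise RuntimeError(f"error while processing item {poison}")
    return sum(items)

def run_with_retry(total_task, n=2, poison=None):
    sub_tasks = [total_task[i::n] for i in range(n)]
    results = []
    for st in sub_tasks:
        try:
            results.append(run_sub_task(st, poison=poison))
        except RuntimeError:
            # Only the faulty sub-task is re-divided and re-executed;
            # sub-tasks that already completed are never run again.
            halves = [st[::2], st[1::2]]
            results.append(sum(run_sub_task(h) for h in halves))
    return sum(results)

# Item 4 fails once; only its sub-task [0, 2, 4, 6, 8] is redone, in halves.
print(run_with_retry(list(range(10)), poison=4))  # 45
```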
  • the allocation queue can be the communication queue used in the deep learning framework, such as the dedicated communication queue (Comm_queue) used in Tensorflow and Pytorch, and the execution queue can be the execution queue in the communication library, for example, the internal execution queue (Internal_queue) in the NCCL communication library.
  • the apparatus includes: a dividing unit M510, configured to divide a total task in the task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; a sub-task execution unit M520, configured to execute the plurality of sub-tasks in parallel; and an end unit M530, configured to complete the execution of the total task in response to the completion of the execution of the sub-tasks.
  • the present disclosure also provides a chip including the device shown in FIG. 5 .
  • the present disclosure also provides an electronic device including the chip as described above.
  • the present disclosure also provides an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein which, when executed by the one or more processors, cause the electronic device to execute the method as described above.
  • the present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the technical solutions of the present disclosure can be applied to the field of artificial intelligence, and are implemented as or in an artificial intelligence chip.
  • the chip can exist alone or can be included in a computing device.
  • FIG. 6 shows a combined processing device 600 that includes the aforementioned computing device 602 , a general interconnection interface 604 , and other processing devices 606 .
  • the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • FIG. 6 is a schematic diagram of a combined processing device.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses a neural network to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the machine learning computing device and external data and control, including data handling, to complete basic controls such as starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • a universal interconnect interface for transferring data and control instructions between computing devices (including, for example, machine learning computing devices) and other processing devices.
  • the computing device obtains the required input data from other processing devices and writes it into the storage device on the computing device chip; it can obtain control instructions from other processing devices and write them into the control cache on the computing device chip; it can also read the data in the storage module of the computing device and transmit it to other processing devices.
  • the structure may further include a storage device 608, and the storage device is respectively connected to the computing device and the other processing device.
  • the storage device is used to save the data in the computing device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, improving the processing speed and reducing the overall power consumption.
  • the general interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, monitor, mouse, keyboard, network card, or Wi-Fi interface.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above chip package structure.
  • a board card is provided.
  • the above-mentioned board card may also include other supporting components, including but not limited to: a storage device 704 , an interface device 706 and a control device 708.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the memory device may include groups of memory cells 710. Each group of the memory cells is connected to the chip through a bus. It can be understood that each group of the storage units may be DDR SDRAM (Double Data Rate SDRAM, double-data-rate synchronous dynamic random access memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM.
  • the storage device may include four sets of the storage units. Each group of the memory cells may include a plurality of DDR4 granules (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, and 64 bits of the above 72-bit DDR4 controllers are used for data transmission, and 8 bits are used for ECC verification. In one embodiment, each set of said memory cells includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip for controlling data transmission and data storage of each of the memory cells.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 712 (eg, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces, and the present disclosure does not limit the specific forms of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • MCU Micro Controller Unit
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or device, which includes the above board.
  • electronic equipment or devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, home appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound scanners and/or electrocardiographs.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection of devices or units, which may be electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes: USB flash drives (U disks), read-only memory (ROM), random access memory (RAM), removable hard disks, magnetic disks, optical disks, and other media that can store program codes.
  • the term “if” may be contextually interpreted as “when” or “once” or “in response to determining” or “in response to detecting”.
  • the phrases “if it is determined” or “if the [described condition or event] is detected” may be interpreted, depending on the context, to mean “once it is determined”, “in response to the determination”, “once the [described condition or event] is detected”, or “in response to detection of the [described condition or event]”.
  • FIG. 35 shows a method for performing a communication task in an accelerator card system according to an embodiment of the present disclosure, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with each other, and one of the plurality of accelerator cards can communicate with another accelerator card through a communication path. The method includes: in operation S3510, establishing a communication task queue, where the communication task queue includes a communication task and a status identifier for monitoring the execution state of the communication task; in operation S3520, establishing a communication task execution queue for executing the communication task between the accelerator cards through a communication path; and, in operation S3530, in response to the execution of the communication task, changing the status identifier to monitor the execution status of the communication task.
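The queue-and-status mechanism of operations S3510 through S3530 can be sketched as follows. This is a minimal illustration only; the class names, status values, and queue structures are assumptions, not details taken from the disclosure.

```python
from collections import deque
from enum import Enum

class Status(Enum):  # hypothetical status identifier values (names assumed)
    PENDING = 0
    RUNNING = 1
    DONE = 2

class CommTask:
    """A communication task carrying its own status identifier (operation S3510)."""
    def __init__(self, name):
        self.name = name
        self.status = Status.PENDING

# S3510: the communication task queue holds tasks together with status identifiers
tasks = [CommTask(f"send_chunk_{i}") for i in range(3)]
task_queue = deque(tasks)

# S3520: the execution queue models tasks being executed over a communication path
exec_queue = deque()

while task_queue:
    task = task_queue.popleft()
    exec_queue.append(task)
    task.status = Status.RUNNING  # S3530: status identifier changes as execution proceeds
    # ... here the accelerator card would actually move data over the path ...
    task.status = Status.DONE
    exec_queue.popleft()

print([t.status.name for t in tasks])  # ['DONE', 'DONE', 'DONE']
```

The host can poll each task's status identifier at any time to monitor progress, which is the monitoring role operation S3530 describes.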
  • the accelerator card system herein consists of multiple accelerator cards that can communicate with each other. These accelerator cards can be communicatively connected through different communication paths, so that one accelerator card can reach another accelerator card via different communication paths, thus forming different communication topologies. It should be understood that “connection” in the following refers to a communicative connection, that is, the accelerator cards can communicate with each other and transmit data.
  • acceleration card system described above may be formed as an acceleration unit, an acceleration assembly, an acceleration device, or the like. It should be understood that although different terms are used in the context depending on the specific scenario, they are essentially systems that include multiple accelerator cards.
  • Fig. 8a is a schematic diagram showing the structure of an acceleration unit in an embodiment disclosed.
  • the accelerator card system may include an acceleration unit, and the acceleration unit may include M local-unit accelerator cards, each local-unit accelerator card including an internal port and being connected to the other accelerator cards of the unit through the internal port, wherein the M local-unit accelerator cards logically form an accelerator card matrix of L*N scale, and L and N are integers not less than 2.
  • an accelerator card matrix can be formed by a plurality of accelerator cards, and the accelerator cards are connected to each other, so that data or instructions can be transmitted and communicated.
  • the accelerator cards MC00 to MC0N form the 0th row of the accelerator card matrix
  • the accelerator cards MC10 to MC1N form the 1st row of the accelerator card matrix
  • the accelerator cards MCL0 to MCLN form the Lth row of the accelerator card matrix.
  • accelerator cards in the same acceleration unit are referred to as “local unit accelerator cards”, and the accelerator cards in other acceleration units are referred to as “external unit accelerator cards”.
  • such terms are only for convenience of description, and do not limit the technical solutions of the present disclosure.
  • Each accelerator card can have multiple ports, and these ports can be connected to the accelerator card of this unit or to the accelerator card of an external unit.
  • the connection ports between the accelerator cards of this unit may be referred to as internal ports
  • the connection ports between the accelerator cards of this unit and the external unit accelerator cards may be referred to as external ports.
  • the external port and the internal port are only for the convenience of description, and the same port may be used for both. This will be described below.
  • M can be any integer
  • the M accelerator cards can be formed into a 1*M or M*1 matrix, or into other types of matrices.
  • the acceleration unit of the present disclosure is not limited to a specific matrix size and form.
  • a single or multiple communication paths may be used to connect. This will be described in detail later.
  • the formed matrix is not necessarily a matrix in terms of physical arrangement; the accelerator cards can be at any positions, for example, multiple accelerator cards can form a straight line or be arranged irregularly.
  • the above matrix is only in terms of logic, as long as the connection between the accelerator cards forms a matrix relationship.
  • M may be 4, so that the 4 local-unit accelerator cards may logically form a 2*2 accelerator card matrix; M may be 9, so that the 9 local-unit accelerator cards may logically form a 3*3 accelerator card matrix; M may be 16, so that the 16 local-unit accelerator cards may logically form a 4*4 accelerator card matrix; M may also be 6, so that the 6 local-unit accelerator cards may logically form a 2*3 or 3*2 accelerator card matrix; M may also be 8, so that the 8 local-unit accelerator cards may logically form a 2*4 or 4*2 accelerator card matrix.
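The logical L*N matrix formation described above can be illustrated with a small helper that picks a factorization for M cards. The function name and the choice of the most balanced factorization are assumptions made purely for illustration, not part of the disclosure.

```python
def matrix_shape(m):
    """Pick a logical L*N shape for m accelerator cards (prefers L, N >= 2)."""
    best = (1, m)  # fallback: a 1*M line when m is prime
    for l in range(2, int(m ** 0.5) + 1):
        if m % l == 0:
            best = (l, m // l)  # keep the most balanced factorization found
    return best

# The examples from the text: 4 -> 2*2, 9 -> 3*3, 16 -> 4*4, 6 -> 2*3, 8 -> 2*4
for m in (4, 9, 16, 6, 8):
    print(m, matrix_shape(m))
```

Note that the matrix is purely logical: the helper only decides how cards are indexed, not where they sit physically.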
  • each local-unit accelerator card is connected to at least one other local-unit accelerator card through two paths.
  • two local accelerator cards may be connected through a single communication path, or may be connected through multiple (eg, two) paths, as long as the number of ports is sufficient. Connecting through multiple communication paths is beneficial to ensure the reliability of communication between the acceleration cards, and is helpful to form different topological structures. This will be explained and described in more detail in the examples below.
  • diagonally opposite local-unit accelerator cards located at the four corners of the accelerator card matrix are connected through two paths.
  • the connection of accelerator cards on the diagonal helps to form two complete communication loops. This will be explained and described in more detail in the examples below.
  • At least one of the local-unit accelerator cards may include an external port.
  • each acceleration unit may include four local-unit accelerator cards, and each local-unit accelerator card may include six ports; four ports of each local-unit accelerator card are internal ports used to connect with the other three local-unit accelerator cards, and the remaining two ports of at least one local-unit accelerator card are external ports used to connect with external-unit accelerator cards.
  • for each accelerator card of this unit, four ports can be used to connect the other accelerator cards of the unit, and the remaining two ports can be used to connect accelerator cards in other acceleration units.
  • These vacant ports can also be idle ports, not connected to any external device, or directly or indirectly connected to other devices or ports.
  • the following takes each acceleration unit including four accelerator cards as an example. It should be understood that each acceleration unit may include a greater or smaller number of accelerator cards.
  • the acceleration unit may include four accelerator cards, namely a first accelerator card, a second accelerator card, a third accelerator card and a fourth accelerator card, each of which is provided with an internal port and an external port; each accelerator card is connected to the other three accelerator cards through the internal port.
  • FIG. 8b is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • the acceleration unit 800 includes four accelerator cards, which are an accelerator card MC0, an accelerator card MC1, an accelerator card MC2, and an accelerator card MC3.
  • each accelerator card can include an external port and an internal port.
  • the internal port of the accelerator card MC0 is connected to the internal ports of the accelerator cards MC1, MC2 and MC3.
  • the internal port of the accelerator card MC1 is connected to the internal ports of the accelerator cards MC0, MC2 and MC3.
  • the internal port of the accelerator card MC2 is connected to the internal port of the accelerator card MC3; that is, the internal port of each accelerator card is connected to the internal ports of the other three accelerator cards.
  • the information exchange among the four accelerator cards can be realized through the interconnection of the internal ports of the four accelerator cards.
  • the embodiment of the present disclosure utilizes the interconnection among the four accelerator cards in the acceleration unit, which can improve the computing capability of the acceleration unit, realize high-speed processing of massive data, and make the path between each accelerator card and the other accelerator cards the shortest, with the lowest communication latency.
  • the number of accelerator cards in the present disclosure may not be limited to four, but may be other numbers.
  • the number N of accelerator cards is equal to 3; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other two accelerator cards through the internal port, so as to realize interconnection among the three accelerator cards.
  • the number N of accelerator cards is equal to 5; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other four accelerator cards through the internal port, so that the five accelerator cards are interconnected, which increases the computing power of the acceleration unit and realizes high-speed processing of massive data.
  • the number N of accelerator cards is greater than 5; each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to all other accelerator cards through the internal port, so that the N accelerator cards are interconnected to achieve high-speed processing of massive data.
  • each acceleration card and at least one other acceleration card may be connected through two paths.
  • the first connection method is that each accelerator card can be connected to one of the other three accelerator cards through two paths
  • the second method is that each accelerator card can be connected to two of the other three accelerator cards through two paths
  • the third way is that each accelerator card can be connected with the other three accelerator cards through two paths; in this case, each accelerator card may need more ports.
  • FIG. 9 is a schematic structural diagram of an acceleration unit in another embodiment of the present disclosure.
  • each accelerator card and at least one other accelerator card can be connected by two paths, for example, the accelerator card MC0 and the accelerator card MC2 in the figure can be connected by two paths , and the accelerator card MC1 and the accelerator card MC3 can be connected by two paths.
  • the acceleration unit has been exemplarily described above with reference to FIG. 8 and FIG. 9 .
  • the arrangement of the accelerator cards in the acceleration unit may not be limited to the form shown in FIG. 8 and FIG. 9.
  • the four accelerator cards of the acceleration unit may be logically arranged in a quadrilateral arrangement. The description below will be made in conjunction with FIG. 10.
  • FIG. 10 is a schematic structural diagram of an acceleration unit in yet another embodiment of the present disclosure.
  • four accelerator cards MC0 , MC1 , MC2 and MC3 can be logically arranged in a quadrilateral arrangement, and the four accelerator cards can occupy four vertex positions of the quadrilateral.
  • the lines between the accelerator cards MC0, MC1, MC2 and MC3 form a quadrilateral, which makes the arrangement of the lines clearer and facilitates laying out the lines.
  • the four accelerator cards shown in Figure 10 are arranged in a rectangle or a 2*2 matrix, but this is a logical interconnection diagram. It is drawn in the form of a rectangle for the convenience of description.
  • the specific quadrilateral can be freely set, such as a parallelogram, trapezoid, or square.
  • the four accelerator cards can also be arranged arbitrarily.
  • the four accelerator cards are arranged side by side in a line shape, and the order can be MC0, MC1, MC2, MC3.
  • the logical quadrilateral described in this embodiment is exemplary, and in fact, the arrangement shape of multiple accelerator cards can be ever-changing, and the quadrilateral is only one of them. For example, when the number of accelerator cards is five, they can be logically arranged in a pentagon.
  • FIG. 11 is a schematic structural diagram of an acceleration unit in yet another embodiment of the present disclosure.
  • the four accelerator cards MC0 , MC1 , MC2 and MC3 can be logically arranged in a quadrilateral arrangement, and the four accelerator cards occupy four vertex positions of the quadrilateral respectively.
  • two paths can be used for the connection between the internal port of accelerator card MC1 and the internal port of accelerator card MC3, and two paths can be used between the internal port of accelerator card MC0 and the internal port of accelerator card MC2. In this way, for the acceleration unit 1100, not only is the wiring convenient, but reliability is also improved.
  • FIG. 12a is a schematic structural diagram of an acceleration unit in an embodiment of the present disclosure.
  • the numerals on each accelerator card represent ports; each accelerator card may include six ports, namely port 0, port 1, port 2, port 3, port 4 and port 5.
  • port 1, port 2, port 4 and port 5 are internal ports
  • port 0 and port 3 are external ports.
  • the 2 external ports of each accelerator card can be connected to other accelerator units for interconnection between multiple accelerator units.
  • the 4 internal ports of each accelerator card can be used to interconnect with the other three accelerator cards in this accelerator unit.
  • the four accelerator cards may be logically arranged in, for example, a quadrilateral; accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship, port 2 of MC0 is connected to port 2 of MC2, and port 5 of MC0 is connected to port 5 of MC2; that is, there can be two links for communication between accelerator card MC0 and accelerator card MC2.
  • Accelerator card MC1 and accelerator card MC3 can be in a diagonal relationship.
  • port 2 of MC1 is connected to port 2 of MC3, and port 5 of MC1 is connected to port 5 of MC3; that is, there can be two links for communication between accelerator card MC1 and accelerator card MC3.
  • since each accelerator card has two external ports and four internal ports, and the two pairs of accelerator cards are in diagonal relationships, the two accelerator cards of each diagonal pair can be connected through two internal ports to form two links, which effectively improves the security and stability of the acceleration unit.
  • the quadrilateral arrangement logically formed by the four accelerator cards makes the circuit layout of the entire acceleration unit reasonable and clear, and facilitates the wiring within each acceleration unit. It should further be noted the following about the interconnection lines between the four accelerator cards as shown in FIG. 12a.
  • the connection line between port 1 of accelerator card MC1 and port 1 of MC0, the connection line between port 2 of accelerator card MC0 and port 2 of MC2, the connection line between port 1 of accelerator card MC2 and port 1 of MC3, and the connection line between port 2 of accelerator card MC3 and port 2 of MC1: these four lines constitute a vertical figure-8 network, as shown in FIG. 12b.
  • the connection line between port 4 of accelerator card MC1 and port 4 of MC2, the connection line between port 5 of accelerator card MC2 and port 5 of MC0, the connection line between port 4 of accelerator card MC0 and port 4 of MC3, and the connection line between port 5 of accelerator card MC3 and port 5 of MC1 form a horizontal figure-8 network, as shown in FIG. 12c.
  • these two fully connected networks can form a double-ring structure, which provides redundant backup and enhances system reliability.
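Each of the two figure-8 networks described above visits all four cards, which is what yields the double-ring redundancy. A small sketch follows; the card and port labels mirror FIG. 12b/12c, while the ring check itself is an illustrative assumption, not part of the disclosure.

```python
# Each link is (card_a, port_a, card_b, port_b), following FIG. 12b and 12c.
vertical = [("MC1", 1, "MC0", 1), ("MC0", 2, "MC2", 2),
            ("MC2", 1, "MC3", 1), ("MC3", 2, "MC1", 2)]
horizontal = [("MC1", 4, "MC2", 4), ("MC2", 5, "MC0", 5),
              ("MC0", 4, "MC3", 4), ("MC3", 5, "MC1", 5)]

def is_ring(links):
    """Check the ring property: n links over n cards, every card with degree 2.

    This is sufficient for the four-card case here; a fully general check
    would also verify that the links form a single connected cycle.
    """
    degree = {}
    for a, _, b, _ in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return len(links) == len(degree) and all(d == 2 for d in degree.values())

print(is_ring(vertical), is_ring(horizontal))  # True True: a double-ring structure
```

Because each ring independently connects all four cards, either ring can carry traffic on its own, which is the redundancy-backup behavior the text describes.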
  • the accelerator card described in the present disclosure may be a Mezzanine Card (MC card for short), which may be a separate circuit board.
  • the MC card can be equipped with ASIC chips and some necessary peripheral control circuits.
  • the MC card can be connected to the base board through the pin board connector.
  • the power and control signals on the base board can be transmitted to the MC card through the daughter board connector.
  • the internal port and/or the external port described in the present disclosure may be a SerDes port.
  • each MC card can provide 6 bidirectional SerDes ports; each SerDes port has 8 channels with a data transmission rate of 56 Gbps per channel, so the total bandwidth of each port can be as high as 400 Gbps, which supports massive data exchange between accelerator cards and helps the acceleration unit process massive data at high speed.
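As a rough arithmetic check on the figures above: 8 channels at 56 Gbps gives 448 Gbps of raw line rate per port. One plausible reading, which is an assumption rather than something the source states, is that roughly 50 Gbps per channel remains as payload after encoding and FEC overhead (as in 400G Ethernet lane rates), which matches the quoted "up to 400 Gbps" per port.

```python
channels = 8
raw_per_channel_gbps = 56       # stated SerDes line rate per channel
payload_per_channel_gbps = 50   # assumed effective rate after overhead

raw_total = channels * raw_per_channel_gbps          # 448 Gbps raw per port
payload_total = channels * payload_per_channel_gbps  # 400 Gbps, matching the text
print(raw_total, payload_total)
```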
  • the SerDes mentioned above is a compound of the English words Serializer and De-Serializer, and is referred to as a serializer/deserializer.
  • the SerDes interface can be used to build clusters of high-performance processors.
  • the main function of SerDes is to convert multi-channel low-speed parallel signals into serial signals at the sending end, transmit them through the transmission medium, and finally convert the high-speed serial signals back into low-speed parallel signals at the receiving end, so it is very suitable for end-to-end long-distance high-speed transmission requirements.
  • the external port of the accelerator card can be connected to the QSFP-DD interface of other acceleration units, wherein the QSFP-DD interface is an optical module interface commonly used with SerDes technology and can be used in conjunction with cables for interconnection with other external devices.
  • one acceleration unit may be equipped with 4 accelerator cards, and the interconnection of the 4 accelerator cards may be completed using printed circuit board (PCB) wiring.
  • each accelerator card is connected to the other three accelerator cards through the internal port of the accelerator card, and each accelerator card can directly communicate with the other three accelerator cards.
  • This communication architecture is a fully connected quad network topology.
  • the advantage of this fully connected network architecture is that the path between each accelerator card and the other accelerator cards is the shortest, the total number of hops is the smallest, and the communication latency is the lowest.
  • the present disclosure uses Hop to describe the latency of the system; Hop represents the number of hops in communication, that is, the number of communication steps. Hop specifically refers to the length of the shortest path that starts from a node, traverses all nodes in the network, and returns to the initial node.
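The hop advantage of the fully connected four-card topology can be checked with a small breadth-first search. This is illustrative only; the adjacency dictionaries are assumptions modeling the fully connected unit of FIG. 8b and, for comparison, a plain four-card ring.

```python
from collections import deque

def hops(adj, src, dst):
    """Breadth-first search: minimum number of hops from src to dst."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # unreachable

full_mesh = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}

# Fully connected: every pair is 1 hop apart; plain ring: opposite cards are 2 hops.
print(max(hops(full_mesh, a, b) for a in range(4) for b in range(4) if a != b))  # 1
print(max(hops(ring, a, b) for a in range(4) for b in range(4) if a != b))       # 2
```

In the fully connected topology every pairwise path is a single hop, which is the "shortest path, lowest latency" property claimed for the architecture.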
  • each ring in the dual-ring structure can separately complete a part of the operation, thereby improving the overall operation efficiency and maximizing the utilization of the topology bandwidth.
  • the present disclosure also discloses an acceleration assembly that may include a plurality of the above-mentioned acceleration units. Various embodiments of the acceleration assembly are exemplarily described below.
  • FIG. 13 is a schematic structural diagram of an acceleration component in an embodiment of the present disclosure.
  • the acceleration assembly 1300 may include n of the above-mentioned acceleration units; in other words, the accelerator card system may be embodied as an acceleration assembly that includes a plurality of acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An, wherein acceleration unit A1 and acceleration unit A2 are connected through external ports, and acceleration unit A2 and acceleration unit A3 are connected through external ports; that is, the acceleration units are connected through their external ports.
  • the external port of accelerator card MC0 in acceleration unit A1 can be connected with the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 can be connected with the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected through the external ports of their accelerator cards MC0.
  • the connection between the acceleration units in the present disclosure is not limited to connecting the external ports of accelerator card MC0, and may also include, for example, connecting the external ports of accelerator card MC1 and connecting the external ports of accelerator card MC2.
  • the connection mode of acceleration unit A2 and acceleration unit A3 may include: the external port of MC0 in A2 is connected to the external port of MC0 in A3, the external port of MC1 in A2 is connected with the external port of MC1 in A3, the external port of MC2 in A2 is connected with the external port of MC2 in A3, and the external port of MC3 in A2 is connected with the external port of MC3 in A3.
  • and so on, up to the connection between acceleration unit An-1 and acceleration unit An. It should be noted that the above description is exemplary; for example, the connection between different acceleration units is not limited to connecting accelerator cards with corresponding labels, and connecting accelerator cards with the same label can be set as required.
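The pattern of chaining units by connecting like-numbered cards through their external ports can be sketched as follows. The data structures and function name are assumptions for illustration only.

```python
def connect_units(unit_a, unit_b, num_paths):
    """Connect the first num_paths corresponding MC cards of two units via external ports."""
    return [(f"{unit_a}.MC{i}", f"{unit_b}.MC{i}") for i in range(num_paths)]

# A chain of units A1..A4, each joined to the next through all four MC cards
units = [f"A{k}" for k in range(1, 5)]
links = []
for a, b in zip(units, units[1:]):
    links += connect_units(a, b, 4)

print(len(links))  # 3 adjacent unit pairs * 4 paths = 12 external links
```

Passing 1, 2, or 3 as `num_paths` yields the sparser inter-unit wirings described later in conjunction with FIGS. 15 through 18.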
  • FIG. 13 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 as in the figure; it can also be set to, for example, 2 or 3. The connection relationship between two acceleration units is the same as or similar to that between the above-mentioned acceleration units A1 and A2, and the connection relationship among three acceleration units is the same as or similar to that among the above-mentioned acceleration units A1, A2 and A3.
  • the structures of the multiple acceleration units in the acceleration assembly may be the same or different.
  • the structures of the multiple acceleration units shown are the same, but in practice the structures of the multiple acceleration units may be different.
  • for example, the layout of the multiple accelerator cards in some acceleration units is a polygon while in other acceleration units it is a line, and in some acceleration units multiple accelerator cards are connected through two links, etc.
  • some acceleration units include four accelerator cards, and some acceleration units include three or five accelerator cards, etc.; that is, the structure of each acceleration unit can be set separately, and the structures of different acceleration units may be the same or different.
  • while processing data, each accelerator card can also share data through the interconnection between acceleration units. Since data sharing allows data to be obtained directly, the data propagation path and time are reduced, which plays a significant role in improving data processing efficiency.
  • FIG. 14 is a schematic structural diagram of an acceleration component in another embodiment of the present disclosure.
  • the acceleration component 1400 may include n of the aforementioned acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The acceleration units in the acceleration component 1400 may logically have a multi-layer structure (shown by dotted lines in the figure); each layer may include an acceleration unit, and the accelerator cards of each acceleration unit are connected to accelerator cards in another acceleration unit through external ports.
  • such a progressive configuration combination enables each accelerator card to share data through a high-speed serial link while processing data at high speed, realizing flexible configuration of the hardware computing power of the processor cluster.
  • the acceleration unit of each layer may include four acceleration cards, the acceleration units may be logically arranged in a quadrilateral arrangement, and the four acceleration cards are respectively arranged at four vertex positions of the quadrilateral.
  • the acceleration components described above in conjunction with FIG. 14 are exemplary and not limiting.
  • the structures of the multiple acceleration units may be the same or different.
  • the number of layers of the acceleration component can be 2 layers, 3 layers, 4 layers or more than 4 layers, and the number of layers can be freely set as required.
  • the number of connection paths between two acceleration units can be 1, 2, 3 or 4.
  • an exemplary description will be made below with reference to FIGS. 15-19 .
  • FIG. 15 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
  • the number of acceleration units in the acceleration component 1401 can be 2, and the two acceleration units are connected through one path; for example, the external port of accelerator card MC0 in acceleration unit A1 can be connected to the external port of accelerator card MC0 in acceleration unit A2 to realize information exchange between acceleration unit A1 and acceleration unit A2.
  • the number of acceleration units in the acceleration component 1402 can be two, and the two acceleration units are connected through two paths.
  • for example, the external port of accelerator card MC0 in acceleration unit A1 is connected with the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected with the external port of accelerator card MC1 in acceleration unit A2. In this way, when one of the paths fails, the other line still supports communication between the acceleration units, further improving the reliability of the acceleration component.
  • FIG. 17 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
  • the number of acceleration units can be 2, and the two acceleration units are connected through three paths.
  • for example, the external port of accelerator card MC0 in acceleration unit A1 is connected with the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected with the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected with the external port of accelerator card MC2 in acceleration unit A2.
  • FIG. 18 is a schematic structural diagram of an acceleration component in yet another embodiment of the present disclosure.
• the number of acceleration units can be 2, and the two acceleration units can be connected through four paths. For example, the external port of the accelerator card MC0 in the acceleration unit A1 is connected to the external port of the accelerator card MC0 in the acceleration unit A2, the external port of the accelerator card MC1 in the acceleration unit A1 is connected to the external port of the accelerator card MC1 in the acceleration unit A2, the external port of the accelerator card MC2 in the acceleration unit A1 is connected to the external port of the accelerator card MC2 in the acceleration unit A2, and the external port of the accelerator card MC3 in the acceleration unit A1 is connected to the external port of the accelerator card MC3 in the acceleration unit A2. In this way, even when three of the paths fail, there is still one path to support communication between the acceleration units, further improving the reliability of the acceleration component.
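The redundancy described above can be illustrated with a small sketch (not from the disclosure; the path labels are hypothetical names for the four card-to-card links between the acceleration units):

```python
# Illustrative sketch: selecting a healthy inter-unit path when some of
# the redundant links between acceleration units A1 and A2 fail.

def pick_path(paths, failed):
    """Return the first path that has not failed, or None if all failed."""
    for p in paths:
        if p not in failed:
            return p
    return None

# Hypothetical labels for the four redundant MC0-MC3 links.
paths = ["MC0-MC0", "MC1-MC1", "MC2-MC2", "MC3-MC3"]

# Even with three of the four paths down, communication can continue.
assert pick_path(paths, failed={"MC0-MC0", "MC1-MC1", "MC2-MC2"}) == "MC3-MC3"
# Only when all four paths fail is inter-unit communication lost.
assert pick_path(paths, failed=set(paths)) is None
```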
  • Figure 19a is a schematic diagram of the acceleration components represented as a network topology.
• the acceleration component 1405 may include two acceleration units, each acceleration unit may include four accelerator cards, and in each acceleration unit there may be two links between the accelerator card MC1 and the accelerator card MC3, and two links between the accelerator card MC0 and the accelerator card MC2.
• the acceleration component 1405 in the left figure of FIG. 19a can form the stereoscopic representation shown in the right figure.
  • the circles in the right figure of Figure 19a represent accelerator cards, and the lines represent link connections.
  • the number 0 in the circle represents the accelerator card MC0
  • the number 1 represents the accelerator card MC1
  • the number 2 represents the accelerator card MC2
  • the number 3 represents the accelerator card MC3.
• the figure on the right still shows the acceleration component 1405, merely in another form of expression, namely as a network topology.
  • the numbers embedded in the vertical lines in the right figure represent the connected port numbers.
• port 0 is used for the connection between the MC0 cards of the two acceleration units, port 0 is used for the connection between the MC1 cards, and port 3 is used for the connection between the MC2 cards.
  • one acceleration unit is regarded as a node, and two nodes have 8 accelerator cards, that is, two nodes constitute a so-called 8-card interconnection.
• the one-machine-four-card interconnection relationship inside each node is fixed.
• MC0 and MC1 in the upper node (i.e., acceleration unit A1) are connected to MC0 and MC1 of the lower node through port 0, respectively, and MC2 and MC3 of the upper node are connected to MC2 and MC3 of the lower node through port 3, respectively.
• This node topology is called a hybrid cube mesh topology; that is, the acceleration component 1405 forms a hybrid cube mesh network topology.
• the accelerator cards MC1 and MC3 in the acceleration unit A1 are connected via their respective internal ports 5, the accelerator cards MC0 and MC2 are connected via their respective internal ports 5, and the accelerator cards MC2 and MC3 are connected via their respective internal ports 1; meanwhile, the accelerator card MC1 in the acceleration unit A1 and the accelerator card MC1 in the acceleration unit A2 are connected through their respective external ports 0, and the accelerator card MC0 in the acceleration unit A1 and the accelerator card MC0 in the acceleration unit A2 are connected through their respective external ports 0.
• In this way, an independent ring is formed among the 8 cards in FIG. 19.
• the accelerator cards MC1 and MC3 in the acceleration unit A1 are connected via their respective internal ports 2, the accelerator cards MC0 and MC2 are connected via their respective internal ports 2, and the accelerator cards MC0 and MC1 are connected via their respective internal ports 1; the accelerator card MC2 in the acceleration unit A1 and the accelerator card MC2 in the acceleration unit A2 are connected through their respective external ports 3, and the accelerator card MC3 in the acceleration unit A1 and the accelerator card MC3 in the acceleration unit A2 are connected through their respective external ports 3.
• In this way, another independent ring is formed among the 8 cards in FIG. 19.
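The two independent rings described above can be sketched as follows. The traversal orders below are one possible way of walking the listed port connections and are illustrative only:

```python
# Sketch of the two independent 8-card rings across acceleration units
# A1 and A2 of the hybrid cube mesh. Card names use "unit.card" labels.

ring1 = ["A1.MC0", "A1.MC2", "A1.MC3", "A1.MC1",
         "A2.MC1", "A2.MC3", "A2.MC2", "A2.MC0"]
ring2 = ["A1.MC0", "A1.MC1", "A1.MC3", "A2.MC3",
         "A2.MC1", "A2.MC0", "A2.MC2", "A1.MC2"]

all_cards = {f"{u}.MC{i}" for u in ("A1", "A2") for i in range(4)}

for ring in (ring1, ring2):
    # Each ring visits every one of the 8 cards exactly once before
    # closing back to its start, so both rings can carry traffic
    # (e.g., half of a reduction each) independently.
    assert set(ring) == all_cards and len(ring) == 8
```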
  • FIG. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
• the acceleration device 2000 may include n of the above-described acceleration units, namely the acceleration unit A1, the acceleration unit A2, the acceleration unit A3, ..., and the acceleration unit An. The acceleration units in the acceleration device 2000 logically form a multi-layer structure (shown in dotted lines in the figure), where the number of layers may be odd or even, each layer may include one acceleration unit, and the accelerator card of each acceleration unit is connected through an external port to an accelerator card in another acceleration unit. Specifically, the acceleration unit A1 and the acceleration unit A2 are connected through an external port, the acceleration unit A2 and the acceleration unit A3 are connected through an external port, and so on, through to the acceleration unit An.
• the last acceleration unit can be connected with the first acceleration unit, so that the multiple acceleration units are connected end to end to form a ring structure; for example, the external port of the accelerator card MC0 of the acceleration unit An in the figure is connected to the external port of the accelerator card MC0 of the acceleration unit A1.
• Such a progressive configuration combination enables each accelerator card to share data through a high-speed serial link while processing data at high speed, realizing flexible configuration of the hardware computing power of the processor cluster.
• The connection relationship of the acceleration units in the acceleration device of the present disclosure has various cases, which have been described in detail above; reference may be made to the foregoing description of the connection relationships of the acceleration units, which will not be repeated here.
• the last acceleration unit is connected to the first acceleration unit, which may specifically include one or more of the following connection modes: the external port of MC0 in the acceleration unit A1 is connected to the external port of MC0 in An; the external port of MC1 in the acceleration unit A1 is connected to the external port of MC1 in An; the external port of MC2 in the acceleration unit A1 is connected to the external port of MC2 in An; and the external port of MC3 in the acceleration unit A1 is connected to the external port of MC3 in An.
• FIG. 21 and FIG. 22 are various embodied forms of the acceleration device 2000 shown in FIG. 20. Therefore, the relevant description of the acceleration device 2000 shown in FIG. 20 can also be applied to the acceleration devices in FIG. 21 and FIG. 22.
  • FIG. 21 is a schematic diagram of a network topology corresponding to an acceleration device in an embodiment.
  • the acceleration device 2001 shown in FIG. 21 can be composed of four acceleration units.
  • the circles represent accelerator cards, and the lines represent link connections.
• the number 0 in the circle represents the accelerator card MC0, the number 1 represents the accelerator card MC1, the number 2 represents the accelerator card MC2, and the number 3 represents the accelerator card MC3; the numbers embedded in the vertical lines in the figure represent the numbers of the connected ports.
• the last acceleration unit is connected to the first acceleration unit, and the total number of hops is 5.
  • Each acceleration unit is a node. Through the interconnection between nodes, 4 nodes and 16 cards can be interconnected.
• the four acceleration units form a small, internally interconnected cluster, which is called a supercomputing cluster (super pod).
• This topology is the primary recommended form for ultra-large-scale clusters; using high-speed SerDes ports, the total number of hops is 5 and the latency is the lowest.
• the cluster is thus more manageable and more robust.
  • FIG. 22 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment.
• the acceleration device 2002 shown in FIG. 22 has more acceleration units. It can be seen from the illustration that the last acceleration unit of the acceleration device 2002 is connected to the first acceleration unit. For an acceleration device configured in this way, the total number of hops is the number of nodes plus one, that is, the number of acceleration units plus one.
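The hop-count rule stated above can be checked against the ring examples given in this disclosure:

```python
# For a ring of acceleration units, the total number of hops is the
# number of units (nodes) plus one, as stated for FIG. 22.

def total_hops(num_units: int) -> int:
    return num_units + 1

assert total_hops(4) == 5    # FIG. 21: 4 acceleration units, 5 hops
assert total_hops(12) == 13  # FIG. 31: 12 nodes, 13 hops
assert total_hops(16) == 17  # FIG. 32: 16 nodes, 17 hops
```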
  • the acceleration device including a plurality of acceleration units is exemplarily described above with reference to FIGS. 20-22 .
• the present disclosure also provides an acceleration device that can include a plurality of the aforementioned acceleration components, which is described in detail below by way of example.
  • FIG. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure.
  • the acceleration system of the present disclosure may be implemented as an acceleration device.
• the acceleration device 3000 may include m of the aforementioned acceleration components.
• In each acceleration component, in addition to the external ports that need to be connected between the acceleration units within the component, there are also idle external ports, and the acceleration components are connected to each other through these idle external ports. For example, the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B1 can be connected with the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B2, the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B2 can be connected with the external port of the accelerator card MC1 of the acceleration unit A1 in the acceleration component B3, and so on, so that the multiple acceleration components are connected to each other.
  • the acceleration device shown in FIG. 23 is exemplary and not limiting, for example, the structures of the multiple acceleration components may be the same or different.
  • the manner of connecting different acceleration components through idle external ports may not be limited to the manner shown in FIG. 23 , and may also include other manners. For ease of understanding, an exemplary description will be made below with reference to FIGS. 24-32 .
• FIG. 24 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment.
• the acceleration device 3001 may include two acceleration components, the acceleration component B1 may include four acceleration units, and the acceleration component B2 may include four acceleration units; the first acceleration unit in the acceleration component B1 is connected with the first acceleration unit in the acceleration component B2, and the last acceleration unit in the acceleration component B1 is connected with the last acceleration unit in the acceleration component B2.
  • the total number of hops in this network topology is 9.
  • the network structure composed of multiple acceleration units in each acceleration component in FIG. 24 is logical, and the arrangement positions of multiple acceleration units can be adjusted as required in practical applications.
  • the number of acceleration units in each acceleration assembly may not be limited to the four shown in the figure, and may be set more or less as required, for example, six, eight, etc. may be set.
  • the acceleration device 3002 may include four acceleration components, namely, acceleration components B1 , B2 , B3 and B4 .
• each acceleration component may include two acceleration units A1 and A2, and each acceleration component may be interconnected with other acceleration components through one of its acceleration units A1 and A2.
  • the acceleration unit A1 in the acceleration component B1 is connected to the acceleration unit A1 in the acceleration component B2
  • the acceleration unit A1 in the acceleration component B2 is connected with the acceleration unit A1 in the acceleration component B3
• the acceleration unit A1 in the acceleration component B3 is connected with the acceleration unit A1 in the acceleration component B4; the connections here are all made through the external ports of the acceleration units.
• the connection between the acceleration components may specifically include: the acceleration unit A1 or A2 in the acceleration component B1 is connected with the acceleration unit A1 or A2 in the acceleration component B2, the acceleration unit A1 or A2 in the acceleration component B2 is connected with the acceleration unit A1 or A2 in the acceleration component B3, and the acceleration unit A1 or A2 in the acceleration component B3 is connected with the acceleration unit A1 or A2 in the acceleration component B4.
• each acceleration component can use two paths, through one of its first acceleration unit and second acceleration unit, to connect with one of the first acceleration unit and the second acceleration unit of another acceleration component.
• the first acceleration unit (e.g., acceleration unit A1) in the acceleration component B1 in the figure can be connected with the first acceleration unit (e.g., acceleration unit A1) in the acceleration component B2 through two paths, the acceleration unit A1 in the acceleration component B2 is connected with the acceleration unit A1 in the acceleration component B3 through two paths, and the acceleration unit A1 in the acceleration component B3 is connected with the acceleration unit A1 in the acceleration component B4 through two paths.
• The connection between the acceleration components can also be made in other ways. For example, the acceleration unit A1 or A2 in the acceleration component B1 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B2, the acceleration unit A1 or A2 in the acceleration component B2 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B3, and the acceleration unit A1 or A2 in the acceleration component B3 can use two paths to connect with the acceleration unit A1 or A2 in the acceleration component B4.
  • FIG. 27 is a schematic diagram of the acceleration device in another embodiment of the present disclosure.
• the acceleration device 3004 includes four acceleration components, namely the acceleration component B1, the acceleration component B2, the acceleration component B3, and the acceleration component B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards.
  • MC0 and MC1 are the first pair of accelerator cards
  • MC2 and MC3 are the second pair of accelerator cards.
• the second pair of accelerator cards of the acceleration unit A1 of the acceleration component B1 is connected with the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B2; the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B2 is connected with the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B3; the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B3 is connected with the second pair of accelerator cards of the acceleration unit A1 of the acceleration component B4; and the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B4 is connected with the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B1.
  • FIG. 28 is a schematic diagram of a network topology of another acceleration device.
  • the acceleration device 3005 shown in FIG. 28 is a specific form of the acceleration device 3004 shown in FIG. 27 , so the above related descriptions about the acceleration device 3004 can also be applied to the acceleration device 3005 in FIG. 28 .
• each acceleration component of the acceleration device 3005 can form a hybrid cube mesh unit, and the interconnection relationship within each hybrid cube mesh unit can be as shown in the figure, to realize the 8-node, 32-card interconnection of the acceleration device 3005.
  • the four acceleration components can be interconnected with multiple cards and multiple nodes through, for example, QSFP-DD interfaces and cables, forming a matrix network topology.
• ports 0 of the accelerator cards MC2 and MC3 of the upper node of the acceleration component B1 in this embodiment may be respectively connected to the accelerator cards MC2 and MC3 of the lower node of the acceleration component B2; ports 3 of MC0 and MC1 of the lower node of the acceleration component B2 may be respectively connected to MC0 and MC1 of the upper node of the acceleration component B3; ports 0 of MC2 and MC3 of the lower node of the acceleration component B3 may be respectively connected to MC2 and MC3 of the upper node of the acceleration component B4; and ports 3 of MC0 and MC1 of the upper node of the acceleration component B4 may be respectively connected to MC0 and MC1 of the lower node of the acceleration component B1.
• the interconnection between the hybrid cube mesh units set in this way can form two bidirectional ring structures (as described above in conjunction with Fig. 12b, Fig. 12c, Fig. 19b and Fig. 19c), which has the advantages of better reliability and security; it is also suitable for deep learning training and has high computing efficiency.
• In the matrix network topology consisting of 8 nodes in the acceleration device 3005, the total number of hops is 11.
  • the first pair of accelerator cards and the second pair of accelerator cards in different acceleration units in the same acceleration assembly may be indirectly connected.
  • the accelerator cards MC0 and MC1 of the upper-layer acceleration unit in the acceleration component B1 are indirectly connected with the accelerator cards MC2 and MC3 of the lower-layer acceleration unit.
• FIG. 29 is a schematic diagram of the matrix network topology based on the unlimited expansion of the acceleration device.
• the acceleration device 3006 may include multiple acceleration components, each acceleration component (shown as a block in the figure) may include multiple acceleration units (a perspective view is not shown; refer to the structure of the acceleration component in FIG. 28), and each acceleration unit may include, for example, four interconnected accelerator cards as shown in the illustration, so the matrix network topology can theoretically expand infinitely.
  • FIG. 30 is a schematic diagram of the acceleration device in another embodiment of the present disclosure.
• the acceleration device 3008 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components can be connected in a ring.
  • the acceleration unit An of the acceleration component B1 can be connected with the acceleration unit A1 of the acceleration component B2
  • the acceleration unit An of the acceleration component B2 can be connected with the acceleration unit A1 of the acceleration component B3
• and so on, until the acceleration unit An of the acceleration component Bm is connected to the acceleration unit A1 of the acceleration component B1, so that the m acceleration components are connected end to end in a ring.
  • FIG. 31 is a schematic diagram of the network topology of another acceleration device.
• the acceleration device 3009 may include 6 acceleration components, each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, forming an interconnection of 12 nodes and 48 cards and thus a larger matrix network topology.
• the total number of hops under this network topology is 13.
  • FIG. 32 is a schematic diagram of a network topology of another acceleration device.
• the acceleration device 3010 includes 8 acceleration components, each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, forming an interconnection of 16 nodes and 64 cards and thus a larger matrix network topology. The total number of hops under this network topology is 17.
• On the basis of FIG. 32, the topology can be extended vertically to form super-large-scale matrix networks such as 20 nodes with 80 cards and 24 nodes with 96 cards. In theory, it can be extended infinitely, and the total number of hops is the number of nodes plus one. By optimizing the interconnection between nodes, the latency of the entire system can be minimized, and the real-time requirements of the system can be met to the greatest extent while processing massive data.
  • the acceleration device including a plurality of acceleration components has been exemplarily described above with reference to FIGS. 23 to 32.
• Those skilled in the art can understand that the above description is exemplary rather than limiting; for example, the number and structure of the acceleration components and the connection relationships between the acceleration components can be adjusted as needed.
  • Those skilled in the art can also combine the above multiple embodiments to form an acceleration device as required, which is also within the protection scope of the present disclosure.
• The accelerator-card fully connected square network (topology), hybrid cube mesh (topology), matrix network (topology), etc. described in this disclosure are all logical, and the specific layout can be adjusted as required.
  • the topology disclosed in the present disclosure can also perform data reduction operations.
• the reduction operation can be performed across the accelerator cards of each acceleration unit, within each acceleration component, and within the acceleration device.
  • the specific operation steps can be as follows.
• the reduction operation process performed in one acceleration unit may include: transferring the data stored in the first accelerator card to the second accelerator card, and adding the data originally stored in the second accelerator card to the data received from the first accelerator card; then, transferring the result of the addition operation in the second accelerator card to the third accelerator card, where the addition operation is performed again; and so on, until all the data stored in the accelerator cards have been added and each accelerator card has received the final operation result.
  • the accelerator card MC0 stores data (0, 0)
  • the accelerator card MC1 stores data (1, 2)
  • the accelerator card MC2 stores data (3, 1).
  • data (2,4) is stored in the accelerator card MC3.
• the data (0,0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1,2) is obtained after the addition operation; then, the result (1,2) is transferred to the accelerator card MC2, and the next result (4,3) is obtained; then, the next result (4,3) is transferred to the accelerator card MC3 to obtain the final result (6,7).
• the final result (6,7) then continues to be transmitted to each of the accelerator cards MC0, MC1, MC2 and MC3, so that the data (6,7) is stored in all the accelerator cards, thus completing the reduction operation in one acceleration unit.
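The sequential walk described above can be sketched as follows, a minimal illustration using the example data from the text:

```python
# Sequential reduction within one acceleration unit: a running
# element-wise sum is passed card-to-card, and the final result is
# then broadcast back to every card.

def ring_reduce(cards):
    total = cards[0]
    for data in cards[1:]:
        # Each hop adds the next card's data to the running result.
        total = tuple(a + b for a, b in zip(total, data))
    # Every card receives the final result.
    return [total] * len(cards)

cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3
assert ring_reduce(cards) == [(6, 7)] * 4
```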
  • the acceleration unit shown in FIG. 11 can form two independent rings, and each ring can complete the reduction operation of half of the data, thereby speeding up the operation speed and improving the operation efficiency.
• when the above-mentioned acceleration unit performs the reduction operation, it can also realize concurrent computation across multiple accelerator cards, further speeding up the operation.
  • the accelerator card MC0 stores data (0,0)
  • the accelerator card MC1 stores data (1,2)
  • the accelerator card MC2 stores data (3,1)
  • the accelerator card MC3 stores data ( 2,4).
• Part of the data (the element 0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1) is obtained after the addition operation; synchronously, part of the data (the element 2) in the accelerator card MC1 can be transferred to the accelerator card MC2, and the result (3) is obtained, thereby realizing the concurrent operation of the accelerator cards MC1 and MC2; and so on, until the entire reduction operation is completed.
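This pipelined variant can be sketched as follows: each data element travels its own chain of additions starting at a different card, so different cards add different elements in the same time step.

```python
# Pipelined reduction sketch using the example data from the text.
cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3

# Step 1 runs concurrently on two cards: MC0 sends element 0 to MC1,
# while MC1 sends element 1 to MC2, matching the partial results above.
step1_mc1 = cards[0][0] + cards[1][0]  # 0 + 1 = 1
step1_mc2 = cards[1][1] + cards[2][1]  # 2 + 1 = 3
assert (step1_mc1, step1_mc2) == (1, 3)

# Continuing each chain over all cards yields the full reduction.
elem0 = sum(c[0] for c in cards)  # 0 + 1 + 3 + 2
elem1 = sum(c[1] for c in cards)  # 0 + 2 + 1 + 4
assert (elem0, elem1) == (6, 7)
```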
• the above-mentioned concurrent computation may further include: a group of accelerator cards performs an addition operation first, and then a reduction operation is performed on the operation result of this group of accelerator cards and the operation result of another group of accelerator cards.
  • the accelerator card MC0 stores data (0,0)
  • the accelerator card MC1 stores data (1,2)
  • the accelerator card MC2 stores data (3,1)
  • the accelerator card MC3 stores data ( 2,4)
• the data in the accelerator card MC0 can be transferred to the accelerator card MC1 for operation to obtain the first set of results (1,2); synchronously or asynchronously, the data in the accelerator card MC2 can be transferred to the accelerator card MC3 for operation to obtain the second set of results (5,5).
• Then, an operation is performed on the first set of results and the second set of results to obtain the final reduction result (6,7).
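The grouped variant described above can be sketched as follows: two pairs reduce concurrently, then the two partial results are combined.

```python
# Grouped (tree-style) reduction sketch using the example data.

def vadd(a, b):
    """Element-wise addition of two tuples."""
    return tuple(x + y for x, y in zip(a, b))

mc0, mc1, mc2, mc3 = (0, 0), (1, 2), (3, 1), (2, 4)

group1 = vadd(mc0, mc1)  # MC0 -> MC1 gives (1, 2)
group2 = vadd(mc2, mc3)  # MC2 -> MC3 gives (5, 5), possibly in parallel
assert group1 == (1, 2) and group2 == (5, 5)

# Combining the two group results yields the final reduction result.
assert vadd(group1, group2) == (6, 7)
```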
• reduction operations may also be performed in acceleration components or acceleration devices. It should be understood that the acceleration device can also be regarded as acceleration components connected end to end.
• When a reduction operation is performed in an acceleration component or an acceleration device, it may include: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the acceleration units to obtain a second reduction result.
  • the first step above has been described above.
  • a local reduction operation can be performed in each acceleration unit first. After the reduction operation in the acceleration unit is completed, the accelerator card in the same acceleration unit will obtain the result of the local reduction operation, which is referred to as the first reduction result here.
• the first reduction results in all acceleration units may then be transferred to adjacent acceleration units and added. Similar to the reduction operation performed in one acceleration unit, the first acceleration unit transmits its first reduction result to the second acceleration unit; after the addition operation is performed in the accelerator cards of the second acceleration unit, the result continues to be passed on and added. After the last addition, the final result is passed to each acceleration unit.
• Since the acceleration units in the acceleration components above are not necessarily connected end to end, when transmitting the final result to each acceleration unit, the result can be propagated in reverse instead of being transmitted cyclically as when the acceleration units are connected end to end.
• the technical solution of the present disclosure does not specifically limit how the final result is propagated.
• the acceleration device may also be configured to perform a reduction operation including: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results in the multiple acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results in the multiple acceleration components to obtain a second reduction result.
  • the reduction operation may be performed first in the same acceleration unit, which has been described above, and will not be repeated here.
• Next, a reduction operation can be performed within each acceleration component, so that each accelerator card in each acceleration component obtains the local reduction result of that acceleration component;
• finally, the reduction operation is performed across the acceleration device, so that each accelerator card obtains the global reduction result of the acceleration device.
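The three-level procedure above (unit, then component, then device) can be sketched as follows. The concrete per-card values are hypothetical, chosen only to make the levels visible:

```python
# Hierarchical reduction sketch: first within each acceleration unit,
# then across the units of each component, finally across components.

def reduce_groups(groups):
    """Element-wise sum within each group, one result tuple per group."""
    return [tuple(sum(col) for col in zip(*g)) for g in groups]

# device -> components -> units -> per-card data (hypothetical values)
device = [
    [[(0, 0), (1, 2)], [(3, 1), (2, 4)]],   # component B1: units A1, A2
    [[(1, 1), (0, 3)], [(2, 2), (1, 0)]],   # component B2: units A1, A2
]

# First reduction: within each unit of each component.
first = [reduce_groups(units) for units in device]
# Intermediate reduction: across the units of each component.
intermediate = reduce_groups(first)
# Second reduction: across the components of the device.
second = tuple(sum(col) for col in zip(*intermediate))

assert intermediate == [(6, 7), (4, 6)]  # local results per component
assert second == (10, 13)                # global result for the device
```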
• In the present disclosure, the communication task queue and the communication task execution queue are separated; distinguishing the two enables operations such as fault tolerance or retransmission to be performed without the user's perception.
  • a communication task may be delivered to any accelerator card in the accelerator card system as an asynchronous task, and a communication task queue may be formed, and the communication task queue and the communication task execution queue may be located on different accelerator cards. Communication tasks in the same communication queue will be executed sequentially. The execution of these communication tasks can be completed by another accelerator card, so that the communication task queue and the communication task execution queue are located on different accelerator cards.
• a communication task can also be divided into multiple sub-communication tasks for execution.
  • the tasks can be executed concurrently. Therefore, after a total communication task is divided into multiple sub-communication tasks and parallelized, the execution efficiency of the communication task will be greatly improved.
• For communication tasks such as Allreduce, data can be transmitted from one accelerator card to another through different communication paths. Therefore, when a total communication task is divided into multiple sub-communication tasks executed in parallel, different communication paths can be used to perform these sub-communication tasks.
• data can be transmitted from port 1 of the accelerator card MC1 to port 1 of the accelerator card MC0, then from port 2 of the accelerator card MC0 to port 2 of the accelerator card MC2, and finally from port 1 of the accelerator card MC2 to port 1 of the accelerator card MC3;
• the data can also be directly transmitted from port 2 of the accelerator card MC1 to port 2 of the accelerator card MC3;
• data can be transmitted from port 4 of the accelerator card MC1 to port 4 of the accelerator card MC2, then from port 5 of the accelerator card MC2 to port 5 of the accelerator card MC0, and finally from port 4 of the accelerator card MC0 to port 4 of the accelerator card MC3;
• data can also be directly transmitted from port 5 of the accelerator card MC1 to port 5 of the accelerator card MC3.
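Splitting one transfer across two of the paths above can be sketched with threads. This is an illustrative sketch with hypothetical helper names, not the disclosure's implementation:

```python
# Split a total MC1 -> MC3 transfer into two sub-transfers carried over
# two disjoint paths, executed concurrently.

import threading

PATHS = [
    ["MC1", "MC0", "MC2", "MC3"],  # multi-hop path via MC0 and MC2
    ["MC1", "MC3"],                # direct path
]

received = [None] * len(PATHS)

def send_chunk(path_idx, chunk):
    # Stand-in for hop-by-hop transmission along PATHS[path_idx];
    # here we only record what arrives at MC3.
    received[path_idx] = chunk

data = list(range(8))
half = len(data) // 2
threads = [
    threading.Thread(target=send_chunk, args=(0, data[:half])),
    threading.Thread(target=send_chunk, args=(1, data[half:])),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# MC3 reassembles the chunks; the total communication task is complete.
assert received[0] + received[1] == data
```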
  • multiple status flags can be set, and these status flags can monitor the execution of communication tasks and also control the execution of other communication tasks.
  • the execution of the communication task will change the state flag, and the change of the state flag will correspondingly change the execution of other communication tasks.
  • the communication task queue can be loaded on any one of the accelerator cards; preferably, the communication task queue can be loaded on an accelerator card with a low load. It should be understood that when multiple accelerator cards participate in computing and communication, the load of each accelerator card may differ, and an accelerator card with a low load may preferably be selected to carry the communication task queue.
  • the accelerator card can receive communication tasks from the host or other accelerator cards, form a queue, and control the execution of each communication task in the queue. This method helps to make full use of accelerator card resources and improve the overall operating efficiency of the system.
  • Fig. 36a shows a flowchart of a method for executing a communication task according to an embodiment of the present disclosure
  • Fig. 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure.
  • the method of the present disclosure further includes: in operation S3610, dividing a total communication task in the communication task queue into a plurality of sub-communication tasks, each sub-communication task being placed in a different communication task execution queue; in operation S3620, executing the plurality of sub-communication tasks in parallel through different communication paths; and in operation S3630, in response to the completion of the sub-communication tasks, causing the total communication task to be completed.
  • the communication task queue can receive multiple total communication tasks, such as total communication tasks A, B and C, etc.
  • when communication tasks A, B, and C enter the communication task queue LQ, they are serialized, and the execution order is A, B, C. That is, while communication task A is being executed, communication tasks B and C need to wait.
  • the communication task B can only be executed after the execution of the communication task A is completed; and the communication task C can be executed only after the execution of the communication task B is completed.
  • such a task execution method cannot make full use of the parallel operation resources of the system; especially when the execution time of a certain communication task is particularly long, or its amount of communication data is particularly large, the execution of other communication tasks will be noticeably blocked and system performance will be affected.
  • the communication task in the communication task queue LQ can be regarded as a total communication task, and the total communication task is divided into multiple sub-communication tasks executed in parallel, and placed in the communication task execution queue PQ for execution.
  • the execution efficiency of the task can be significantly improved.
  • the total communication task B can be divided into a plurality of sub-communication tasks b1, b2, etc., and two sub-communication tasks b1 and b2 are taken as an example for description here.
  • the number of sub-communication tasks may be another number, which may be determined according to the topology of the accelerator card system. For example, if there are more communication paths from one accelerator card to another, the total communication task can be divided into more sub-communication tasks; conversely, it can be divided into fewer sub-communication tasks. Alternatively, the larger the amount of data involved in the total communication task, the more sub-communication tasks it can be divided into.
  • the communication tasks can be executed in parallel in the communication task execution queues PQ1 and PQ2.
  • the execution of the total communication task B and its sub-communication tasks must satisfy the following rules: 1. when the total communication task B has not started executing, the sub-communication tasks b1 and b2 must also be in the not-started state; 2. when the total communication task B starts executing, the sub-communication tasks b1 and b2 must also start executing; 3. other tasks after task B in the communication task queue LQ (such as C) must wait until task B has finished executing before they can execute; 4. when all the sub-communication tasks b1 and b2 have finished executing, the total communication task B is also considered to have finished executing.
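The four rules above amount to a small synchronization protocol. As an illustrative sketch only (the class and method names are hypothetical, not from the disclosure), they can be modeled with a start event shared by all sub-tasks and a completion counter:

```python
import threading

class TotalTask:
    """Sketch of total task B whose sub-tasks b1, b2 run in separate execution queues."""

    def __init__(self, num_subtasks: int):
        self.start_flag = threading.Event()   # rules 1-2: gates the start of sub-tasks
        self.done_flag = threading.Event()    # rules 3-4: gates tasks queued after B
        self._remaining = num_subtasks
        self._lock = threading.Lock()

    def start(self):
        # Releases every waiting sub-task at once (rule 2).
        self.start_flag.set()

    def run_subtask(self, work):
        self.start_flag.wait()                # rule 1: do not start before B starts
        work()                                # the actual communication work
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:          # rule 4: the last sub-task completes B,
                self.done_flag.set()          # letting successors (rule 3) proceed
```

A task C queued behind B would call `done_flag.wait()` before executing, which is rule 3.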
  • FIG. 37a shows a flowchart of dividing a total communication task in the communication task queue into a plurality of sub-communication tasks according to an embodiment of the present disclosure.
  • dividing a total communication task in the communication task queue into a plurality of sub-communication tasks (S3610) includes: in operation S36110, setting a first write flag that allows the total communication task to start executing; in operation S36120, setting a first waiting flag that prohibits the sub-communication tasks from starting to execute; and, in operation S36130, when the first write flag has not been executed, executing the first waiting flag to prohibit the sub-communication tasks from starting to execute.
  • Figure 37b shows a schematic diagram of inserting a marker in a queue according to one embodiment of the present disclosure. The specific implementation of FIG. 37a will be described in detail below in conjunction with FIG. 37b.
  • a write flag needs to be set; in other words, a write flag needs to be inserted before the communication task to be executed, exemplarily denoted F0. Only when the write flag F0 is executed, or when the write flag F0 is changed to allow the execution of the next task, does the subsequent communication task B start to execute.
  • the write flag can be inserted through an Atomic Operation.
  • the so-called atomic operation refers to an operation that is not interrupted by the thread scheduling mechanism; once this operation starts, it runs until the end without any context switching in between.
  • a waiting flag f0 may be inserted before each sub-communication task, and the waiting flag indicates that the execution of the sub-communication task after the flag is prohibited.
  • although the first write flag F0 and the waiting flag f0 in FIG. 37b are given different names, the write flag F0 and the waiting flag f0 point to the same flag, so as to detect whether that same flag has changed.
  • the insertion positions of the first write flag F0 and the waiting flag f0 shown in FIG. 37b are only for convenience of understanding, and the flags are not necessarily inserted among the sub-communication tasks exactly as shown in FIG. 37b.
  • executing the plurality of sub-communication tasks in parallel includes: in response to the first write flag being executed, turning off the first waiting flag, thereby executing the plurality of sub-communication tasks in parallel communication tasks.
  • the write flag F0 before the total communication task and the waiting flag f0 are associated with each other. Only when the write flag F0 allows the execution of the subsequent total communication task does the waiting flag f0 end and the corresponding sub-communication tasks start to run; while the write flag F0 does not allow the execution of the subsequent total communication task, the waiting flag f0 likewise keeps the execution of the sub-communication tasks in a waiting state.
  • Figure 38 shows a schematic diagram of a queue according to another embodiment of the present disclosure.
  • a second waiting flag may be set to prohibit execution of other communication tasks after the general communication task.
  • a second waiting mark can be inserted after the first writing mark.
  • when the second waiting flag is executed, it indicates that the other total communication tasks after the current total communication task need to be in a waiting state; before the current total communication task has finished executing, the other total communication tasks cannot start to execute.
  • when the first write flag F0 is executed, the total communication task B corresponding to it starts to execute, that is, the sub-communication tasks b1 and b2 of the total communication task B end their waiting state and start executing; after that, when the second waiting flag F1 is executed, the other tasks after the total communication task B enter a waiting state and are not executed while the total communication task B is executing.
  • FIG. 39 shows a schematic diagram of modifying the second waiting flag according to an embodiment of the present disclosure.
  • the second waiting flag F1 is modified each time a sub-communication task finishes, until all sub-communication tasks have been executed; and in response to all sub-communication tasks having been executed, the second waiting flag F1 is modified into a waiting-end flag, so that the execution of the total communication task is completed.
  • each sub-communication task b1 and b2 is executed in the execution queues PQ, and whenever a sub-communication task b1 or b2 finishes executing, the second waiting flag F1 can be modified accordingly, for example by incrementing it by one. The number of times the second waiting flag F1 is modified equals the number of sub-communication tasks that have finished executing.
  • the second waiting flag F1 can initially be set with a target value; as the sub-communication tasks b1 and b2 finish executing, the second waiting flag F1 gradually approaches the target value, and when the second waiting flag F1 reaches the preset target value, all sub-communication tasks b1 and b2 have been executed. It should be understood that there can be many ways to modify the second waiting flag F1; it is not limited to "adding one" as described above. For example, one can be subtracted on each modification until the second waiting flag F1 is less than a predetermined threshold. The present disclosure does not place any limitation on how the second waiting flag is modified.
  • the second waiting flag F1 reaching the target value can also be understood as a waiting-end flag, which means that the current total communication task B has finished executing and other tasks can start to be executed.
  • the total communication task can be divided randomly into a plurality of sub-communication tasks; it can be divided into a fixed number of sub-communication tasks; it can be divided into a number of sub-communication tasks corresponding to the number of processors; or it can be divided according to the number of communication paths, and so on.
  • a total communication task in the task queue may be divided into a plurality of sub-communication tasks with equivalent execution time.
  • the above equivalence of execution time does not mean that each sub-communication task itself has the same size. For example, if the communication speed of each port is 40Gbps, then for 160G of data, transmitting it over a single communication path theoretically takes 4 seconds. The 160G of data can therefore be split into multiple sub-communication tasks, for example 2, 3 or 4. When it is divided into 4 sub-communication tasks, 4 communication paths can transmit the data in parallel; in theory, the 160G transfer then takes only 1 second, i.e. 25% of the original communication time. Obviously, this helps to shorten the execution time of the total communication task.
  • the communication paths among the multiple communication paths do not necessarily have the same transmission speed. Therefore, when dividing a communication task, the speed of each communication path can be taken into account to adjust the amount of data assigned to each sub-communication task.
  • if the transmission speeds of the four communication paths are 16Gbps, 18Gbps, 22Gbps and 24Gbps respectively, then 160G of data can be divided into four sub-communication tasks of 32G, 36G, 44G and 48G, so that each communication path completes its data transmission in 2 seconds, thereby ensuring that all communication paths finish their communication tasks at the same time or substantially at the same time.
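The arithmetic above generalizes to any set of path speeds: to make all paths finish simultaneously, each chunk is sized proportionally to its path's bandwidth. A minimal sketch (the function name is hypothetical, not from the disclosure):

```python
def split_by_speed(total_gb: float, speeds_gbps: list) -> list:
    """Size each sub-communication task proportionally to its path's speed,
    so that every path finishes its transfer at the same time."""
    total_speed = sum(speeds_gbps)
    return [total_gb * s / total_speed for s in speeds_gbps]

# The example above: 160G of data over paths of 16, 18, 22 and 24 Gbps.
chunks = split_by_speed(160, [16, 18, 22, 24])
# chunks == [32.0, 36.0, 44.0, 48.0]; each path takes 160 / (16+18+22+24) = 2 seconds
```

Every chunk then satisfies chunk/speed = total/sum(speeds), which is exactly the equal-finish-time condition.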
  • the total communication task may be divided into a plurality of sub-communication tasks. It should be understood that dividing the total communication task into multiple sub-communication tasks also requires considering the total amount of data involved in each task: if the total amount of data involved in a total communication task is small, the total communication task need not be divided.
  • the method of the present disclosure further includes: in response to an error in one or more sub-communication tasks, re-running the sub-communication task in which the error occurred.
  • the sub-communication task in which the error occurred is further divided into a plurality of sub-tasks for parallel execution.
  • the faulty sub-communication task can be added to the communication task queue LQ as a new total communication task, further divided into multiple sub-tasks, and re-executed in multiple parallel execution queues PQ. Dividing the faulty sub-communication task into multiple sub-tasks for re-execution further improves the operating efficiency of the system: even if an error occurs in the execution of one sub-communication task, the time and processing resources needed to correct the error are greatly reduced.
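This fault-tolerance idea — treat a failed sub-task as a new total task, split it again, and retry only the failed part — can be sketched as a recursive retry. All names are illustrative, and the sequential loop stands in for the disclosure's parallel execution queues:

```python
def execute_total_task(chunks, run_chunk, resplit, depth=0, max_depth=3):
    """Execute each chunk of a total task; a chunk that fails is re-enqueued as a
    new total task, split into smaller chunks, and retried. Only the failed
    chunk is redone, never the whole total task."""
    for chunk in chunks:          # the disclosure runs these in parallel queues;
        try:                      # a sequential loop keeps the sketch short
            run_chunk(chunk)
        except RuntimeError:
            if depth >= max_depth:
                raise             # give up after a few nested re-splits
            execute_total_task(resplit(chunk), run_chunk, resplit,
                               depth + 1, max_depth)
```

For example, with a link that can only move 10 units at a time, a 40-unit task fails, is re-split into two 20-unit tasks, each of which is re-split into two 10-unit tasks that succeed — all 40 units arrive without restarting the original transfer.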
  • when the amount of communication data is large, a communication task can be divided into any number of sub-tasks and delivered to multiple different communication task execution queues for concurrent execution, thereby increasing bandwidth utilization.
  • a communication logical topology can be flexibly constructed for data communication, and the communication efficiency can be further improved.
  • the accelerator card that delivers the task can be any accelerator card in the accelerator card system.
  • since the communication task queue only performs waiting and write operations, while the real communication tasks are executed in the communication task execution queues, the communication task queue can correspond to any accelerator card in the accelerator card system. This helps reduce the probability of programming errors by developers, and an accelerator card with a small task load can be selected to perform the waiting and write control of the communication task queue.
  • the present disclosure also provides an electronic device comprising: one or more processors; and a memory having computer-executable instructions stored therein, when the computer-executable instructions are executed by the one or more processors , so that the electronic device executes the method as described above.
  • the present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
  • the technical solutions of the present disclosure can be applied to the field of artificial intelligence, and are implemented as or in an artificial intelligence chip.
  • the chip can exist alone or can be included in a computing device.
  • FIG. 33 is a schematic structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
  • FIG. 33 shows a combined processing device 3300 , which includes the aforementioned acceleration unit 3301 , an interconnection interface 3302 , other processing devices 3303 and a storage device 3304 .
  • the computing device according to the present disclosure interacts with other processing devices to jointly complete the operation specified by the user.
  • Other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
  • a neural network processor is a processor that uses neural networks to process machine learning data.
  • the number of processors included in other processing devices is not limited.
  • other processing devices serve as the interface between the machine learning computing device and external data and control, performing basic control such as data transfer and starting or stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
  • the interconnect interface is used to transfer data and control instructions between computing devices (including, for example, machine learning computing devices) and other processing devices.
  • the computing device obtains the required input data from other processing devices and writes it into the on-chip storage device of the computing device; it can obtain control instructions from other processing devices and write them into an on-chip control cache; it can also read data from the storage module of the computing device and transmit it to other processing devices.
  • the structure may further include a storage device 2608, and the storage device is respectively connected to the computing device and the other processing device.
  • the storage device is used to save the data in the computing device and the other processing devices, and is especially suitable for data that cannot be fully stored in the internal storage of the computing device or other processing devices.
  • the combined processing device can be used as an SOC system for mobile phones, robots, drones, video surveillance equipment and other equipment, effectively reducing the core area of the control part, improving the processing speed and reducing the overall power consumption.
  • the interconnection interface of the combined processing device is connected to certain components of the apparatus, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.
  • the present disclosure also discloses a chip package structure including the above-mentioned chip.
  • the present disclosure also discloses a board including the above chip package structure.
  • a board 3400 is provided.
  • the above board 3400 may also include other supporting components, including but not limited to: a storage device 3401, an interface device 3407, Control device 3405 and acceleration unit 3406.
  • the storage device is connected to the chip in the chip package structure through a bus, and is used for storing data.
  • the memory device may include groups of memory cells 3402. Each group of the memory cells is connected to the chip through a bus. It can be understood that each group of the storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
  • DDR doubles the speed of SDRAM without needing to increase the clock frequency: DDR allows data to be read on both the rising and falling edges of the clock pulse, so DDR is twice as fast as standard SDRAM.
  • the storage device may include four groups of the storage units. Each group of storage units may include a plurality of DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers, where 64 bits of each 72-bit controller are used for data transmission and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the chip, for controlling the data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip package structure.
  • the interface device is used to realize data transmission between the chip and an external device 3408 (eg, a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transmitted by the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface, and the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used for monitoring the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • the chip may include multiple processing chips, multiple processing cores or multiple processing circuits, and may drive multiple loads. Therefore, the chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
  • the present disclosure also discloses an electronic device or device, which includes the above board.
  • electronic equipment or devices include data processing devices, robots, computers, printers, scanners, tablet computers, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the vehicles include airplanes, ships and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative. For example, the division of the units is only a logical functional division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection of devices or units, which may be electrical, optical, acoustic, magnetic or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, and can also be implemented in the form of software program modules.
  • the integrated unit if implemented in the form of a software program module and sold or used as a stand-alone product, may be stored in a computer readable memory.
  • the computer software product is stored in a memory and includes several instructions to enable a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A method for executing an asynchronous task and a device. The method can be implemented in a computing apparatus. The computing apparatus can be comprised in a combined processing apparatus. The combined processing apparatus can also comprise a universal interconnection interface and another processing apparatus. The computing apparatus interacts with the other processing apparatus to jointly complete a computing operation specified by a user. The combined processing apparatus can also comprise a storage apparatus, which is separately connected to the computing apparatus and the other processing apparatus and is used for storing data of the computing apparatus and the other processing apparatus.

Description

A Method, Device and Computer Program Product for Executing Asynchronous Tasks

Cross-Reference to Related Applications

This application claims priority to the Chinese patent application filed on December 30, 2020 with application number 2020116106707, entitled "A Method, Device and Computer Program Product for Executing Asynchronous Tasks", and to the Chinese patent application filed on January 15, 2021 with application number 2021100550976, entitled "A Method and Device for Executing Communication Tasks in an Accelerator Card System".
Technical Field

The present disclosure relates to the field of computers, and more particularly, to serial and parallel execution of tasks.

Background

In the current deep network training process, in order to accelerate the convergence of network training, some or even all of the training tasks (including computing tasks, communication tasks, control logic tasks, etc.) are usually delivered to dedicated acceleration chips (such as GPU, MLU, TPU, etc.) for execution.

The network training tasks are delivered asynchronously by the CPU to an accelerator card for execution. The accelerator card has the concept of a task queue: tasks in the same queue are executed in the order in which they are issued, so tasks in the same queue have dependencies, while tasks in different queues can execute concurrently depending on the availability of hardware resources. However, current training tasks are usually delivered to only one queue for execution, which inevitably affects execution efficiency.

Current mainstream frameworks (such as Tensorflow and Pytorch) use only one dedicated communication queue (comm_queue) to execute communication tasks. When the communication library responsible for communication tasks obtains a task, it usually delivers the task directly to the framework's comm_queue or to the communication library's internal task queue (internal_queue) for execution; an example is NCCL, the communication library responsible for communication between GPUs. At present, communication tasks are all executed in one queue, and when an error occurs in a communication task, the task must be re-executed from the beginning, which reduces overall communication efficiency. In addition, the prior art cannot perform fault tolerance or retransmission of communication tasks without the user being aware of it.
Summary of the Invention

An object of the present disclosure (application No. 2020116106707) is to overcome the defects in the prior art that communication or computing resources cannot be fully utilized and that fault tolerance is low.

According to a first aspect of the present disclosure, there is provided a method for executing an asynchronous task, comprising: dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; executing the plurality of sub-tasks in parallel; and, in response to the completion of the sub-tasks, causing the total task to be completed.

According to a second aspect of the present disclosure, there is provided an apparatus for executing an asynchronous task, comprising: a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being in a different sub-task queue; a sub-task execution unit configured to execute the plurality of sub-tasks in parallel; and an ending unit configured to, in response to the completion of the sub-tasks, cause the total task to be completed.

According to a third aspect of the present disclosure, there is provided a chip comprising the apparatus as described above.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising the chip as described above.

According to a fifth aspect of the present disclosure, there is provided an electronic device, comprising: one or more processors; and a memory storing computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.

According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.

The technical solution of the present disclosure can distribute a total task into different sub-task queues, thereby accelerating the execution of the total task. In addition, even if an error occurs in the execution of one sub-task queue, there is no need to re-execute all the sub-tasks, thereby reducing the cost of fault tolerance or retransmission, lightening the burden of task execution, and enabling fault tolerance or retransmission to be handled without the user being aware of it.
2021100550976本公开的一个目的是克服现有技术中不能将一个任务下发到多个队列中并行执行的缺陷,不能充分利用通信或运算资源,并且容错能力较低的缺陷。2021100550976 One purpose of the present disclosure is to overcome the defects in the prior art that a task cannot be issued to multiple queues for parallel execution, communication or computing resources cannot be fully utilized, and fault tolerance is low.
According to a first aspect of the present disclosure, there is provided a method for executing a communication task in an accelerator card system, wherein the accelerator card system comprises a plurality of accelerator cards capable of communicating with one another, one of the plurality of accelerator cards being able to communicate with another accelerator card through a communication path. The method comprises: establishing a communication task queue, the communication task queue comprising a communication task and a state identifier for monitoring the execution state of the communication task; establishing a communication task execution queue for executing communication tasks between accelerator cards through the communication path; and, in response to the execution of the communication task, changing the state identifier so as to monitor the execution state of the communication task.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory storing computer-executable instructions which, when executed by the one or more processors, cause the electronic device to perform the method described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method described above.
At least one beneficial effect of the technical solution of the present disclosure is that distinguishing the communication task queue from the communication task execution queue allows fault-tolerance or retransmission operations to be performed without the user being aware of them. The technical solution of the present disclosure can also distribute a total communication task across different sub-communication-task queues, thereby accelerating the execution of the total communication task. Moreover, even if an error occurs in the execution of one sub-communication-task queue, there is no need to re-execute all the sub-communication tasks, which lightens the burden of task execution.
Description of the Drawings
The above and other objects, features, and advantages of exemplary embodiments of the present disclosure will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the accompanying drawings, several embodiments of the present disclosure are shown by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts, in which:
Fig. 1a is a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure;
Fig. 1b is a schematic diagram of a task dispatch queue and a task execution queue according to an embodiment of the present disclosure;
Fig. 2a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure;
Fig. 2b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of queues according to another embodiment of the present disclosure;
Fig. 4 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure;
Fig. 6 shows a combined processing apparatus;
Fig. 7 shows an exemplary board card;
Fig. 8a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure;
Figs. 8b, 9, 10, 11, and 12a-12c are schematic structural diagrams of acceleration units according to embodiments of the present disclosure;
Figs. 13-18 are schematic structural diagrams of acceleration assemblies according to embodiments of the present disclosure;
Figs. 19a-19c are schematic diagrams of acceleration assemblies represented as network topologies;
Fig. 20 is a schematic diagram of an acceleration apparatus including a plurality of acceleration units according to an embodiment of the present disclosure;
Fig. 21 is a schematic diagram of the network topology corresponding to an acceleration apparatus in one embodiment;
Fig. 22 is a schematic diagram of the network topology corresponding to an acceleration apparatus in another embodiment;
Figs. 23-27 are schematic diagrams of an acceleration apparatus including a plurality of acceleration assemblies according to embodiments of the present disclosure;
Fig. 28 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 29 is a schematic diagram of a matrix network topology based on wireless extension of the acceleration apparatus;
Fig. 30 is a schematic diagram of an acceleration apparatus in yet another embodiment of the present disclosure;
Fig. 31 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 32 is a schematic diagram of the network topology of yet another acceleration apparatus;
Fig. 33 is a schematic structural diagram of a combined apparatus in an embodiment of the present disclosure;
Fig. 34 is a schematic structural diagram of a board card in an embodiment of the present disclosure;
Fig. 35 is a flowchart of a method for executing a communication task in an accelerator card system according to an embodiment of the present disclosure;
Fig. 36a is a flowchart of a method for executing a communication task according to an embodiment of the present disclosure;
Fig. 36b is a schematic diagram of a task dispatch queue and a communication task execution queue according to an embodiment of the present disclosure;
Fig. 37a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure;
Fig. 37b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure;
Fig. 38 is a schematic diagram of queues according to another embodiment of the present disclosure; and
Fig. 39 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, the description, and the drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the description and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the description and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiments of the present disclosure have been described in detail above, and specific examples are used herein to explain the principles and implementations of the present disclosure; the description of the above embodiments is only intended to help understand the method of the present disclosure and its core idea. Meanwhile, changes or modifications made by those skilled in the art, based on the idea of the present disclosure, to the specific implementations and the scope of application all belong to the protection scope of the present disclosure. In summary, the content of this description should not be construed as limiting the present disclosure.
Current mainstream frameworks (such as TensorFlow and PyTorch) use only a single dedicated communication queue (comm_queue) to execute communication tasks. When the communication library responsible for communication tasks obtains a task, it usually issues the task directly into the framework's comm_queue or into the library's internal task queue (internal_queue) for execution; an example is NCCL, the communication library responsible for communication between GPUs. At present, communication tasks are all executed in a single queue, and when an error occurs in a communication task, the task has to be re-executed from the beginning, which reduces overall communication efficiency.
In the present disclosure, communication and computation tasks are usually issued as asynchronous tasks to different task queues on an acceleration chip (such as a GPU or MLU) for execution. Asynchronous tasks in the same queue are executed serially in the order in which they were issued, while tasks in different queues can be executed concurrently.
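The queue semantics just described, serial within one queue and concurrent across queues, can be sketched on the host side with ordinary threads. The following is a minimal illustration only: the names TaskQueue, issue, and sync are hypothetical, and a real accelerator runtime would dispatch work to hardware queues rather than host threads.

```python
import queue
import threading

class TaskQueue:
    """One asynchronous task queue: tasks run serially, in issue order.
    Separate TaskQueue instances run concurrently with each other."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            fn = self.q.get()
            fn()                 # serial: the next task starts only after this returns
            self.q.task_done()

    def issue(self, fn):
        self.q.put(fn)           # asynchronous issue: returns immediately

    def sync(self):
        self.q.join()            # block until every issued task has finished

log = []
q1, q2 = TaskQueue(), TaskQueue()
q1.issue(lambda: log.append("q1-a"))
q1.issue(lambda: log.append("q1-b"))
q2.issue(lambda: log.append("q2-a"))
q1.sync()
q2.sync()
# Within q1, "q1-a" always precedes "q1-b"; "q2-a" may interleave freely.
```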
It should be understood that although a communication task is used as an example above, the tasks herein are not limited to communication tasks and also involve various tasks such as the computation or training of a neural network.
Fig. 1a is a flowchart of a method for executing an asynchronous task according to an embodiment of the present disclosure; Fig. 1b is a schematic diagram of a task dispatch queue and a task execution queue according to an embodiment of the present disclosure.
As shown in Fig. 1a, the method of the present disclosure comprises: in operation S110, dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue; in operation S120, executing the plurality of sub-tasks in parallel; and in operation S130, in response to completion of the sub-tasks, completing the execution of the total task.
The above method is described in detail below with reference to Fig. 1b.
Fig. 1b includes two types of queues, namely a task allocation queue LQ and a task execution queue PQ. The task allocation queue can receive multiple tasks, for example tasks A, B, and C. When tasks A, B, and C enter the task allocation queue LQ, they are chained together serially, with the execution order A, B, C. That is, while task A is executing, tasks B and C must wait; task B can execute only after task A has finished; and task C must wait until task B has finished before it can execute. Such an execution scheme cannot make full use of the system's parallel resources: in particular, when one task takes especially long to execute or involves an especially large amount of communication data, the execution of the other tasks is clearly blocked, and system performance suffers.
A task in the task allocation queue LQ can be regarded as a total task, which is divided into multiple sub-tasks executed in parallel and placed in the task execution queues PQ for execution. When one total task is divided into multiple sub-tasks executed in parallel, the execution efficiency of the task can be improved significantly.
In the present disclosure, taking total task B as an example, total task B may be divided into multiple sub-tasks b1, b2, and so on; two sub-tasks b1 and b2 are used here for illustration. It should be noted that the number of sub-tasks may differ, depending on the capability to execute the sub-tasks and/or the size of the total task. For example, if the capability to execute each sub-task is strong, the total task may be divided into fewer sub-tasks; and, for the same execution capability, a larger total task may be divided into more sub-tasks.
After total task B is divided into sub-tasks b1 and b2, and these sub-tasks are placed in different execution queues PQ1 and PQ2 respectively, the two sub-tasks b1 and b2 can be executed in parallel in the execution queues PQ1 and PQ2.
The execution of total task B and its sub-tasks must satisfy the following rules: 1. When total task B has not yet started executing, sub-tasks b1 and b2 must also be in the not-started state; 2. When total task B starts executing, sub-tasks b1 and b2 must also start executing; 3. The other tasks in the task allocation queue LQ that follow task B (for example task C) must wait until task B has finished before they can execute; 4. When sub-tasks b1 and b2 have all finished executing, total task B must also be regarded as finished.
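The four rules above can be sketched with host-side synchronization primitives. This is a simplified model under the assumption that sub-tasks run as threads; the class and method names are hypothetical, and in the disclosure the gating is done with identifiers in accelerator queues rather than thread events.

```python
import threading

class TotalTask:
    """A total task B whose sub-tasks obey the four rules above (host-side sketch)."""
    def __init__(self, sub_task_fns):
        self.start_gate = threading.Event()   # rules 1 and 2: gate for b1, b2, ...
        self.done = threading.Event()         # rule 3: later tasks wait on this
        self.remaining = len(sub_task_fns)
        self.lock = threading.Lock()
        for fn in sub_task_fns:
            threading.Thread(target=self._run, args=(fn,)).start()

    def _run(self, fn):
        self.start_gate.wait()                # rule 1: idle while B has not started
        fn()
        with self.lock:                       # rule 4: B ends when all sub-tasks end
            self.remaining -= 1
            if self.remaining == 0:
                self.done.set()

    def start(self):
        self.start_gate.set()                 # rule 2: starting B starts every sub-task

    def wait(self):
        self.done.wait()                      # rule 3: a following task C blocks here

results = []
task_b = TotalTask([lambda: results.append("b1"), lambda: results.append("b2")])
task_b.start()
task_b.wait()                                 # returns once b1 and b2 both finished
```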
Fig. 2a is a flowchart of dividing a total task in a task queue into a plurality of sub-tasks according to an embodiment of the present disclosure.
Thus, according to an embodiment of the present disclosure, dividing a total task in the task queue into a plurality of sub-tasks (S110) comprises: in operation S1110, inserting into the queue a first write identifier that allows the total task to start executing; in operation S1120, inserting into the sub-task queues a first waiting identifier that prevents the sub-tasks from starting execution; and, in operation S1130, when the first write identifier has not been executed, executing the first waiting identifier so as to prevent the sub-tasks from starting execution.
Fig. 2b is a schematic diagram of inserting identifiers into queues according to an embodiment of the present disclosure. The specific implementation of Fig. 2a is described in detail below with reference to Fig. 2b.
First, in order to control the execution of task B, a write identifier, exemplarily denoted F0, needs to be inserted before the task to be executed. Only when execution reaches the write identifier F0, or when the write identifier F0 is changed to allow the following task to execute, does the subsequent task B start executing. If execution has not reached the write identifier F0, the corresponding task does not start. The write identifier can be inserted by an atomic operation. An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once started, it runs to completion without any context switch in between.
Correspondingly, a waiting identifier f0 may be inserted before each sub-task; the waiting identifier prevents the sub-tasks that follow it from executing. It should be understood that although the first write identifier F0 and the waiting identifier f0 in Fig. 2b have different names, they point to the same identifier, so that it can be detected whether that same identifier has changed.
According to an embodiment of the present disclosure, executing the plurality of sub-tasks in parallel comprises: in response to the first write identifier being executed, releasing the first waiting identifier, so that the plurality of sub-tasks are executed in parallel.
The write identifier F0 before the total task and the waiting identifier f0 before each sub-task are associated with each other. Only when the write identifier F0 allows the subsequent total task to execute does the waiting identifier f0 terminate, allowing the corresponding sub-tasks to start running; if the write identifier F0 does not allow the subsequent total task to execute, the waiting identifier f0 keeps the execution of the sub-tasks in a waiting state.
Fig. 3 is a schematic diagram of queues according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, a second waiting identifier may be inserted into the total task queue so as to prevent the execution of other tasks after the total task.
As shown in Fig. 3, in the total task queue, a second waiting identifier may be inserted after the first write identifier. When execution reaches this second waiting identifier, the other total tasks after the current total task must be in a waiting state; before the current total task has finished executing, the other total tasks cannot start.
From the above description it can be seen that when execution in the allocation queue reaches the first write identifier F0, the total task B corresponding to the first write identifier F0 starts executing, i.e., the sub-tasks b1 and b2 of total task B leave the waiting state and start executing. Thereafter, when execution reaches the second waiting identifier F1 in the allocation queue, the other tasks after total task B in the allocation queue enter the waiting state and are not executed while total task B is being executed.
Fig. 4 is a schematic diagram of modifying the inserted second waiting identifier according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, each time one sub-task finishes executing, the second waiting identifier F1 is modified, until all sub-tasks have finished executing; and, in response to all sub-tasks having finished executing, the second waiting identifier F1 is changed into a wait-end identifier, so that the execution of the total task is completed.
Next, as shown in Fig. 4, the sub-tasks b1 and b2 start executing in the execution queues PQ. Each time a sub-task b1 or b2 finishes, the second waiting identifier F1 can be modified accordingly, for example by incrementing it by one. The number of times the second waiting identifier F1 is modified equals the number of sub-tasks that have finished executing. Therefore, the second waiting identifier F1 can initially be assigned a target value; as sub-tasks b1 and b2 finish executing, the second waiting identifier F1 gradually approaches this target value, and when it reaches the preset target value, all sub-tasks b1 and b2 have finished executing. It should be understood that there are many possible ways to modify the second waiting identifier F1, not limited to "incrementing by one" as described above; for example, it could be decremented by one on each modification until it falls below a predetermined threshold. The present disclosure places no limitation on how the second waiting identifier is modified.
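The counting scheme for the second waiting identifier F1 can be sketched as follows. This is a hypothetical host-side model: the WaitIdentifier class is illustrative only, and in the disclosure F1 lives in the task allocation queue and is modified as accelerator-side sub-tasks complete.

```python
import threading

class WaitIdentifier:
    """Second waiting identifier F1: modified once per finished sub-task and
    released when it reaches a preset target value."""
    def __init__(self, target):
        self.target = target
        self.value = 0
        self.cond = threading.Condition()

    def modify(self):
        with self.cond:
            self.value += 1                  # "increment by one"; a decrement-to-zero
            if self.value >= self.target:    # scheme would work just as well
                self.cond.notify_all()

    def wait_end(self):
        with self.cond:
            while self.value < self.target:
                self.cond.wait()             # tasks after B stay blocked here

f1 = WaitIdentifier(target=2)                # total task B was split into 2 sub-tasks
done = []

def sub_task(name):
    done.append(name)                        # the sub-task's actual work
    f1.modify()                              # finished: modify F1

for name in ("b1", "b2"):
    threading.Thread(target=sub_task, args=(name,)).start()

f1.wait_end()                                # returns only after both sub-tasks finish
```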
"The second waiting identifier F1 reaching the target value" described above can also be understood as a wait-end identifier: it means that the current total task B has finished executing and other tasks can start.
When dividing the total task into multiple sub-tasks, many division schemes are possible: the total task can be divided into multiple sub-tasks at random; the total task can be divided into a fixed number of sub-tasks; the total task can be divided into a number of sub-tasks corresponding to the number of processors serving the execution queues PQ; and so on.
According to a preferred embodiment of the present disclosure, a total task in the task queue may be divided into multiple sub-tasks whose execution times are equivalent.
The execution-time equivalence described above does not mean that each sub-task itself is of the same size. For example, for 100M of computation data with 4 processing cores participating in the computation, in theory each processing core can take on 25M of the computation, so that all 4 processing cores finish in the same time, minimizing the total computation time. However, if a certain processing core also participates in other computation work so that its processing capability is lower than that of the other cores, then the respective processing capabilities of the 4 cores should be taken into account when assigning the corresponding sub-tasks, so that each core finishes its computation in the same or roughly the same time; this helps shorten the overall running time of the total task. Therefore, the principle for dividing the total task into multiple sub-tasks is to divide according to the capability of the resources executing the tasks, so that the multiple resources are equivalent in processing time.
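The capability-proportional division described above can be sketched as a simple allocation function. The numbers follow the 100M/4-core example in the text; the function name and the rounding policy (give any remainder to the last core) are assumptions made for illustration.

```python
def split_by_capability(total, capabilities):
    """Split `total` units of work in proportion to each core's capability,
    so that every core takes approximately the same time:
    share_i = total * c_i / sum(c)."""
    s = sum(capabilities)
    shares = [total * c // s for c in capabilities]
    shares[-1] += total - sum(shares)   # hand any rounding remainder to the last core
    return shares

# Equal cores: 100M is split evenly, 25M each.
print(split_by_capability(100, [1, 1, 1, 1]))   # [25, 25, 25, 25]
# One core is busy with other work (half the capability of the others):
print(split_by_capability(100, [2, 2, 2, 1]))   # [28, 28, 28, 16]
```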
According to an embodiment of the present disclosure, the total task is divided into multiple sub-tasks in response to the data volume of the total task exceeding a certain threshold. It should be understood that dividing the total task into multiple sub-tasks also requires considering the total amount of data involved in each task. If the total amount of data involved in a task is small, and the processing time of the total task is already less than the time needed to transfer out the data produced by executing it, there is no need to divide the total task. Similarly, if the time needed to read the data required by the total task constitutes the bottleneck, i.e., the time to read the data is greater than the time to execute the total task, there is likewise no need to further divide the total task.
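A possible decision rule following this paragraph can be sketched as below. The function name plan and the concrete thresholds are purely hypothetical; the sketch only restates the two criteria from the text, namely that the data volume exceeds a threshold and that computation, not I/O, is the bottleneck.

```python
def plan(total_bytes, threshold_bytes, compute_time, read_time, write_time, n_queues):
    """Decide how many execution queues to use (hypothetical policy):
    divide only if the task is large AND compute-bound."""
    if total_bytes <= threshold_bytes:
        return 1              # small task: a single queue, no division
    if read_time >= compute_time or write_time >= compute_time:
        return 1              # I/O-bound: dividing the compute buys nothing
    return n_queues           # large and compute-bound: divide across the queues

# 200M compute-bound task, 64M threshold, 4 queues available: divide.
print(plan(200 << 20, 64 << 20, compute_time=8.0, read_time=1.0, write_time=1.5, n_queues=4))  # 4
# 16M task below the threshold: keep it whole.
print(plan(16 << 20, 64 << 20, compute_time=8.0, read_time=1.0, write_time=1.5, n_queues=4))   # 1
```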
According to an embodiment of the present disclosure, the method of the present disclosure further comprises: in response to an error occurring in one or more sub-tasks, re-running the sub-task in which the error occurred.
When multiple sub-tasks are executed in the execution queues PQ, errors may occur, for example errors in computation results, errors in data throughput, errors in data transmission, and so on. In the traditional scheme, where the total task is not divided into multiple sub-tasks, once an error occurs during execution the entire total task has to be re-executed, which seriously wastes processing capacity and degrades the overall performance of the system.
In the solution of the present disclosure, since the multiple sub-tasks are placed in different execution queues that run independently and do not interfere with one another, an error in one sub-task during execution does not affect the execution of the other sub-tasks. Therefore, if an error occurs in the execution of one sub-task, only that sub-task needs to be re-run, without re-running all of the sub-tasks or the total task as a whole. While the failed sub-task is re-running, the other queues may be idle or may execute other sub-tasks at the same time. Therefore, dividing a total task into multiple parallel sub-tasks in the present disclosure improves the utilization of the system's processing resources and increases processing efficiency.
According to an embodiment of the present disclosure, in response to an error occurring in one or more sub-tasks, the sub-task in which the error occurred is further split into multiple child tasks for parallel execution.
When an error occurs in a sub-task and it needs to be re-executed, the failed sub-task can be added to the task allocation queue LQ as a new total task, further divided into multiple child tasks, and re-executed once in multiple parallel execution queues PQ. Further dividing a failed sub-task into multiple child tasks for re-execution further improves the running efficiency of the system, so that even if an error occurs in the execution of a sub-task, the time and processing resources spent correcting the error are greatly reduced.
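The retry behavior described here, re-splitting only the failed sub-task, can be sketched as follows. The workload, the transient-failure model, and all names are hypothetical, and the sub-tasks run serially here for clarity rather than in parallel queues.

```python
def execute(total_task, run_sub_task, split):
    """Run a total task as sub-tasks. If one sub-task fails, only that sub-task
    is treated as a new total task and re-executed as further-split child tasks;
    sub-tasks that already succeeded are not re-executed."""
    results = []
    for sub in split(total_task):
        try:
            results.append(run_sub_task(sub))
        except RuntimeError:
            results.extend(run_sub_task(child) for child in split(sub))
    return results

def split(task):
    mid = len(task) // 2
    return [task[:mid], task[mid:]] if mid else [task]

# Hypothetical workload: a task is a list of numbers, "executing" it sums them;
# the first chunk containing 13 fails once with a transient error.
state = {"failed": False}
def run_sub_task(chunk):
    if 13 in chunk and not state["failed"]:
        state["failed"] = True
        raise RuntimeError("transient error")
    return sum(chunk)

result = execute([1, 2, 13, 4], run_sub_task, split)
print(result)  # [3, 13, 4]: [13, 4] failed and was re-run as child tasks [13] and [4]
```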
The above tasks can be of many kinds, for example computation tasks, multiplication tasks, convolution computation tasks, weight computation tasks, communication tasks, and so on. Therefore, depending on the task, the allocation queue may be a communication queue used in a deep learning framework, for example the dedicated communication queue (Comm_queue) used in TensorFlow or PyTorch, and the execution queue may be an execution queue in a communication library, for example the internal execution queue (Internal_queue) in the NCCL communication library.
Fig. 5 is a schematic diagram of an apparatus for executing an asynchronous task according to an embodiment of the present disclosure. The apparatus comprises: a dividing unit M510 configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue; a sub-task execution unit M520 configured to execute the plurality of sub-tasks in parallel; and an ending unit M530 configured to, in response to completion of the sub-tasks, complete the execution of the total task.
The present disclosure also provides a chip including the apparatus shown in FIG. 5.
The present disclosure also provides an electronic device including the chip described above.
The present disclosure also provides an electronic device including: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method described above.
The technical solutions of the present disclosure can be applied in the field of artificial intelligence and implemented as, or in, an artificial intelligence chip. The chip may exist on its own or be included in a computing device.
FIG. 6 shows a combined processing device 600, which includes the above-described computing device 602, a general interconnection interface 604, and other processing devices 606. The computing device according to the present disclosure interacts with the other processing devices to jointly complete an operation specified by the user. FIG. 6 is a schematic diagram of the combined processing device.
The other processing devices include one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor. The number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computing device and external data and control, performing data transfers and basic control such as starting and stopping the machine learning computing device; the other processing devices can also cooperate with the machine learning computing device to complete computing tasks.
The general interconnection interface is used to transfer data and control instructions between the computing device (including, for example, a machine learning computing device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it to the on-chip storage of the computing device; it can obtain control instructions from the other processing devices and write them to an on-chip control cache; it can also read data from a storage module of the computing device and transmit it to the other processing devices.
Optionally, the structure may further include a storage device 608 connected to both the computing device and the other processing devices. The storage device is used to hold data of the computing device and the other processing devices, and is especially suitable for data that cannot be fully held in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as an SoC (system on chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the general interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, display, mouse, keyboard, network card, or Wi-Fi interface.
In some embodiments, the present disclosure also discloses a chip package structure including the above chip.
In some embodiments, the present disclosure also discloses a board card including the above chip package structure. Referring to FIG. 7, an exemplary board card is provided. In addition to the above chip 702, the board card may include other supporting components, including but not limited to: a storage device 704, an interface device 706, and a control device 708.
The storage device is connected via a bus to the chip in the chip package structure and is used to store data. The storage device may include multiple groups of storage units 710, each group connected to the chip via a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR doubles the speed of SDRAM without raising the clock frequency: it allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, each group including multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used to transfer data and 8 bits are used for ECC checking. In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transfer and data storage of each storage unit.
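The arithmetic implied above can be checked with a short helper: DDR transfers on both clock edges, so the effective transfer rate is twice the clock, and of each 72-bit DDR4 word only the 64 data bits count toward usable bandwidth (the 8 ECC bits do not). This is generic DDR arithmetic, not a figure taken from the disclosure:

```python
def ddr_bandwidth_gbs(clock_mhz, data_bits=64):
    """Effective DDR data bandwidth in GB/s for one controller:
    two transfers per clock cycle, counting only the data bits of
    the 72-bit word (64 data + 8 ECC). Illustrative arithmetic."""
    transfers_per_s = clock_mhz * 1e6 * 2   # double data rate
    return transfers_per_s * data_bits / 8 / 1e9
```

For example, a 1600 MHz clock with a 64-bit data path yields 25.6 GB/s per controller.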
The interface device is electrically connected to the chip in the chip package structure and is used to implement data transfer between the chip and an external device 712 (for example, a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface, with data to be processed transferred from the server to the chip through the standard PCIe interface. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can perform the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface device.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, it can be in different working states such as heavy-load and light-load. The control device can regulate the working states of the multiple processing chips, processing cores, and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus that includes the above board card.
The electronic device or apparatus includes a data processing device, robot, computer, printer, scanner, tablet, smart terminal, mobile phone, dash camera, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headset, mobile storage, wearable device, vehicle, household appliance, and/or medical device.
The vehicle includes an airplane, ship, and/or car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, or range hood; the medical device includes a nuclear magnetic resonance scanner, B-mode ultrasound scanner, and/or electrocardiograph.
It should be noted that, for brevity, the foregoing method embodiments are all described as series of combined actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure certain steps may be performed in other orders or concurrently. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a logical functional division, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, optical, acoustic, magnetic, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated units may be implemented in the form of hardware or in the form of software program modules.
If implemented in the form of a software program module and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product: the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to explain the principles and implementations of the present disclosure. The description of the above embodiments is intended only to help understand the methods of the present disclosure and their core ideas. Meanwhile, persons of ordinary skill in the art, based on the ideas of the present disclosure, may make changes in the specific implementations and scope of application. In summary, the contents of this specification should not be construed as limiting the present disclosure.
The technical solutions in the embodiments of the present disclosure will now be described clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present disclosure. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
The specific embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 35 shows a method for executing a communication task in an accelerator card system according to one embodiment of the present disclosure, where the accelerator card system includes multiple accelerator cards capable of communicating with each other, and one of the multiple accelerator cards can communicate with another accelerator card through a communication path. The method includes: in operation S3510, establishing a communication task queue that includes a communication task and a status identifier used to monitor the execution state of the communication task; in operation S3520, establishing a communication task execution queue used to execute communication tasks between accelerator cards through a communication path; and in operation S3530, in response to the execution of the communication task, changing the status identifier so as to monitor the execution state of the communication task.
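Operations S3510 to S3530 can be sketched as follows. The `Status` enumeration and the `run` callable (standing in for the actual inter-card transfer over a communication path) are illustrative assumptions of this sketch:

```python
from enum import Enum

class Status(Enum):
    PENDING = 0
    RUNNING = 1
    DONE = 2

def monitor_comm_task(task_queue, exec_queue, run):
    """Sketch of S3510-S3530: the communication task queue holds
    (task, status) entries; tasks are placed on an execution queue, and
    the status identifier is changed as execution progresses so the
    task's state can be monitored. `run` is a hypothetical stand-in
    for the inter-card transfer itself."""
    history = []
    for entry in task_queue:
        entry['status'] = Status.RUNNING   # S3530: change on start
        exec_queue.append(entry['task'])   # S3520: hand off for execution
        run(entry['task'])                 # execute over a communication path
        entry['status'] = Status.DONE      # S3530: change on completion
        history.append(entry['status'])
    return history
```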
First, various embodiments of the accelerator card system are described in detail below with reference to the accompanying drawings. The accelerator card system herein is composed of multiple accelerator cards capable of communicating with each other. These accelerator cards can be communicably connected through different communication paths, so that traffic from one accelerator card to another can travel along different communication paths, forming different communication topologies. It should be understood that "connection" below always means a communicable connection, i.e., the accelerator cards can communicate and transfer data with each other.
In addition, the accelerator card system described above may be formed as an acceleration unit, an acceleration assembly, an acceleration apparatus, or the like. It should be understood that although different terms are used depending on the specific scenario, they are all essentially systems that include multiple accelerator cards.
FIG. 8a is a schematic structural diagram of an acceleration unit in one embodiment of the present disclosure. According to one embodiment, the accelerator card system may include an acceleration unit, which may include M local-unit accelerator cards, each including internal ports through which it is connected to the other local-unit accelerator cards, where the M local-unit accelerator cards are logically formed into an accelerator card matrix of L*N scale, L and N being integers not less than 2.
As shown in FIG. 8a, an accelerator card matrix can be formed from multiple accelerator cards, which are interconnected so that data or instructions can be transferred and communicated. For example, accelerator cards MC00 to MC0N form row 0 of the accelerator card matrix, accelerator cards MC10 to MC1N form row 1, and so on, with accelerator cards MCL0 to MCLN forming row L.
It should be understood that, for ease of understanding in this context, accelerator cards in the same acceleration unit are called "local-unit accelerator cards", while accelerator cards in other acceleration units are called "external-unit accelerator cards". These names are merely for convenience of description and do not limit the technical solutions of the present disclosure.
Each accelerator card may have multiple ports, which can be connected either to local-unit accelerator cards or to external-unit accelerator cards. In the present disclosure, the ports connecting local-unit accelerator cards to each other may be called internal ports, and the ports connecting a local-unit accelerator card to an external-unit accelerator card may be called external ports. It should be understood that "external port" and "internal port" are merely convenient labels, and the same physical port can serve as either. This is described below.
It should be understood that M may be any integer; the M accelerator cards may be formed into a 1*M or M*1 matrix, or into matrices of other shapes. The acceleration unit of the present disclosure is not limited to a specific matrix size or form.
Furthermore, accelerator cards (for example, local-unit accelerator cards among themselves, or a local-unit accelerator card and an external-unit accelerator card) can be connected through a single communication path or through multiple communication paths. This is described in detail later.
It should also be understood that, in the context of the present disclosure, although the positions of multiple accelerator cards are described in terms of a rectangular network, the matrix so formed is not necessarily a matrix in its physical arrangement; the cards can be in any physical position. For example, multiple accelerator cards may form a straight line or may be arranged irregularly. The matrix above is purely logical, as long as the connections between the accelerator cards form a matrix relationship.
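Since the matrix is purely logical, the row/column arrangement can be expressed as a simple index mapping, independent of where each card sits physically. This is a sketch; the naming convention follows the MC&lt;row&gt;&lt;col&gt; labels used above:

```python
def card_position(index, n_cols):
    """Map a flat accelerator-card index to its logical (row, col)
    position in an L*N matrix. Only the connection relationships, not
    the physical placement, need to follow this layout."""
    return divmod(index, n_cols)
```

For a 2*4 matrix (n_cols=4), card 5 occupies logical position (1, 1), i.e. it plays the role of MC11 regardless of its physical location.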
According to one embodiment of the present disclosure, M may be 4, so that four local-unit accelerator cards logically form a 2*2 accelerator card matrix; M may be 9, so that nine local-unit accelerator cards logically form a 3*3 matrix; M may be 16, so that sixteen local-unit accelerator cards logically form a 4*4 matrix. M may also be 6, so that six local-unit accelerator cards logically form a 2*3 or 3*2 matrix; M may also be 8, so that eight local-unit accelerator cards logically form a 2*4 or 4*2 matrix.
According to one embodiment of the present disclosure, each local-unit accelerator card is connected to at least one other local-unit accelerator card through two paths.
In the topologies described in the present disclosure, two local-unit accelerator cards may be connected through a single communication path or through multiple (for example, two) paths, as long as the number of ports is sufficient. Connecting through multiple communication paths helps ensure the reliability of communication between the accelerator cards and helps form different topologies. This is explained and described in more detail in the examples below.
According to one embodiment of the present disclosure, the diagonal local-unit accelerator cards at the four corners of the accelerator card matrix are connected through two paths. For a matrix, the two pairs of accelerator cards on the matrix diagonals are preferably connected; for certain topologies, connecting the diagonally positioned accelerator cards helps form two complete communication loops. This is explained and described in more detail in the examples below.
More specifically, according to one embodiment of the present disclosure, at least one of the local-unit accelerator cards may include an external port. For example, each acceleration unit may include four local-unit accelerator cards, each with six ports, of which four are internal ports used to connect to the other three local-unit accelerator cards; the remaining two ports of at least one local-unit accelerator card are external ports used to connect to external-unit accelerator cards.
It should be understood that, of the six ports of each local-unit accelerator card, four can be used to connect to the other local-unit accelerator cards, while the two spare ports can be used to connect to accelerator cards in other acceleration units. These spare ports may also remain idle, connected to no external device, or may be directly or indirectly connected to other devices or ports.
For purposes of example and simplification, the acceleration units, acceleration assemblies, acceleration apparatuses, and electronic devices below are all described taking an acceleration unit that includes four accelerator cards as an example. It should be understood that each acceleration unit may include a greater or smaller number of accelerator cards.
For ease of description, the acceleration unit may include four accelerator cards, namely a first accelerator card, a second accelerator card, a third accelerator card, and a fourth accelerator card, each provided with internal ports and external ports, and each connected to the other three accelerator cards through its internal ports.
FIG. 8b is a schematic structural diagram of an acceleration unit in one embodiment of the present disclosure. The acceleration unit 800 includes four accelerator cards: MC0, MC1, MC2, and MC3. Each of the four accelerator cards may include external ports and internal ports. The internal ports of accelerator card MC0 are connected to the internal ports of accelerator cards MC1, MC2, and MC3; the internal ports of MC1 are connected to those of MC2 and MC3; and the internal ports of MC2 are connected to those of MC3. That is, the internal ports of each accelerator card are connected to the internal ports of the other three accelerator cards. Interconnecting the internal ports of the four accelerator cards enables information exchange among them. By using the interconnection among the four accelerator cards in the acceleration unit, the embodiments of the present disclosure can improve the computing capability of the acceleration unit, achieve high-speed processing of massive data, and make the path between each accelerator card and every other accelerator card the shortest, with the lowest communication latency.
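The fully connected interconnect of MC0 through MC3 can be modeled as a complete graph, in which every pair of cards shares a direct link and is therefore one hop apart. This is an illustrative model of the connectivity, not the hardware wiring:

```python
import itertools

def build_full_mesh(n):
    """Fully connected topology of n accelerator cards: every card's
    internal ports link it to every other card, so a complete graph on
    n nodes has n*(n-1)/2 links and every pair is one hop apart."""
    return {frozenset(pair) for pair in itertools.combinations(range(n), 2)}

links = build_full_mesh(4)  # the MC0..MC3 unit: 6 direct links
```

With four cards this yields six links, matching the pairwise connections enumerated above, which is why the path between any two cards is the shortest possible (a single hop).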
如上文所述,本公开中的加速卡的数量可以不限于四个,而是可以为其他数量。例如,在一个实施例中,加速卡的数目N等于3,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他两个加速卡相连接,实现三个加速卡间的互连。在另一个实施例中,加速卡的数目N等于5,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他四个加速卡相连接,实现五个加速卡间的互连,从而提高加速单元的计算能力,并且实现高速的处理海量数据。在又一个实施例中,加速卡的数目N大于5,每个加速卡中均设置有内接端口和外接端口,每个加速卡通过内接端口与其他所有加速卡均相连接,实现N个加速卡间的互连,实现高速的处理海量数据。As described above, the number of accelerator cards in the present disclosure may not be limited to four, but may be other numbers. For example, in one embodiment, the number N of accelerator cards is equal to 3, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other two accelerator cards through the internal port, so as to realize three interconnection between accelerator cards. In another embodiment, the number N of accelerator cards is equal to 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other four accelerator cards through the internal port, so that five The interconnection between acceleration cards increases the computing power of the acceleration unit and realizes high-speed processing of massive data. In yet another embodiment, the number N of accelerator cards is greater than 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to all other accelerator cards through the internal port, so that N Accelerate the interconnection between cards to achieve high-speed processing of massive data.
Based on the acceleration unit 800 provided in FIG. 8b, each accelerator card may further be connected to at least one other accelerator card through two paths. Specifically, there may be, for example, three connection schemes: in the first scheme, each accelerator card is connected through two paths to one of the other three accelerator cards; in the second scheme, each accelerator card is connected through two paths to two of the other three accelerator cards; in the third scheme, each accelerator card is connected through two paths to all of the other three accelerator cards, in which case each accelerator card may have additional ports. To facilitate understanding of these dual-path connection schemes, the first scheme is taken as an example and described below with reference to FIG. 9.
FIG. 9 is a schematic structural diagram of an acceleration unit according to another embodiment of the present disclosure. In the acceleration unit 900 shown in FIG. 9, each accelerator card may be connected to at least one other accelerator card through two paths; for example, accelerator card MC0 and accelerator card MC2 in the figure may be connected through two paths, and accelerator card MC1 and accelerator card MC3 may be connected through two paths. With this arrangement, there are two links (or paths) for information exchange between the two accelerator cards, so that when one link fails, the other link still connects the two accelerator cards, which effectively improves the reliability of the acceleration unit.
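The failover benefit of the duplicated diagonal paths can be sketched by modeling every physical link individually and checking that connectivity survives a single-link failure. This is a toy model only; real fault handling in the accelerator firmware is not described in this disclosure:

```python
# Each entry is one physical link; the diagonal pairs are duplicated.
links = [
    ("MC0", "MC1"), ("MC0", "MC3"), ("MC1", "MC2"), ("MC2", "MC3"),
    ("MC0", "MC2"), ("MC0", "MC2"),   # two paths between diagonal pair MC0-MC2
    ("MC1", "MC3"), ("MC1", "MC3"),   # two paths between diagonal pair MC1-MC3
]

def connected(links, cards=("MC0", "MC1", "MC2", "MC3")):
    """Breadth-first check that every card is reachable from MC0."""
    seen, frontier = {"MC0"}, ["MC0"]
    while frontier:
        node = frontier.pop()
        for a, b in links:
            for nxt in ((b,) if a == node else (a,) if b == node else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen == set(cards)

broken = list(links)
broken.remove(("MC0", "MC2"))          # one diagonal link fails
print(connected(links), connected(broken))  # True True
```

With one of the two MC0-MC2 links removed, the remaining duplicate keeps all four cards mutually reachable, which is the redundancy the embodiment describes.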
The connections between the acceleration unit and its multiple accelerator cards according to the present disclosure have been described above by way of example with reference to FIG. 8 and FIG. 9. Those skilled in the art will understand that the above description is illustrative rather than limiting; for example, the arrangement of the accelerator cards in the acceleration unit is not limited to the forms shown in FIG. 8 and FIG. 9. In one embodiment, the four accelerator cards of the acceleration unit may be logically laid out in a quadrilateral arrangement, which is described below with reference to FIG. 10.
FIG. 10 is a schematic structural diagram of an acceleration unit according to yet another embodiment of the present disclosure. In the acceleration unit 1000 shown in FIG. 10, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically laid out in a quadrilateral arrangement, with the four accelerator cards occupying the four vertices of the quadrilateral. The wiring among accelerator cards MC0, MC1, MC2, and MC3 then forms a quadrilateral, which makes the wiring layout clearer and easier to route. It should be noted that although the four accelerator cards in FIG. 10 are drawn as a rectangle, or a 2*2 matrix, this is a logical interconnection diagram drawn in rectangular form merely for convenience of description; the specific quadrilateral may be chosen freely, for example a parallelogram, trapezoid, or square. In the actual layout and routing, the four accelerator cards may be arranged arbitrarily; for example, in an actual machine, the four accelerator cards may be placed side by side in a row, in the order MC0, MC1, MC2, MC3. It should also be understood that the logical quadrilateral described in this embodiment is exemplary; in practice, multiple accelerator cards may be arranged in many different shapes, of which the quadrilateral is only one. For example, when the number of accelerator cards is five, they may be logically arranged in a pentagon.
Based on the connection relationship of the acceleration unit 900 provided in FIG. 9, reference is further made to FIG. 11, which is a schematic structural diagram of an acceleration unit according to yet another embodiment of the present disclosure. In the acceleration unit 1100 shown in FIG. 11, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically laid out in a quadrilateral arrangement, each occupying one of the four vertices of the quadrilateral. As further shown in the figure, the internal ports of accelerator card MC1 and the internal ports of accelerator card MC3 may be connected through two paths, and the internal ports of accelerator card MC0 and the internal ports of accelerator card MC2 may be connected through two paths. For the acceleration unit 1100, this not only simplifies the wiring but also improves reliability.
FIG. 12a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure. In the acceleration unit 1200 shown in FIG. 12a, the numeric labels on each accelerator card denote ports; each accelerator card may include six ports, namely port 0, port 1, port 2, port 3, port 4, and port 5. Among them, port 1, port 2, port 4, and port 5 are internal ports, while port 0 and port 3 are external ports. For the four accelerator cards MC0, MC1, MC2, and MC3, the two external ports of each accelerator card can be connected to other acceleration units for interconnection among multiple acceleration units, and the four internal ports of each accelerator card can be used to interconnect with the other three accelerator cards within the same acceleration unit.
As further shown in FIG. 12a, the four accelerator cards may be logically arranged in, for example, a quadrilateral. Accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship: port 2 of MC0 is connected to port 2 of MC2, and port 5 of MC0 is connected to port 5 of MC2, so that there are two links for communication between accelerator card MC0 and accelerator card MC2. Likewise, accelerator card MC1 and accelerator card MC3 may be in a diagonal relationship: port 2 of MC1 is connected to port 2 of MC3, and port 5 of MC1 is connected to port 5 of MC3, so that there are two links for communication between accelerator card MC1 and accelerator card MC3.
With this arrangement, each accelerator card has two external ports and four internal ports, and for each of the two diagonal pairs of accelerator cards, the two cards of the pair can be connected through two internal ports to form two links, which effectively improves the security and stability of the acceleration unit. Moreover, the logical quadrilateral layout of the four accelerator cards makes the wiring of the entire acceleration unit reasonable and clear, facilitating the wiring work within each acceleration unit. It should further be noted that, among the interconnection lines between the four accelerator cards shown in FIG. 12b, the connection line between port 1 of accelerator card MC1 and port 1 of MC0, the connection line between port 2 of accelerator card MC0 and port 2 of MC2, the connection line between port 1 of accelerator card MC2 and port 1 of MC3, and the connection line between port 2 of accelerator card MC3 and port 2 of MC1 together form a vertical figure-8 network, as shown in FIG. 12b. Similarly, the connection line between port 4 of accelerator card MC1 and port 4 of MC2, the connection line between port 5 of accelerator card MC2 and port 5 of MC0, the connection line between port 4 of accelerator card MC0 and port 4 of MC3, and the connection line between port 5 of accelerator card MC3 and port 5 of MC1 together form a horizontal figure-8 network, as shown in FIG. 12c. These two fully connected square networks can form a double-ring structure, which provides redundant backup and enhances system reliability.
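The two figure-8 networks just described can be checked programmatically: each four-link set forms a cycle that touches all four cards, and the two sets share no ports, so together they behave as the double ring described. A sketch, where the (card, port) tuples simply transcribe the connections of FIGS. 12b and 12c:

```python
# Each link is ((card, port), (card, port)), transcribed from FIGS. 12b/12c.
vertical = [(("MC1", 1), ("MC0", 1)), (("MC0", 2), ("MC2", 2)),
            (("MC2", 1), ("MC3", 1)), (("MC3", 2), ("MC1", 2))]
horizontal = [(("MC1", 4), ("MC2", 4)), (("MC2", 5), ("MC0", 5)),
              (("MC0", 4), ("MC3", 4)), (("MC3", 5), ("MC1", 5))]

def is_ring(links):
    """Four links form a ring when each of the four cards sits on exactly
    two link ends (connectivity is assumed for this small sketch)."""
    degree = {}
    for (a, _), (b, _) in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return len(degree) == 4 and all(d == 2 for d in degree.values())

ports_used = lambda links: {end for link in links for end in link}
print(is_ring(vertical), is_ring(horizontal))        # True True
print(ports_used(vertical) & ports_used(horizontal))  # set(): no shared ports
```

Because the two rings occupy disjoint ports, either ring can keep carrying traffic if a line of the other fails, which is the redundancy claim made for the double-ring structure.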
According to an embodiment of the present disclosure, the accelerator card described herein may be a Mezzanine Card (MC card for short), which may be a separate circuit board. The MC card may carry an ASIC chip and some necessary peripheral control circuits, and may be connected to the baseboard through a mezzanine connector, through which the power supply and control signals on the baseboard are delivered to the MC card. According to another embodiment of the present disclosure, the internal ports and/or external ports described herein may be SerDes ports. For example, in one embodiment, each MC card may provide six bidirectional SerDes ports, each SerDes port having eight lanes at a data rate of 56 Gbps, so that the total bandwidth of each port can reach up to 400 Gbps. This supports massive data exchange between accelerator cards and helps the acceleration unit process massive data at high speed.
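The per-port figure can be reproduced with simple arithmetic. Eight lanes at a raw 56 Gbps each give 448 Gbps; the cited "up to 400 Gbps" implies an effective lane rate of about 50 Gbps, so the line-coding/protocol overhead assumed below is an illustrative guess, not a figure stated in this disclosure:

```python
lanes_per_port = 8
raw_lane_gbps = 56            # raw SerDes signalling rate per lane

raw_port_gbps = lanes_per_port * raw_lane_gbps
print(raw_port_gbps)          # 448 Gbps raw aggregate per port

# Assumed effective lane rate of ~50 Gbps after line-coding and protocol
# overhead, which yields roughly the 400 Gbps cited in the text.
effective_lane_gbps = 50
print(lanes_per_port * effective_lane_gbps)  # 400 Gbps usable
```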
The term SerDes mentioned above is a portmanteau of the English words Serializer and De-Serializer. SerDes interfaces can be used to build high-performance processor clusters. The main function of a SerDes is to convert multiple low-speed parallel signals into a serial signal at the transmitting end, transmit it over the transmission medium, and finally convert the high-speed serial signal back into low-speed parallel signals at the receiving end; it is therefore well suited to end-to-end long-distance, high-speed transmission. In another embodiment, the external ports of the accelerator card may be connected to the QSFP-DD interfaces of other acceleration units, where the QSFP-DD interface is an optical module interface commonly used in SerDes technology that, in combination with cables, can be used to interconnect with other external devices.
Further, according to yet another embodiment of the present disclosure, one acceleration unit may carry four accelerator cards, and the interconnection of the four accelerator cards may be implemented with printed circuit board (PCB) traces. On a high-speed board material with a low dielectric constant, reasonable layout and routing can preserve signal integrity to the greatest extent, so that the communication bandwidth between the four accelerator cards approaches the theoretical value.
In the acceleration unit disclosed herein, each of the four accelerator cards is connected to the other three accelerator cards through its internal ports, so that every accelerator card can communicate directly with the other three. Such a communication architecture is a fully connected quad network topology. The advantage of this fully connected architecture is that the path between each accelerator card and every other accelerator card is the shortest, the total Hop count is minimal, and the latency is the lowest. The present disclosure uses Hop to describe system latency; in communication, Hop denotes the number of hops, that is, the number of communication steps. Specifically, Hop here refers to the shortest path that starts from one node, traverses all nodes in the network, and returns to the initial node. The fully connected square network topology formed by interconnecting the four accelerator cards has the shortest such delay, and the double-ring structure formed by interconnecting the two diagonal pairs of accelerator cards improves the robustness of the system, so that services can continue to run normally even when a single accelerator card fails. When performing various arithmetic and logic operations, each ring of the double-ring structure can complete part of the computation, thereby improving overall computational efficiency and maximizing the use of the topology bandwidth.
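The Hop metric defined above (shortest closed tour from one node, through all nodes, back to the start) can be computed by brute force for small topologies. The sketch below compares the fully connected quad with a simple four-node chain; the node numbering and the chain topology are illustrative choices for comparison, not part of the disclosure:

```python
from itertools import permutations

def dist(adj, a, b):
    """Shortest hop distance between nodes a and b (breadth-first search)."""
    seen, frontier, d = {a}, [a], 0
    while frontier:
        if b in frontier:
            return d
        frontier = [n for f in frontier for n in adj[f] if n not in seen]
        seen |= set(frontier)
        d += 1
    return None

def total_hop(adj):
    """Shortest closed tour from node 0 visiting every node: the Hop metric."""
    nodes = sorted(adj)
    best = None
    for order in permutations(nodes[1:]):
        tour = [nodes[0], *order, nodes[0]]
        cost = sum(dist(adj, x, y) for x, y in zip(tour, tour[1:]))
        best = cost if best is None else min(best, cost)
    return best

quad = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(total_hop(quad), total_hop(chain))  # 4 6
```

For the fully connected quad, every leg of the tour is a single hop, so the Hop total is 4, the minimum possible for four nodes; the chain needs 6 hops, illustrating why the fully connected arrangement gives the lowest latency.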
Multiple embodiments of the acceleration unit according to the present disclosure have been described above with reference to FIGS. 8a-12c. Based on the above acceleration unit, the present disclosure further discloses an acceleration assembly that may include multiple such acceleration units; several embodiments of the acceleration assembly are described below by way of example.
FIG. 13 is a schematic structural diagram of an acceleration assembly according to an embodiment of the present disclosure. As shown in FIG. 13, the acceleration assembly 1300 may include n of the above acceleration units; in other words, the accelerator card system may be embodied as an acceleration assembly that includes multiple acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An, where acceleration unit A1 and acceleration unit A2 are connected through external ports, and acceleration unit A2 and acceleration unit A3 are connected through external ports; that is, the acceleration units are connected to one another through their external ports. In one embodiment, the external port of accelerator card MC0 in acceleration unit A1 may be connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 may be connected to the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected through the external ports of their MC0 accelerator cards.
Those skilled in the art will understand that the connections between acceleration units in the present disclosure are not limited to connections through the external ports of accelerator card MC0; they may also include, for example, one or more of connections through the external ports of accelerator card MC1, the external ports of accelerator card MC2, and the external ports of accelerator card MC3. That is, in the present disclosure, the connection between acceleration unit A1 and acceleration unit A2 may include one or more of the following: the external port of MC0 in A1 connected to the external port of MC0 in A2; the external port of MC1 in A1 connected to the external port of MC1 in A2; the external port of MC2 in A1 connected to the external port of MC2 in A2; and the external port of MC3 in A1 connected to the external port of MC3 in A2. Similarly, the connection between acceleration unit A2 and acceleration unit A3 may include one or more of the following: the external port of MC0 in A2 connected to the external port of MC0 in A3; the external port of MC1 in A2 connected to the external port of MC1 in A3; the external port of MC2 in A2 connected to the external port of MC2 in A3; and the external port of MC3 in A2 connected to the external port of MC3 in A3. The same applies, by analogy, up to the connection between acceleration unit An-1 and acceleration unit An. It should be noted that the above description is exemplary; for example, the connections between different acceleration units need not be limited to connections between identically numbered accelerator cards, and may be configured as needed.
It should be noted that FIG. 13 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 as in the figure; it may also be set to, for example, 2 or 3. The connection relationship between two acceleration units is the same as or similar to that between acceleration units A1 and A2 described above, and the connection relationship among three acceleration units is the same as or similar to that among acceleration units A1, A2, and A3 described above, so the details are not repeated here.
In addition, the structures of the multiple acceleration units in the acceleration assembly may be the same or different. For convenience of illustration, FIG. 13 shows multiple acceleration units with identical structures, but in practice the structures of the multiple acceleration units may differ. For example, in some acceleration units the accelerator cards are laid out as a polygon, while in others they are laid out in a line; in some acceleration units the accelerator cards are connected by a single line, while in others the accelerator cards are connected by two links; some acceleration units include four accelerator cards, while others include three or five, and so on. That is, the structure of each acceleration unit can be configured individually, and the structures of different acceleration units may be the same or different.
With the acceleration assembly disclosed herein, not only can the accelerator cards within an acceleration unit be interconnected, but the accelerator cards of different acceleration units can also be interconnected, so that a hybrid three-dimensional network can be constructed. With this arrangement, each accelerator card can share data through the interconnection between acceleration units while processing data; since shared data can be obtained directly, the data propagation path and time are reduced, which significantly improves data processing efficiency.
FIG. 14 is a schematic structural diagram of an acceleration assembly according to another embodiment of the present disclosure. As shown in FIG. 14, the acceleration assembly 1400 may include n of the aforementioned acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The multiple acceleration units in the acceleration assembly 1400 may logically form a multi-layer structure (shown by dashed lines in the figure); each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected through external ports to the accelerator cards of another acceleration unit. This layer-by-layer configuration allows each accelerator card to share data over high-speed serial links while performing high-speed computation, enabling virtually unlimited interconnection of accelerator cards to meet customizable computing-power requirements and to flexibly configure the hardware computing power of a processor cluster. As further shown in the figure, the acceleration unit of each layer may include four accelerator cards; each acceleration unit may be logically laid out as a quadrilateral, with the four accelerator cards arranged at its four vertices.
It should be understood by those skilled in the art that the acceleration assembly described above in conjunction with FIG. 14 is exemplary rather than limiting. For example, the structures of the multiple acceleration units may be the same or different. The number of layers of the acceleration assembly may be 2, 3, 4, or more, and may be set freely as required. For every two connected acceleration units, the number of connection paths between them may be 1, 2, 3, or 4. For ease of understanding, exemplary descriptions follow with reference to FIGS. 15-19.
FIG. 15 is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. As shown in FIG. 15, the acceleration assembly 1401 may include 2 acceleration units, connected through a single path; specifically, the external port of accelerator card MC0 in acceleration unit A1 may, for example, be connected to the external port of accelerator card MC0 in acceleration unit A2, thereby enabling information exchange between acceleration unit A1 and acceleration unit A2.
As shown in FIG. 16, the acceleration assembly 1402 may include 2 acceleration units, connected through two paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2. In this way, when one of the paths fails, the other line still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
Reference is now made to FIG. 17, which is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. In the acceleration assembly 1403 shown in FIG. 17, there may be 2 acceleration units, connected through three paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2. Thus, even when two of the paths fail, a remaining path still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
Reference is now made to FIG. 18, which is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. In the acceleration assembly 1404 shown in FIG. 18, there may be 2 acceleration units, connected through four paths: for example, the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2; the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2; the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2; and the external port of accelerator card MC3 in acceleration unit A1 is connected to the external port of accelerator card MC3 in acceleration unit A2. Thus, even when three of the paths fail, a remaining path still supports communication between the acceleration units, further improving the reliability of the acceleration assembly.
FIG. 19a is a schematic diagram of an acceleration assembly represented as a network topology. As shown in FIG. 19a, the acceleration assembly 1405 may include two acceleration units, each of which may include four accelerator cards; within each acceleration unit, there may be two links between accelerator card MC1 and accelerator card MC3, and two links between accelerator card MC0 and accelerator card MC2. The acceleration assembly 1405 in the left diagram of FIG. 19a can be drawn in the three-dimensional form shown in the right diagram. In the right diagram of FIG. 19a, the circles represent accelerator cards and the lines represent link connections; the number 0 in a circle denotes accelerator card MC0, 1 denotes accelerator card MC1, 2 denotes accelerator card MC2, and 3 denotes accelerator card MC3. The right diagram still depicts the acceleration assembly 1405, merely in another form of representation, namely as a network topology. The numbers embedded in the vertical lines of the right diagram indicate the connected port numbers; for example, the MC0 cards of the two acceleration units are connected through port 0, the MC1 cards through port 0, the MC2 cards through port 3, and the MC3 cards through port 3.
In the right diagram of FIG. 19a, each acceleration unit is regarded as a node; the two nodes together hold 8 accelerator cards, so the two nodes constitute a so-called 8-card interconnection. The one-machine-four-card interconnection inside each node is fixed. When the two nodes are interconnected, MC0 and MC1 in the upper node (acceleration unit A1) are connected through port 0 to MC0 and MC1 of the lower node (acceleration unit A2), respectively; MC2 and MC3 of the upper node are connected through port 3 to MC2 and MC3 of the lower node, respectively. This node topology is called a Hybrid Cube Mesh; that is, the acceleration assembly 1405 forms a hybrid cube mesh topology.
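The hybrid cube mesh formed by the two interconnected nodes can be modeled as an 8-vertex graph: each unit is a fully connected quad, and each card also links to the identically numbered card of the other unit. A sketch verifying that any two of the 8 cards are at most two hops apart (the unit labels A1/A2 follow the figure; the graph is a simplified model that ignores the duplicated diagonal links):

```python
def hybrid_cube_mesh():
    """8-card adjacency: two fully connected quads plus same-number peer links."""
    adj = {(u, c): set() for u in ("A1", "A2") for c in range(4)}
    for u in ("A1", "A2"):                      # fully connected quad per unit
        for a in range(4):
            for b in range(4):
                if a != b:
                    adj[(u, a)].add((u, b))
    for c in range(4):                          # inter-unit peer links
        adj[("A1", c)].add(("A2", c))
        adj[("A2", c)].add(("A1", c))
    return adj

def diameter(adj):
    """Longest shortest-path distance over all card pairs."""
    def dist(a, b):
        seen, frontier, d = {a}, {a}, 0
        while b not in frontier:
            frontier = {n for f in frontier for n in adj[f]} - seen
            seen |= frontier
            d += 1
        return d
    return max(dist(a, b) for a in adj for b in adj)

print(diameter(hybrid_cube_mesh()))  # 2: any card reaches any other in <= 2 hops
```

A card reaches its three unit-mates and its peer in one hop, and every remaining card of the other unit in two, which is what makes the 8-card interconnection low-latency.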
In the 8-card topology shown in FIG. 19a, two independent rings can also be formed, as shown in FIG. 19b and FIG. 19c, which maximizes the use of the topology bandwidth for reduction operations.
In FIG. 19b, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 5, accelerator cards MC0 and MC2 are connected through their respective internal ports 5, and accelerator cards MC2 and MC3 are connected through their respective internal ports 1; meanwhile, accelerator card MC1 in acceleration unit A1 and accelerator card MC1 in acceleration unit A2 are connected through their respective external ports 0, and accelerator card MC0 in acceleration unit A1 and accelerator card MC0 in acceleration unit A2 are connected through their respective external ports 0. Thus, one independent ring is formed among the eight cards in FIG. 19b.
In FIG. 19c, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 2, accelerator cards MC0 and MC2 are connected through their respective internal ports 2, and accelerator cards MC0 and MC1 are connected through their respective internal ports 1; meanwhile, accelerator card MC2 in acceleration unit A1 and accelerator card MC2 in acceleration unit A2 are connected through their respective external ports 3, and accelerator card MC3 in acceleration unit A1 and accelerator card MC3 in acceleration unit A2 are connected through their respective external ports 3. Thus, another independent ring is formed among the eight cards in FIG. 19c.
Only two exemplary connection schemes are shown above. In fact, the four connection paths between the two acceleration units are equivalent; therefore, any one to three of these four paths may be used to connect the two acceleration units and to form a ring connection with the accelerator cards inside each acceleration unit. Details are not repeated here.
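The ring of FIG. 19b can be checked mechanically: walking the listed links from any card should visit all eight cards before returning to the start. This is a minimal sketch; A2's internal wiring is assumed to mirror A1's, which the paragraph above does not state explicitly.

```python
# Links of the FIG. 19b ring; A2's internal links (first three entries
# of the second row) are assumed to mirror A1's.
ring_b_links = [
    ("A1.MC1", "A1.MC3"), ("A1.MC0", "A1.MC2"), ("A1.MC2", "A1.MC3"),
    ("A2.MC1", "A2.MC3"), ("A2.MC0", "A2.MC2"), ("A2.MC2", "A2.MC3"),
    ("A1.MC1", "A2.MC1"), ("A1.MC0", "A2.MC0"),  # external ports 0
]

def traverse(links):
    """Walk the ring from an arbitrary card; return the cycle length."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    start = prev = "A1.MC0"
    cur = adj[start][0]
    steps = 1
    while cur != start:
        # Each card has exactly two neighbours; do not walk backwards.
        nxt = [n for n in adj[cur] if n != prev]
        prev, cur = cur, nxt[0]
        steps += 1
    return steps

print(traverse(ring_b_links))  # 8: one ring through all eight cards
```

The same check applies to the FIG. 19c ring by substituting its link list.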
FIG. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. As shown in FIG. 20, the acceleration device 2000 may include n of the above-described acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The acceleration units in the acceleration device 2000 are logically arranged in a multi-layer structure (shown with dotted lines in the figure); the number of layers may be odd or even, each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected through external ports to the accelerator cards of another acceleration unit. Specifically, acceleration unit A1 and acceleration unit A2 are connected through external ports, acceleration unit A2 and acceleration unit A3 are connected through external ports, and so on until acceleration unit An-1 and acceleration unit An are connected through external ports. Moreover, the last acceleration unit may be connected to the first acceleration unit, so that the multiple acceleration units are connected end to end to form a ring structure; for example, in the figure, the external port of accelerator card MC0 of acceleration unit An is connected to the external port of accelerator card MC0 of acceleration unit A1. This layer-by-layer configuration enables each accelerator card to share data over high-speed serial links while processing data at high speed, realizing unlimited interconnection of accelerator cards to satisfy customizable computing-power demands and enabling flexible configuration of the hardware computing power of a processor cluster.
It should be noted that there are various possibilities for the connection relationship among the acceleration units in the acceleration device of the present disclosure, which have been described in detail above; for details, reference may be made, for example, to the description of the connection relationship of the acceleration units in FIG. 13, which is not repeated here. In addition, there are various ways to connect the last acceleration unit to the first acceleration unit, which may specifically include one or more of the following: the external port of MC0 in acceleration unit A1 is connected to the external port of MC0 in An; the external port of MC1 in acceleration unit A1 is connected to the external port of MC1 in An; the external port of MC2 in acceleration unit A1 is connected to the external port of MC2 in An; and the external port of MC3 in acceleration unit A1 is connected to the external port of MC3 in An. For ease of understanding, exemplary descriptions will be given below with reference to FIG. 21 and FIG. 22. In the following description, those skilled in the art can understand that the acceleration devices shown in FIG. 21 and FIG. 22 are specific embodiments of the acceleration device 2000 shown in FIG. 20; therefore, the relevant description of the acceleration device 2000 of FIG. 20 is also applicable to the acceleration devices in FIG. 21 and FIG. 22.
Referring to FIG. 21, FIG. 21 is a schematic diagram of a network topology corresponding to the acceleration device in an embodiment. The acceleration device 2001 shown in FIG. 21 may be composed of four acceleration units; the circles represent accelerator cards and the lines represent link connections, with the number 0 in a circle denoting accelerator card MC0, the number 1 denoting MC1, the number 2 denoting MC2, and the number 3 denoting MC3; the numbers embedded in the vertical lines indicate the port numbers of the connections. The last acceleration unit is connected to the first acceleration unit, and the total number of hops is 5. Each acceleration unit is a node; through the interconnection of the nodes, a 4-node, 16-card interconnection can be realized. The four acceleration units form a small, internally interconnected cluster called a supercomputing cluster, or super pod. This topology is the primary form for ultra-large-scale clusters; it uses high-speed SerDes ports, the total number of hops is 5, and the latency is minimal. The manageability of the cluster is good, and so is its robustness.
Referring to FIG. 22, FIG. 22 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment. The difference between FIG. 22 and FIG. 21 is that the acceleration device 2002 shown in FIG. 22 has more acceleration units. It can be seen from the figure that the last acceleration unit of the acceleration device 2002 is connected to the first acceleration unit. With an acceleration device arranged in this way, the total number of hops is the number of nodes plus one, i.e., the number of acceleration units plus one.
An acceleration device including multiple acceleration units has been exemplarily described above with reference to FIG. 20 to FIG. 22. According to the technical solution of the present disclosure, an acceleration device that may include multiple of the aforementioned acceleration components is also provided, which will be described in detail below with reference to several embodiments.
FIG. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure; the acceleration system of the present disclosure may be implemented as one acceleration device. The acceleration device 3000 may include m of the aforementioned acceleration components. In each acceleration component, in addition to the external ports used for the connections between the acceleration units inside the component, there are idle external ports, and the acceleration components are connected to one another through these idle external ports. For example, the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B1 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B2; the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B2 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration component B3; and so on, so that the multiple acceleration components are connected to one another. It can be understood that the acceleration device shown in FIG. 23 is exemplary rather than limiting; for example, the structures of the multiple acceleration components may be the same or different. As another example, the manner in which different acceleration components are connected through idle external ports is not limited to that shown in FIG. 23 and may include other manners. For ease of understanding, exemplary descriptions will be given below with reference to FIG. 24 to FIG. 32.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 24, FIG. 24 is a schematic diagram of a network topology corresponding to the acceleration device in yet another embodiment. The acceleration device 3001 may include two acceleration components; acceleration component B1 may include four acceleration units, and acceleration component B2 may include four acceleration units. The first acceleration unit in acceleration component B1 is connected to the first acceleration unit in acceleration component B2, and the last acceleration unit in acceleration component B1 is connected to the last acceleration unit in acceleration component B2. The total number of hops under this network topology is 9. Those skilled in the art can understand that the network structure composed of multiple acceleration units in each acceleration component in FIG. 24 is logical, and in practical applications the arrangement positions of the acceleration units can be adjusted as needed. The number of acceleration units in each acceleration component is not limited to the four shown in the figure and may be set larger or smaller as needed, for example, six, eight, and so on.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 25, FIG. 25 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3002 may include four acceleration components, namely acceleration components B1, B2, B3, and B4. Each of the four acceleration components may include two acceleration units A1 and A2, and each acceleration component may be connected, through one of its acceleration units A1 and A2, to one of the acceleration units A1 and A2 of another acceleration component. For example, acceleration unit A1 in acceleration component B1 is connected to acceleration unit A1 in acceleration component B2; acceleration unit A1 in acceleration component B2 is connected to acceleration unit A1 in acceleration component B3; and acceleration unit A1 in acceleration component B3 is connected to acceleration unit A1 in acceleration component B4. All of these connections are made through the external ports of the acceleration units.
It should be noted that, besides the connection scheme shown in FIG. 25, there are many other possible connection schemes among the acceleration components. For example, the connections among the acceleration components may specifically include: acceleration unit A1 or A2 in acceleration component B1 is connected to acceleration unit A1 or A2 in acceleration component B2; acceleration unit A1 or A2 in acceleration component B2 is connected to acceleration unit A1 or A2 in acceleration component B3; and acceleration unit A1 or A2 in acceleration component B3 is connected to acceleration unit A1 or A2 in acceleration component B4.
Based on the acceleration device provided in FIG. 25, and further referring to FIG. 26, FIG. 26 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. In the acceleration device 3003 shown in FIG. 26, each acceleration component may be connected, through one of its first and second acceleration units and over two paths, to one of the first and second acceleration units of another acceleration component. For example, in the figure, the first acceleration unit (e.g., acceleration unit A1) in acceleration component B1 and the first acceleration unit (e.g., acceleration unit A1) in acceleration component B2 may be connected through two paths; acceleration unit A1 in acceleration component B2 and acceleration unit A1 in acceleration component B3 are connected through two paths; and acceleration unit A1 in acceleration component B3 and acceleration unit A1 in acceleration component B4 are connected through two paths.
It should be noted that FIG. 26 illustrates connections over two paths, but connections over more than two paths are also possible. Besides the connection scheme shown in FIG. 26, other schemes for connecting the acceleration components may be used; for example, acceleration unit A1 or A2 in acceleration component B1 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B2, acceleration unit A1 or A2 in acceleration component B2 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B3, and acceleration unit A1 or A2 in acceleration component B3 may be connected over two paths to acceleration unit A1 or A2 in acceleration component B4.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 27, FIG. 27 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3004 includes four acceleration components, namely acceleration component B1, acceleration component B2, acceleration component B3, and acceleration component B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards. In each acceleration unit, MC0 and MC1 form the first pair of accelerator cards, and MC2 and MC3 form the second pair. Specifically, the second pair of accelerator cards of acceleration unit A1 of acceleration component B1 is connected to the second pair of accelerator cards of acceleration unit A2 of acceleration component B2; the first pair of accelerator cards of acceleration unit A2 of acceleration component B2 is connected to the first pair of accelerator cards of acceleration unit A1 of acceleration component B3; the second pair of accelerator cards of acceleration unit A2 of acceleration component B3 is connected to the second pair of accelerator cards of acceleration unit A1 of acceleration component B4; and the first pair of accelerator cards of acceleration unit A1 of acceleration component B4 is connected to the first pair of accelerator cards of acceleration unit A2 of acceleration component B1.
Referring to FIG. 28, FIG. 28 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3005 shown in FIG. 28 is a specific form of the acceleration device 3004 shown in FIG. 27, so the above description of the acceleration device 3004 is also applicable to the acceleration device 3005 in FIG. 28. As shown in FIG. 28, each acceleration component of the acceleration device 3005 may form a hybrid cube mesh unit, and the interconnection inside each hybrid cube mesh unit may be as shown in the figure, realizing the 8-node, 32-card interconnection of the acceleration device 3005. The four acceleration components may be interconnected across multiple cards and multiple nodes through, for example, QSFP-DD interfaces and cables, forming a matrix network topology.
Specifically, in this embodiment, port 0 of accelerator cards MC2 and MC3 of the upper node of acceleration component B1 may be connected to accelerator cards MC2 and MC3, respectively, of the lower node of acceleration component B2; port 3 of MC0 and MC1 of the lower node of acceleration component B2 may be connected to MC0 and MC1, respectively, of the upper node of acceleration component B3; port 0 of MC2 and MC3 of the lower node of acceleration component B3 may be connected to MC2 and MC3, respectively, of the upper node of acceleration component B4; and port 3 of MC0 and MC1 of the upper node of acceleration component B4 may be connected to MC0 and MC1, respectively, of the lower node of acceleration component B1. The interconnection among the hybrid cube mesh units arranged in this way can form two bidirectional ring structures (as described above with reference to FIG. 12b, FIG. 12c, FIG. 19b, and FIG. 19c), which offers advantages such as good reliability and security, is suitable for deep-learning training, and provides high computing efficiency. For the matrix network topology composed of 8 nodes in the acceleration device 3005, the total number of hops is 11.
Further, as shown in FIG. 28, the first pair and the second pair of accelerator cards in different acceleration units of the same acceleration component may be connected indirectly. For example, in acceleration component B1, accelerator cards MC0 and MC1 of the upper-layer acceleration unit are indirectly connected to accelerator cards MC2 and MC3 of the lower-layer acceleration unit.
On the basis of the network topology of FIG. 28, taking the matrix network topology as a basic unit, the topology can be further expanded into a larger network topology. FIG. 29 is a schematic diagram of a matrix network topology based on unlimited expansion of the acceleration device. As shown in FIG. 29, the acceleration device 3006 may include multiple acceleration components; each acceleration component (shown as a block in the figure) may include multiple acceleration units (the three-dimensional view is not shown; reference may be made to the acceleration component structure of FIG. 28), and each acceleration unit may include, for example, the interconnection of four accelerator cards as illustrated. Therefore, this matrix network topology can, in theory, be expanded without limit.
Based on the acceleration device provided in FIG. 23, and further referring to FIG. 30, FIG. 30 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 3008 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components may be connected in a ring. Specifically, acceleration unit An of acceleration component B1 may be connected to acceleration unit A1 of acceleration component B2; acceleration unit An of acceleration component B2 may be connected to acceleration unit A1 of acceleration component B3; and so on up to acceleration component Bm, whose acceleration unit An may be connected to acceleration unit A1 of acceleration component B1, so that the m acceleration components are connected end to end in a ring.
Based on FIG. 30, and referring to FIG. 31, FIG. 31 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3009 may include 6 acceleration components; each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, forming a 12-node, 48-card interconnection and thus a larger matrix network topology. The total number of hops under this network topology is 13.
Based on FIG. 31, and referring to FIG. 32, FIG. 32 is a schematic diagram of a network topology of yet another acceleration device. The acceleration device 3010 includes 8 acceleration components; each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, forming a 16-node, 64-card interconnection and thus an even larger matrix network topology. The total number of hops under this network topology is 17.
On the basis of FIG. 32, the topology can be extended vertically without limit, forming ultra-large-scale matrix networks such as 20 nodes with 80 cards, 24 nodes with 96 cards, and so on. In theory, the expansion can continue indefinitely, and the total number of hops is the number of nodes plus one. By optimizing the interconnection among the nodes, the latency of the entire system can be minimized, satisfying the system's real-time requirements to the greatest extent while processing massive amounts of data.
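The hop counts quoted for the ring-connected configurations above (FIG. 21, FIG. 24, FIG. 31, FIG. 32 and the vertical extensions just described) all follow the stated rule "total hops = number of nodes + 1"; note that this rule does not cover the 8-node matrix topology of FIG. 28, whose stated hop count is 11. A small consistency check, assuming four accelerator cards per node as in the figures:

```python
# "Total hops = number of nodes + 1" for the ring-connected topologies.
def total_hops(num_nodes: int) -> int:
    return num_nodes + 1

# (nodes, cards, hops) pairs stated in the text; the 20- and 24-node
# hop counts are derived from the formula, not stated explicitly.
examples = [(4, 16, 5), (8, 32, 9), (12, 48, 13),
            (16, 64, 17), (20, 80, 21), (24, 96, 25)]
for nodes, cards, hops in examples:
    assert total_hops(nodes) == hops
    assert cards == nodes * 4  # four accelerator cards per node
print("ok")
```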
An acceleration device including multiple acceleration components has been exemplarily described above with reference to FIG. 23 to FIG. 32. Those skilled in the art can understand that the above description is exemplary rather than limiting; for example, the number and structure of the acceleration components and the connection relationships among them can all be adjusted as needed. Those skilled in the art may also combine the above embodiments to form an acceleration device as needed, which also falls within the protection scope of the present disclosure.
In addition, it should be noted that the accelerator card matrix, the fully connected square network (topology), the hybrid cube mesh network (topology), the matrix network (topology), and the like described in the present disclosure are all logical, and their specific physical layouts can be adjusted as needed.
The topologies disclosed in the present disclosure can also perform reduction operations on data. A reduction operation can be performed on each accelerator card, in each acceleration unit, and in the acceleration device. The specific operation steps may be as follows.
Taking a reduce-sum operation as an example, the reduction process performed within one acceleration unit may include: transferring the data stored in the first accelerator card to the second accelerator card, and, in the second accelerator card, adding the data originally stored there to the data received from the first accelerator card; next, transferring the addition result from the second accelerator card to the third accelerator card and performing another addition; and so on, until all the data stored in the accelerator cards have been summed and every accelerator card has received the final result.
Taking the acceleration unit shown in FIG. 11 as an example, accelerator card MC0 stores the data (0,0), accelerator card MC1 stores the data (1,2), accelerator card MC2 stores the data (3,1), and accelerator card MC3 stores the data (2,4). The data (0,0) in accelerator card MC0 can be transferred to accelerator card MC1, where an addition yields the result (1,2); next, the result (1,2) is transferred to accelerator card MC2, yielding the next result (4,3); then, that result (4,3) is transferred to accelerator card MC3, yielding the final result (6,7).
Thereafter, in the reduction operation of the present disclosure, the final result (6,7) is further propagated to each of the accelerator cards MC0, MC1, MC2, and MC3, so that all accelerator cards store the data (6,7), thereby completing the reduction operation within one acceleration unit.
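The reduce-then-broadcast steps above can be sketched as a minimal, illustrative simulation (the actual on-card implementation is not specified by this disclosure), using the worked example in which MC0 through MC3 hold (0,0), (1,2), (3,1), and (2,4):

```python
# Minimal sketch of the in-unit reduce-sum followed by a broadcast.
def ring_reduce_sum(cards):
    """Pass a running elementwise sum along the cards, then broadcast."""
    acc = list(cards[0])
    partials = []                      # intermediate sums, card by card
    for data in cards[1:]:
        acc = [a + d for a, d in zip(acc, data)]
        partials.append(tuple(acc))
    # Broadcast: every accelerator card stores the final result.
    return partials, [tuple(acc)] * len(cards)

cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0, MC1, MC2, MC3
partials, final = ring_reduce_sum(cards)
print(partials)   # [(1, 2), (4, 3), (6, 7)]
print(final[0])   # (6, 7), held by every card after the broadcast
```

The intermediate results (1,2), (4,3), (6,7) match the transfers MC0→MC1→MC2→MC3 described in the example.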
The acceleration unit shown in FIG. 11 can form two independent rings, and each ring can complete the reduction of half of the data, thereby accelerating the operation and improving computational efficiency.
In addition, when performing a reduction operation, the above acceleration unit can also have multiple accelerator cards compute concurrently, thereby speeding up the operation. For example, accelerator card MC0 stores the data (0,0), accelerator card MC1 stores (1,2), accelerator card MC2 stores (3,1), and accelerator card MC3 stores (2,4). Part of the data in MC0, namely (0), can be transferred to MC1, where an addition yields the result (1); simultaneously, part of the data in MC1, namely (2), is transferred to MC2, where an addition yields the result (3), thereby realizing concurrent operation of accelerator cards MC1 and MC2; and so on until the entire reduction is completed.
The above concurrent computation may also proceed in groups: a group of accelerator cards first performs its additions, and the result of that group is then reduced with the result of another group. For example, with accelerator card MC0 storing (0,0), MC1 storing (1,2), MC2 storing (3,1), and MC3 storing (2,4), the data in MC0 can be transferred to MC1 and summed to obtain the first group result (1,2); synchronously or asynchronously, the data in MC2 can be transferred to MC3 and summed to obtain the second group result (5,5). Next, the first group result and the second group result are summed to obtain the final reduction result (6,7).
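The grouped variant can be sketched the same way: MC0+MC1 and MC2+MC3 are summed in parallel, then the two group results are combined. The data and the expected intermediate results come directly from the example above; this is an illustrative model, not the on-card implementation.

```python
# Sketch of the grouped (pairwise) reduce-sum over four cards.
def pairwise_reduce_sum(cards):
    # The two groups can run concurrently on real hardware.
    group1 = tuple(a + b for a, b in zip(cards[0], cards[1]))  # MC0+MC1
    group2 = tuple(a + b for a, b in zip(cards[2], cards[3]))  # MC2+MC3
    final = tuple(a + b for a, b in zip(group1, group2))
    return group1, group2, final

g1, g2, final = pairwise_reduce_sum([(0, 0), (1, 2), (3, 1), (2, 4)])
print(g1, g2, final)  # (1, 2) (5, 5) (6, 7)
```

Compared with the purely sequential pass, this halves the depth of the addition chain, which is why grouping speeds up the reduction.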
Similarly, in addition to being performed within one acceleration unit, the reduction operation may also be performed within an acceleration component or an acceleration device. It should be understood that an acceleration device may also be regarded as acceleration components connected end to end.
Performing a reduction operation in an acceleration component or acceleration device may include: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the multiple acceleration units to obtain a second reduction result.
Again taking reduction summation as an example, the first step has been described above. For an acceleration device including multiple acceleration units, a local reduction operation may first be performed within each acceleration unit; after the reduction within each acceleration unit is completed, the accelerator cards in the same acceleration unit hold the result of the local reduction, referred to here as the first reduction result.
Next, the first reduction results of all acceleration units may be passed between adjacent acceleration units and added. Thus, similarly to the reduction performed within one acceleration unit, a first acceleration unit passes its first reduction result to a second acceleration unit; after the addition is performed in the accelerator cards of the second acceleration unit, the result is passed on and added again. After the final addition, the final result is propagated to every acceleration unit.
It should be pointed out that, since the acceleration components above are not necessarily connected end to end, the final result may be propagated to each acceleration unit in the reverse direction, rather than cyclically as when the acceleration units are connected end to end. The technical solution of the present disclosure places no specific limitation on how the final result is propagated.
Further, according to an embodiment of the present disclosure, the acceleration device may also be configured to perform a reduction operation including: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results of the multiple acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results of the multiple acceleration components to obtain a second reduction result.
In this embodiment, the reduction operation may first be performed within the same acceleration unit, which has been described above and will not be repeated here.
Next, a reduction operation may be performed within each acceleration component, so that every accelerator card in each acceleration component obtains the local reduction result of that component; then, taking acceleration components as units, a reduction operation is performed across the multiple acceleration components, so that every accelerator card obtains the global reduction result of the acceleration device.
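The two-level reduction above (a local first reduction inside each acceleration unit, then a second reduction across units) can be sketched as follows. The unit layout and data are hypothetical; element-wise summation stands in for the reduction operator.

```python
# Hypothetical sketch of the hierarchical reduction: a local (first)
# reduction inside each acceleration unit, then a global (second)
# reduction across the units of the device.
def reduce_sum(vectors):
    # Element-wise sum of a list of equal-length vectors.
    return [sum(col) for col in zip(*vectors)]

# Each inner list models one acceleration unit holding the data of its
# accelerator cards (example values, not part of the disclosure).
units = [
    [[0, 0], [1, 2]],   # acceleration unit 0
    [[3, 1], [2, 4]],   # acceleration unit 1
]

# First reduction: every card in a unit obtains the unit-local result.
first_results = [reduce_sum(unit) for unit in units]   # [[1, 2], [5, 5]]

# Second reduction: the unit-local results are combined across units,
# and the global result is then propagated back to every card.
second_result = reduce_sum(first_results)              # [6, 7]
print(second_result)
```

The same pattern extends to the three-level variant in the embodiment above by inserting an intermediate reduction across the units of one acceleration component before the final reduction across components.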
Various implementations of the accelerator card system have been described above; a more specific method of communication based on this accelerator card system is described below.
In the present disclosure, the communication task queue and the communication task execution queue are separate. Distinguishing the two allows operations such as fault tolerance or retransmission of tasks to be performed without the user being aware of them.
In the present disclosure, a communication task may be issued as an asynchronous task to any accelerator card in the accelerator card system, forming a communication task queue. Communication tasks in the same queue are executed sequentially. The execution of these communication tasks may be completed by another accelerator card, so that the communication task queue and the communication task execution queue reside on different accelerator cards.
Preferably, a communication task involving a relatively large amount of data may also be split into multiple communication tasks for execution. Communication tasks in the same queue are executed serially in the order in which they were issued, while tasks in different queues may be executed concurrently; thus, after a total communication task is divided into multiple sub-communication tasks executed in parallel, the execution efficiency of the communication task is greatly improved. For communication tasks such as Allreduce, data can be transmitted from one accelerator card to another over different communication paths; therefore, when a total communication task is divided into multiple sub-communication tasks executed in parallel, these sub-communication tasks may be executed over different communication paths.
Taking the accelerator card connections shown in Figures 12a-12c as an example, when data needs to be transferred from accelerator card MC1 to accelerator card MC3, the data may be transferred over the following communication paths:
1. As shown in Figure 12b, data may be transferred from port 1 of accelerator card MC1 to port 1 of accelerator card MC0, then from port 2 of accelerator card MC0 to port 2 of accelerator card MC2, and finally from port 1 of accelerator card MC2 to port 1 of accelerator card MC3;
2. As shown in Figure 12b, data may also be transferred directly from port 2 of accelerator card MC1 to port 2 of accelerator card MC3;
3. As shown in Figure 12c, data may be transferred from port 4 of accelerator card MC1 to port 4 of accelerator card MC2, then from port 5 of accelerator card MC2 to port 5 of accelerator card MC0, and finally from port 4 of accelerator card MC0 to port 4 of accelerator card MC3;
4. As shown in Figure 12c, data may also be transferred directly from port 5 of accelerator card MC1 to port 5 of accelerator card MC3.
It can thus be seen that, in the disclosed technical solution, communication between two accelerator cards may proceed over multiple communication paths; in other words, communication between two accelerator cards may proceed over different topologies. Accordingly, when a total communication task is divided into multiple sub-communication tasks, each sub-communication task may be executed over a different communication path.
It should be understood that the description above in conjunction with Figures 12a-12c is merely a simple example. When the accelerator card system includes more accelerator cards, the communication paths become more complex and diverse, so that the total communication task may be divided into a greater number of sub-communication tasks.
Multiple status flags may be set in the communication task queue. These status flags can monitor the execution of communication tasks and can also control the execution of other communication tasks. The execution of a communication task changes a status flag, and a change of a status flag correspondingly changes the execution of other communication tasks. These status flags are described in more detail below.
According to an embodiment of the present disclosure, although the communication task queue may be loaded onto any accelerator card, it may preferably be loaded onto a lightly loaded accelerator card. It should be understood that when multiple accelerator cards all participate in computation and communication, the load of each card may differ, and a lightly loaded card may preferably be selected to carry the communication task queue. The accelerator card can receive communication tasks from the host or from other accelerator cards, form a queue, and control the execution of the tasks in the queue. This approach helps make full use of accelerator card resources and improves the overall operating efficiency of the system.
Figure 36a shows a flowchart of a method for executing a communication task according to an embodiment of the present disclosure; Figure 36b shows a schematic diagram of a task issuing queue and a communication task execution queue according to an embodiment of the present disclosure.
As shown in Figure 36a, according to an embodiment of the present disclosure, the method of the present disclosure further includes: in operation S3610, dividing a total communication task in the communication task queue into multiple sub-communication tasks, each sub-communication task being placed in a different communication task execution queue; in operation S3620, executing the multiple sub-communication tasks in parallel over different communication paths; and in operation S3630, in response to the sub-communication tasks having finished executing, causing the total communication task to finish executing.
The above method is described in detail below in conjunction with Figure 36b.
Figure 36b includes two types of queues, namely the communication task queue LQ and the communication task execution queues PQ. The communication task queue can receive multiple total communication tasks, for example total communication tasks A, B, and C. When these total communication tasks A, B, and C enter the communication task queue LQ, they are combined serially with execution order A, B, C. That is, while communication task A is executing, communication tasks B and C must wait; communication task B can execute only after communication task A has finished, and communication task C must wait for communication task B to finish before it can execute. Such an execution mode cannot make full use of the parallel resources of the system; in particular, when the execution time of one communication task is especially long or its data volume is especially large, the execution of the other communication tasks is noticeably blocked and system performance suffers.
A communication task in the communication task queue LQ may be regarded as a total communication task and divided into multiple sub-communication tasks executed in parallel, which are placed in the communication task execution queues PQ for execution. When a total communication task is divided into multiple sub-communication tasks executed in parallel, the execution efficiency of the task can be significantly improved.
In the present disclosure, taking total communication task B as an example, total communication task B may be divided into multiple sub-communication tasks b1, b2, and so on; two sub-communication tasks b1 and b2 are used here for illustration. It should be noted that the number of sub-communication tasks may be different, and may be determined according to the topology of the accelerator card system. For example, the more communication paths there are from one accelerator card to another, the more sub-communication tasks the total communication task may be divided into, and conversely, the fewer; likewise, the larger the amount of data involved in the total communication task, the more sub-communication tasks it may be divided into.
After total communication task B is divided into sub-communication tasks b1 and b2, and these sub-communication tasks are placed in different communication task execution queues PQ1 and PQ2 respectively, the two sub-communication tasks b1 and b2 may be executed in parallel in the communication task execution queues PQ1 and PQ2.
The execution of total communication task B and its sub-communication tasks must satisfy the following rules: 1. when total communication task B has not yet started executing, sub-communication tasks b1 and b2 should also be in the not-started state; 2. when total communication task B starts executing, sub-communication tasks b1 and b2 should also start executing; 3. other tasks after task B in the communication task queue LQ (for example, C) must wait until task B has finished before they can execute; and 4. when sub-communication tasks b1 and b2 have all finished executing, total communication task B should also finish executing.
Figure 37a shows a flowchart of dividing a total communication task in the communication task queue into multiple sub-communication tasks according to an embodiment of the present disclosure.
Thus, according to an embodiment of the present disclosure, dividing a total communication task in the communication task queue into multiple sub-communication tasks (S3610) includes: in operation S36110, setting a first write flag that allows the total communication task to start executing; in operation S36120, setting a first wait flag that prohibits the sub-communication tasks from starting to execute; and, in operation S36130, when the first write flag has not been executed, executing the first wait flag to prohibit the sub-communication tasks from starting to execute.
Figure 37b shows a schematic diagram of inserting flags into a queue according to an embodiment of the present disclosure. The specific implementation of Figure 37a is described in detail below in conjunction with Figure 37b.
First, to control the execution of communication task B, a write flag must be set; in other words, a write flag must be inserted before the communication task to be executed, here denoted F0 by way of example. Only when execution reaches the write flag F0, or when the write flag F0 is changed to allow execution of the next task, does the subsequent communication task B start executing. If the write flag F0 has not been executed, the corresponding communication task does not start. The write flag may be inserted by means of an atomic operation. An atomic operation is an operation that cannot be interrupted by the thread scheduling mechanism; once started, it runs to completion without any intervening context switch.
Correspondingly, a wait flag f0 may be inserted before each sub-communication task; the wait flag prohibits execution of the sub-communication task that follows it. It should be understood that, although the first write flag F0 and the wait flag f0 in Figure 37b are given different names, the write flag F0 and the wait flag f0 point to the same flag, so as to detect whether that flag has changed. It should also be understood that the insertion positions of the first write flag F0 and the wait flag f0 shown in Figure 37b are merely for ease of understanding; the flags are not necessarily inserted into the sub-communication tasks as shown in Figure 37b.
According to an embodiment of the present disclosure, executing the multiple sub-communication tasks in parallel includes: in response to the first write flag being executed, clearing the first wait flag, so that the multiple sub-communication tasks are executed in parallel.
The write flag F0 preceding the total communication task and the wait flag f0 are associated: only when the write flag F0 allows execution of the subsequent total communication task does the wait flag f0 end and the corresponding sub-communication tasks start to run; if the write flag F0 does not allow execution of the subsequent total communication task, the wait flag f0 keeps the execution of the sub-communication tasks in a waiting state.
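The relationship between the write flag F0 and the wait flag f0 can be sketched with a shared synchronization object: the sub-tasks block on the flag, and executing the write flag releases them all at once. This is a minimal model of the assumed semantics, not the disclosed driver interface; `threading.Event` stands in for the shared flag that both F0 and f0 point to.

```python
# Minimal sketch of the F0/f0 mechanism: sub-communication tasks b1 and b2
# block on the shared flag (the wait side, f0) until the write side (F0)
# is executed, after which both run in parallel.
import threading

f0 = threading.Event()          # the single shared flag behind F0 and f0
results = []
lock = threading.Lock()

def sub_task(name):
    f0.wait()                   # f0: do not start until the flag changes
    with lock:
        results.append(name)    # stand-in for the actual communication work

workers = [threading.Thread(target=sub_task, args=(n,)) for n in ("b1", "b2")]
for w in workers:
    w.start()

# Executing the write flag F0 changes the shared flag, ending the wait;
# both sub-tasks now proceed concurrently.
f0.set()
for w in workers:
    w.join()
print(sorted(results))
```

Before `f0.set()` is called, both worker threads are parked in `f0.wait()`, which models rule 1 above (sub-tasks not started while the total task has not started).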
Figure 38 shows a schematic diagram of queues according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, a second wait flag may be set to prohibit execution of the other communication tasks that follow the total communication task.
As shown in Figure 38, a second wait flag may be inserted after the first write flag. When execution reaches this second wait flag, the other total communication tasks after the current total communication task must be in a waiting state; before the current total communication task has finished executing, the other total communication tasks cannot start.
From the above description it can be seen that when execution reaches the first write flag F0, the total communication task B corresponding to the first write flag F0 starts to execute, that is, the sub-communication tasks b1 and b2 of total communication task B end their waiting state and begin executing; thereafter, when execution reaches the second wait flag F1, the other tasks after total communication task B enter a waiting state and are not executed while total communication task B is executing.
Figure 39 shows a schematic diagram of the second wait flag being modified according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, each time a sub-communication task finishes executing, the second wait flag F1 is modified, until all sub-communication tasks have finished; and in response to all sub-communication tasks having finished, the second wait flag F1 is modified into a wait-end flag, so that the total communication task finishes executing.
Next, as shown in Figure 39, the sub-communication tasks b1 and b2 begin executing in the execution queues PQ. Each time a sub-communication task b1 or b2 finishes, the second wait flag F1 may be modified accordingly, for example incremented by one. The number of times the second wait flag F1 is modified equals the number of sub-communication tasks that have finished executing. Therefore, the second wait flag F1 may initially be given a target value; as the sub-communication tasks b1 and b2 finish executing, the second wait flag F1 gradually approaches this target value, and when the second wait flag F1 reaches the preset target value, all sub-communication tasks b1 and b2 have finished. It should be understood that the second wait flag F1 may be modified in many ways and is not limited to the increment-by-one described above; for example, it may be decremented by one on each modification until it falls below a predetermined threshold. The present disclosure places no limitation on how the second wait flag is modified.
The condition that the second wait flag F1 reaches the target value may also be understood as a wait-end flag, meaning that the current total communication task B has finished executing and other tasks may begin.
It should be understood that in Figures 37b, 38, and 39, although the flag f0 is shown within the sub-communication tasks, this is merely for ease of understanding; the flag f0 actually resides in the communication task queue, so as to monitor the execution of the individual sub-communication tasks.
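The completion counting described above can be sketched as a counter with a preset target value. The class and method names are hypothetical; the point is only the increment-until-target semantics of the second wait flag F1.

```python
# Sketch of the second wait flag F1 as a completion counter: it is
# incremented once per finished sub-communication task, and the total
# task is deemed finished when the counter reaches its preset target.
class SecondWaitFlag:
    def __init__(self, target):
        self.value = 0           # initial state of F1
        self.target = target     # preset target value

    def sub_task_done(self):
        self.value += 1          # one modification per finished sub-task

    def total_task_done(self):
        # Reaching the target acts as the wait-end flag.
        return self.value >= self.target

f1 = SecondWaitFlag(target=2)    # total task B has sub-tasks b1 and b2
f1.sub_task_done()               # b1 finishes
assert not f1.total_task_done()  # B is still running, so task C must wait
f1.sub_task_done()               # b2 finishes
print(f1.total_task_done())      # B is complete; C may start
```

The decrement variant mentioned above would simply start `value` at the target and count down toward the threshold instead.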
When dividing the total communication task into multiple sub-communication tasks, various division schemes are possible: the total communication task may be divided randomly into multiple sub-communication tasks; it may be divided into a fixed number of sub-communication tasks; or it may be divided, according to the number of communication paths, into a number of sub-communication tasks corresponding to the number of processors, and so on.
According to a preferred embodiment of the present disclosure, a total communication task in the task queue may be divided into multiple sub-communication tasks of equivalent execution time.
The equivalence of execution time described above does not mean that each sub-communication task is itself the same size. For example, if the communication speed of each port is 40 Gbps, then for 160 G of data, transmitting it over a single communication path theoretically takes 4 seconds. The 160 G of data may therefore be split into multiple sub-communication tasks, for example 2, 3, or 4. When split into 4 sub-communication tasks, 4 communication paths can be used to transmit the data in parallel, so that theoretically only 1 second is needed to complete the 160 G transfer, and the communication time is only 25% of the original. Clearly, this helps shorten the execution time of the total communication task.
Furthermore, the communication paths do not necessarily all have the same transmission speed; the division of the communication task may therefore take the speed of each path into account in adjusting the amount of data assigned to each sub-communication task. For example, if the transmission speeds of four communication paths are 16 Gbps, 18 Gbps, 22 Gbps, and 24 Gbps, then 160 G of data may be split into four sub-communication tasks of 32 G, 36 G, 44 G, and 48 G respectively, so that each path completes its transfer in 2 seconds, thereby ensuring that all paths finish their sub-communication tasks at the same time or substantially the same time.
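The bandwidth-proportional split above follows directly from assigning each path a share of the data proportional to its speed, which equalizes the transfer times. A short sketch of that arithmetic, using the example figures from the text:

```python
# Split a transfer proportionally to per-path bandwidth so that every
# path finishes at (approximately) the same time.
def split_by_bandwidth(total_gb, speeds_gbps):
    total_speed = sum(speeds_gbps)
    return [total_gb * s / total_speed for s in speeds_gbps]

speeds = [16, 18, 22, 24]                      # Gbps, example values
chunks = split_by_bandwidth(160, speeds)
print(chunks)                                  # [32.0, 36.0, 44.0, 48.0]

# Each path then needs the same transfer time (in the text's units,
# where a 40 Gbps port moves 160 G in 4 seconds): chunk / speed = 2.
times = [c / s for c, s in zip(chunks, speeds)]
print(times)                                   # [2.0, 2.0, 2.0, 2.0]
```

With equal-sized chunks instead, the 16 Gbps path would take 2.5 seconds while the 24 Gbps path finishes in about 1.67 seconds, so the total task would be gated by the slowest path; the proportional split removes that imbalance.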
Different communication paths may correspond to different topologies, as described in Figures 12a-12c.
According to an embodiment of the present disclosure, the total communication task may be divided into multiple sub-communication tasks in response to the data volume of the total communication task exceeding a certain threshold. It should be understood that dividing a total communication task into multiple sub-communication tasks also requires considering the total amount of data involved in the task; if the total amount of data involved in a given total communication task is small, dividing it is unnecessary.
According to an embodiment of the present disclosure, the method of the present disclosure further includes: in response to an error occurring in one or more sub-communication tasks, re-running the sub-communication task in which the error occurred.
When multiple sub-communication tasks are executed in the communication task execution queues PQ, errors may occur, such as a transmission failure caused by a data line, a data throughput error, or packet loss during data transmission. In a traditional scheme in which the total communication task is not divided into sub-communication tasks, once an error occurs during execution, the entire total communication task must be re-executed, which seriously wastes processing capacity and degrades overall system performance.
In the solution of the present disclosure, since the multiple sub-communication tasks reside in different execution queues that run independently without interfering with one another, an error occurring in one sub-communication task does not affect the execution of the others. Therefore, if an error occurs in the execution of one sub-communication task, only that sub-communication task needs to be re-executed, without re-running all the sub-communication tasks or the total communication task as a whole. While the errored sub-communication task is re-running, the other queues may be idle or may execute other sub-communication tasks concurrently. Dividing a total communication task into multiple parallel sub-communication tasks thus improves the utilization of system processing resources and improves processing efficiency.
According to an embodiment of the present disclosure, in response to an error occurring in one or more sub-communication tasks, the sub-communication task in which the error occurred is further split into multiple subtasks for parallel execution.
When an error occurs in a sub-communication task and it must be re-executed, the errored sub-communication task may be added to the communication task queue LQ as a new total communication task, further divided into multiple subtasks, and re-executed once in multiple parallel execution queues PQ. Further dividing the errored sub-communication task into multiple subtasks for re-execution further improves the operating efficiency of the system, so that even when an error occurs in the execution of a sub-communication task, the time and processing resources spent correcting it are greatly reduced.
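The retry-only-the-failed-part behavior described above can be sketched as follows. Everything here is illustrative: `run_sub_task` is a hypothetical stand-in for a real communication call, and a single injected failure models a transient transmission error.

```python
# Sketch of partial retransmission: sub-tasks run independently, and only
# a failed sub-task is re-issued; successful sub-tasks are never re-run.
_failed_once = {"b2"}   # hypothetical: sub-task b2 fails on its first attempt

def run_sub_task(name):
    # Stand-in for an actual transfer; returns False on a transient error.
    if name in _failed_once:
        _failed_once.discard(name)
        return False
    return True

def execute(sub_tasks):
    done = set()
    pending = list(sub_tasks)
    attempts = 0
    while pending:
        task = pending.pop(0)
        attempts += 1
        if run_sub_task(task):
            done.add(task)
        else:
            pending.append(task)    # re-issue only the failed sub-task
    return done, attempts

done, attempts = execute(["b1", "b2"])
print(done, attempts)               # b1 ran once, b2 ran twice
```

In the traditional scheme the single error would cost a full re-run of the whole task; here b1's completed transfer is preserved, and in the embodiment above the re-issued b2 could itself be split further across queues.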
It can thus be seen that, based on the above solution of the present disclosure, when the amount of communication data is large, a communication task can be split into any number of sub-tasks that are dispatched to multiple different communication-task execution queues for concurrent execution, thereby improving bandwidth utilization.
Further, different communication algorithms can be selected according to the maximum bandwidth of the specific topological connection between chips, further optimizing the communication efficiency for large data volumes. A logical communication topology can thus be flexibly constructed for data communication according to the physical topological connection between the accelerator cards, further improving communication efficiency.
The sub-communication tasks executed in each communication-task execution queue can also be supervised independently: if a sub-communication task fails, only that sub-communication task needs to be re-issued or re-executed, without re-executing the entire communication task. Partial retransmission of a communication task can therefore be performed transparently to the user, reducing the cost of fault tolerance and retransmission and improving overall communication efficiency.
The accelerator card that issues the task may be any accelerator card in the accelerator card system. According to an embodiment of the present disclosure, since the communication task queue only performs wait and write operations while the actual communication tasks run in the communication-task execution queues, the communication task queue can correspond to any accelerator card in the system. This helps reduce the probability of developer programming errors, and an accelerator card with a light workload can be selected to perform the wait and write control of the communication task queue.
The present disclosure also provides an electronic device, comprising: one or more processors; and a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions which, when run by one or more processors, perform the method described above.
The technical solutions of the present disclosure can be applied in the field of artificial intelligence and implemented as, or in, an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
FIG. 33 is a schematic structural diagram of a combined processing apparatus according to an embodiment of the present disclosure.
FIG. 33 shows a combined processing apparatus 3300, which includes the aforementioned acceleration unit 3301, an interconnection interface 3302, another processing apparatus 3303, and a storage apparatus 3304. The computing apparatus according to the present disclosure interacts with the other processing apparatus to jointly complete an operation specified by the user.
The other processing apparatus includes one or more processor types among general-purpose/special-purpose processors such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor; the number of processors it includes is not limited. The other processing apparatus serves as the interface between the machine learning computing apparatus and external data and control, performing basic control such as data transfer and starting and stopping the machine learning computing apparatus; the other processing apparatus can also cooperate with the machine learning computing apparatus to complete computing tasks together.
The interconnection interface is used to transfer data and control instructions between the computing apparatus (including, for example, a machine learning computing apparatus) and the other processing apparatus. The computing apparatus obtains the required input data from the other processing apparatus and writes it into an on-chip storage apparatus of the computing apparatus; it can obtain control instructions from the other processing apparatus and write them into an on-chip control cache; and it can also read data from its own storage module and transfer the data to the other processing apparatus.
Optionally, the structure may further include a storage apparatus 3304, which is connected to the computing apparatus and the other processing apparatus, respectively. The storage apparatus is used to save data of the computing apparatus and the other processing apparatus, and is especially suitable for data that cannot be fully held in the internal storage of the computing apparatus or the other processing apparatus.
The combined processing apparatus can serve as an SoC (system-on-chip) for devices such as mobile phones, robots, drones, and video surveillance equipment, effectively reducing the core area of the control portion, increasing processing speed, and lowering overall power consumption. In this case, the interconnection interface of the combined processing apparatus is connected to certain components of the device, such as a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface.
In some embodiments, the present disclosure also discloses a chip package structure, which includes the above-mentioned chip.
In some embodiments, the present disclosure also discloses a board card, which includes the above chip package structure. Referring to FIG. 34, an exemplary board card 3400 is provided. In addition to the above-mentioned chip, the board card 3400 may further include other supporting components, including but not limited to: a storage device 3401, an interface apparatus 3407, a control device 3405, and an acceleration unit 3406.
The storage device is connected to the chip in the chip package structure through a bus and is used for storing data. The storage device may include multiple groups of storage units 3402, and each group of storage units is connected to the chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR doubles the speed of SDRAM without requiring a higher clock frequency: it allows data to be read out on both the rising edge and the falling edge of the clock pulse, making DDR twice as fast as standard SDRAM. In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 granules (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of storage units includes multiple double-data-rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice within one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
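As an illustration of the arithmetic implied by such a 72-bit controller (64 data bits plus 8 ECC bits, two transfers per clock), the peak payload bandwidth of one controller can be computed as follows; the 1.6 GHz clock is an assumed example, not a figure from the disclosure.

```python
def ddr_effective_bandwidth(clock_hz, bus_bits=72, ecc_bits=8):
    """Peak payload bandwidth in bytes/s for one DDR controller.

    DDR transfers data on both clock edges, so each cycle moves two beats;
    only the data lanes (bus_bits - ecc_bits) carry payload, the rest is ECC.
    """
    data_bits = bus_bits - ecc_bits          # 64 of the 72 lanes carry data
    return clock_hz * 2 * data_bits // 8     # 2 beats/cycle, 8 bits/byte

# Example with an assumed 1.6 GHz DDR4 clock (3200 MT/s):
bw = ddr_effective_bandwidth(1_600_000_000)
print(bw / 1e9, "GB/s")  # → 25.6 GB/s per controller
```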
The interface apparatus is electrically connected to the chip in the chip package structure and is used to implement data transmission between the chip and an external device 3408 (for example, a server or a computer). For example, in one embodiment, the interface apparatus may be a standard PCIe interface: data to be processed is transferred by the server to the chip through the standard PCIe interface. In another embodiment, the interface apparatus may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can implement the transfer function. In addition, the computation results of the chip are transmitted back to the external device (for example, a server) by the interface apparatus.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). Since the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, it may drive multiple loads and can therefore be in different working states such as heavy-load and light-load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
The electronic device or apparatus includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a dashboard camera, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, a headset, mobile storage, a wearable device, a vehicle, a household appliance, and/or medical equipment.
The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood; the medical equipment includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
It should be noted that, for the sake of brevity, the foregoing method embodiments are all expressed as a series of combinations of actions, but those skilled in the art should understand that the present disclosure is not limited by the described order of actions, because according to the present disclosure, certain steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, optical, acoustic, magnetic, or in other forms.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used herein to illustrate the principles and implementations of the present disclosure. The descriptions of the above embodiments are only intended to help understand the method of the present disclosure and its core ideas. At the same time, persons of ordinary skill in the art, based on the ideas of the present disclosure, may make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (16)

  1. A method for executing an asynchronous task, comprising:
    dividing a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue;
    executing the plurality of sub-tasks in parallel; and
    in response to the sub-tasks being completely executed, causing the total task to be completely executed.
  2. The method according to claim 1, wherein dividing a total task in the task queue into a plurality of sub-tasks comprises:
    inserting, in the queue, a first write flag that allows the total task to start executing;
    inserting, in the sub-task queue, a first wait flag that prohibits the sub-task from starting to execute;
    when the first write flag has not been executed, executing the first wait flag so as to prohibit the sub-task from starting to execute.
  3. The method according to claim 2, wherein executing the plurality of sub-tasks in parallel comprises:
    in response to the first write flag being executed, turning off the first wait flag, thereby executing the plurality of sub-tasks in parallel.
  4. The method according to any one of claims 1-3, further comprising: inserting a second wait flag in the total task queue so as to prohibit execution of other tasks following the total task.
  5. The method according to claim 4, further comprising:
    each time a sub-task is completely executed, modifying the second wait flag, until all sub-tasks are completely executed;
    in response to all sub-tasks being completely executed, modifying the second wait flag into a wait-end flag, so that the total task is completely executed.
  6. The method according to any one of claims 1-5, wherein a total task in the task queue is divided into a plurality of sub-tasks of equivalent execution time.
  7. The method according to any one of claims 1-6, wherein, in response to the data amount of the total task exceeding a specific threshold, the total task is divided into a plurality of sub-tasks.
  8. The method according to any one of claims 1-7, further comprising: in response to one or more sub-tasks failing, re-running the failed sub-task.
  9. The method according to any one of claims 1-8, wherein, in response to one or more sub-tasks failing, the failed sub-task is further split into a plurality of child tasks so as to facilitate parallel execution.
  10. The method according to any one of claims 1-9, wherein the task queue is a communication task queue.
  11. An apparatus for executing an asynchronous task, comprising:
    a dividing unit configured to divide a total task in a task queue into a plurality of sub-tasks, each sub-task being placed in a different sub-task queue;
    a sub-task execution unit configured to execute the plurality of sub-tasks in parallel; and
    an ending unit configured to, in response to the sub-tasks being completely executed, cause the total task to be completely executed.
  12. A chip, comprising the apparatus according to claim 11.
  13. An electronic device, comprising the chip according to claim 12.
  14. A method for executing a communication task in an accelerator card system, wherein the accelerator card system includes a plurality of accelerator cards capable of communicating with one another, and one accelerator card among the plurality of accelerator cards is capable of communicating with another accelerator card through a communication path, the method comprising:
    establishing a communication task queue, the communication task queue including a communication task and a state identifier for monitoring the execution state of the communication task;
    establishing a communication-task execution queue for executing the communication task between accelerator cards through the communication path;
    in response to the execution of the communication task, changing the state identifier so as to monitor the execution state of the communication task.
  15. An electronic device, comprising:
    one or more processors; and
    a memory storing computer-executable instructions which, when run by the one or more processors, cause the electronic device to perform the method according to any one of claims 1-10 and 14.
  16. A computer-readable storage medium, comprising computer-executable instructions which, when run by one or more processors, perform the method according to any one of claims 1-10 and 14.
PCT/CN2021/138702 2020-12-30 2021-12-16 Method for executing asynchronous task, device, and computer program product WO2022143194A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011610670.7A CN114691311A (en) 2020-12-30 2020-12-30 Method, device and computer program product for executing asynchronous task
CN202011610670.7 2020-12-30
CN202110055097.6 2021-01-15
CN202110055097.6A CN114764374A (en) 2021-01-15 2021-01-15 Method and equipment for executing communication task in accelerator card system

Publications (1)

Publication Number Publication Date
WO2022143194A1 true WO2022143194A1 (en) 2022-07-07

Family

ID=82260239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138702 WO2022143194A1 (en) 2020-12-30 2021-12-16 Method for executing asynchronous task, device, and computer program product

Country Status (1)

Country Link
WO (1) WO2022143194A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339944A (en) * 2023-03-14 2023-06-27 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106358003A (en) * 2016-08-31 2017-01-25 华中科技大学 Video analysis and accelerating method based on thread level flow line
CN109086138A (en) * 2018-08-07 2018-12-25 北京京东金融科技控股有限公司 Data processing method and system
CN109828833A (en) * 2018-11-02 2019-05-31 上海帆一尚行科技有限公司 A kind of queuing system and its method of neural metwork training task
CN111090511A (en) * 2019-12-24 2020-05-01 北京推想科技有限公司 Task processing method and device and computer readable storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116339944A (en) * 2023-03-14 2023-06-27 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium
CN116339944B (en) * 2023-03-14 2024-05-17 海光信息技术股份有限公司 Task processing method, chip, multi-chip module, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US11809360B2 (en) Network-on-chip data processing method and device
TW201805858A (en) Method for performing neural network computation and apparatus
WO2012130134A1 (en) Computer system
WO2022143194A1 (en) Method for executing asynchronous task, device, and computer program product
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN115994107B (en) Access acceleration system of storage device
CN114764374A (en) Method and equipment for executing communication task in accelerator card system
WO2023040197A1 (en) Cross-node communication method and apparatus, device, and readable storage medium
CN117687956B (en) Multi-acceleration-card heterogeneous server and resource link reconstruction method
CN117493237B (en) Computing device, server, data processing method, and storage medium
CN111767995A (en) Operation method, device and related product
EP4141685A1 (en) Method and device for constructing communication topology structure on basis of multiple processing nodes
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
CN114691311A (en) Method, device and computer program product for executing asynchronous task
CN110413564A (en) AI trains inference service device, system and method
US20210182110A1 (en) System, board card and electronic device for data accelerated processing
WO2022057600A1 (en) Acceleration unit, acceleration assembly, acceleration device, and electronic device
US12050545B2 (en) Method and device for constructing communication topology structure on basis of multiple processing nodes
CN212846786U (en) Accelerating unit and electronic equipment
CN111258732A (en) Data processing method, data processing device and electronic equipment
CN111210011B (en) Data processing device and related product
CN212846785U (en) Acceleration assembly, acceleration device and electronic equipment
CN105207823B (en) A kind of network topology structure
WO2024119869A1 (en) Method for executing inter-chip communication task, and related product
CN111767999A (en) Data processing method and device and related products

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21913935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21913935

Country of ref document: EP

Kind code of ref document: A1