CN114185831A - Accelerating unit and electronic equipment


Info

Publication number: CN114185831A
Application number: CN202010969307.8A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: not disclosed (不公告发明人)
Applicant/Assignee: Anhui Cambricon Information Technology Co Ltd
Legal status: Pending
Priority to CN202010969307.8A (CN114185831A)
Priority to PCT/CN2021/115229 (WO2022057600A1)
Priority to US18/003,670 (US20240028553A1)
Publication of CN114185831A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G06F 15/167: Interprocessor communication using a common memory, e.g. mailbox
    • G06F 15/173: Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Multi Processors (AREA)

Abstract

The present disclosure relates to an acceleration unit, an acceleration assembly, an acceleration device, and an electronic apparatus. The acceleration unit may be included in a combined processing device that may also include an interconnect interface and other processing devices. The acceleration unit interacts with the other processing devices to jointly complete a computing operation specified by the user. The combined processing device may further comprise a storage device connected to the acceleration unit and to the other processing devices, respectively, to provide data services for them. With the aid of the present disclosure, high-speed processing of mass data is achieved.

Description

Accelerating unit and electronic equipment
Technical Field
The present disclosure relates generally to the field of processor technology. More particularly, the present disclosure relates to an acceleration unit, an acceleration assembly, an acceleration device, a circuit board, and an electronic apparatus.
Background
Currently, with the rapid development of Artificial Intelligence (AI) and Machine Learning, the demand for ultra-high-performance processors will continue to grow, and the big data era places ever higher demands on data processing. High-performance processors and clusters need to process mass data in real time and complete the training and inference of complex models within a specified time. An ASIC (Application Specific Integrated Circuit) is a dedicated acceleration chip that can be used to train deep neural networks. An ASIC can complete such work in a shorter time while using far less data center infrastructure than a non-parallel-processing supercomputer.
However, when a very large amount of data is encountered, even a powerful single ASIC inevitably falls short, so a common scheme for obtaining stronger computing power is to deploy a plurality of ASIC acceleration chips. For multi-card networks formed by interconnecting a plurality of ASICs, however, the ultra-high data throughput poses a significant challenge to the data transmission bandwidth of the ASICs. Therefore, how to design an interconnection scheme among the chips so as to improve the computing power of the whole system and efficiently process mass data has become a key technical problem in constructing high-performance processor clusters.
Disclosure of Invention
In order to solve the above technical problem, the present disclosure provides an acceleration unit, an acceleration assembly, an acceleration device, a circuit board, and an electronic apparatus capable of improving computing power.
In one aspect, the present disclosure provides an accelerator unit comprising M local unit accelerator cards, each local unit accelerator card comprising an internal connection port, each local unit accelerator card being connected to other local unit accelerator cards through the internal connection port, wherein the M local unit accelerator cards are logically formed as an accelerator card matrix of size L x N, L and N being integers not less than 2.
In yet another aspect, the present disclosure provides an electronic device including the acceleration unit as described above.
In the disclosed scheme, the acceleration unit is composed of a plurality of accelerator cards, and each accelerator card is connected to the other accelerator cards through its internal connection ports to realize interconnection among the accelerator cards. This effectively improves the computing capacity of the acceleration unit and helps increase the speed of processing mass data. In addition, for the acceleration assembly and the acceleration device, the interconnection among the acceleration units minimizes the latency of the whole system and satisfies, to the greatest extent, the system's real-time requirements while processing mass data, which helps improve the computing capacity of the whole system and enables it to process mass data at high speed.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar or corresponding parts:
FIG. 1a is a schematic diagram illustrating a structure of an acceleration unit according to an embodiment of the present disclosure;
FIG. 1b, FIG. 2, FIG. 3, FIG. 4 and FIGS. 5a to 5c are schematic structural diagrams of an acceleration unit according to embodiments of the disclosure;
FIGS. 6-11 are various schematic structural views of an acceleration assembly according to embodiments of the disclosure;
FIGS. 12a-12c are schematic diagrams of an acceleration assembly represented as a network topology;
FIG. 13 is a schematic view of an acceleration device including multiple acceleration units according to an embodiment of the present disclosure;
FIG. 14 is a diagram illustrating a network topology corresponding to an acceleration device in one embodiment;
FIG. 15 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment;
FIGS. 16-20 are schematic diagrams of an acceleration device including acceleration assemblies according to embodiments of the present disclosure;
FIG. 21 is a schematic diagram of a network topology of yet another acceleration device;
FIG. 22 is a schematic diagram of a matrix network topology based on unlimited expansion of an acceleration device;
FIG. 23 is a schematic view of an accelerator apparatus according to yet another embodiment of the disclosure;
FIG. 24 is a schematic diagram of a network topology of yet another acceleration device;
FIG. 25 is a schematic diagram of a network topology of yet another acceleration device;
FIG. 26 is a schematic view of a combination device according to an embodiment of the present disclosure;
fig. 27 is a schematic structural diagram of a circuit board in an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments disclosed herein without creative effort shall fall within the scope of protection of the present disclosure.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1a is a schematic diagram illustrating a structure of an acceleration unit according to an embodiment of the present invention. According to one embodiment of the present disclosure, there is provided an accelerator unit including M local unit accelerator cards, each local unit accelerator card including an internal connection port, each local unit accelerator card being connected to other local unit accelerator cards through the internal connection port, wherein the M local unit accelerator cards are logically formed as an accelerator card matrix of L × N scale, L and N being integers not less than 2.
As shown in fig. 1a, an accelerator card matrix may be formed by a plurality of accelerator cards, which are connected to each other to enable the transfer of data or commands and communication. For example, accelerator cards MC00 to MC0N form row 0 of the accelerator card matrix, accelerator cards MC10 to MC1N form row 1 of the accelerator card matrix, and so on, up to accelerator cards MCL0 to MCLN, which form row L of the accelerator card matrix.
It is to be understood that, for ease of understanding, an accelerator card in the same acceleration unit is referred to as a "present unit accelerator card", and accelerator cards in other acceleration units are referred to as "external unit accelerator cards". Such designations are merely for convenience of description and do not limit the technical aspects of the present disclosure.
Each accelerator card may have a plurality of ports, and these ports may be connected to the accelerator card of the present unit, or may be connected to an accelerator card of an external unit. In the present disclosure, a connection port between the present unit accelerator cards may be referred to as an internal connection port, and a connection port between the present unit accelerator cards and the external unit accelerator cards may be referred to as an external connection port. It is to be understood that the external port and the internal port are merely for convenience of description, and the same port may be used for both. This will be described below.
It is to be understood that M may be any integer: M accelerator cards may be formed into a matrix of 1 x M or M x 1, or into other matrix shapes. The acceleration units of the present disclosure are not limited to a particular matrix size or form.
Furthermore, accelerator cards (whether two present unit accelerator cards, or a present unit accelerator card and an external unit accelerator card) can be connected through a single communication path or through multiple communication paths. This will be described in detail later.
It should also be understood that, although the positions of the accelerator cards are described in terms of a rectangular network in the context of the present disclosure, the matrix so formed need not correspond to the physical spatial arrangement. The accelerator cards may be placed in any positions; for example, they may be arranged in a straight line or irregularly. The matrix described above is only logical, as long as the connections between the accelerator cards form a matrix relationship.
According to one embodiment of the present disclosure, M may be 4, whereby 4 present-unit accelerator cards may be logically formed into a 2 x 2 accelerator card matrix; M may be 9, whereby 9 present-unit accelerator cards may be logically formed into a 3 x 3 accelerator card matrix; M may be 16, whereby 16 present-unit accelerator cards may be logically formed into a 4 x 4 accelerator card matrix. M may also be 6, whereby 6 present-unit accelerator cards may be logically formed into a 2 x 3 or 3 x 2 accelerator card matrix; M may also be 8, whereby 8 present-unit accelerator cards may be logically formed into a 2 x 4 or 4 x 2 accelerator card matrix.
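By way of illustration only, and not as part of the disclosed embodiments, the following Python sketch enumerates the logical matrices mentioned above; the card names simply follow the row labels of fig. 1a.

# Illustrative sketch only: index M present-unit accelerator cards as a logical
# L x N matrix, mirroring the row labels of fig. 1a (MC00..MC0N form row 0, etc.).
def card_matrix(rows, cols):
    return [[f"MC{r}{c}" for c in range(cols)] for r in range(rows)]

for rows, cols in [(2, 2), (3, 3), (4, 4), (2, 3), (2, 4)]:   # M = 4, 9, 16, 6, 8
    grid = card_matrix(rows, cols)
    for r, row in enumerate(grid):
        print(f"{rows}x{cols} matrix, row {r}: {row}")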
According to one embodiment of the present disclosure, each of the unit accelerator cards is connected to at least one other of the unit accelerator cards via two paths.
In the topology described in the present disclosure, two accelerator cards of the present unit may be connected through a single communication path, or may be connected through multiple (e.g., two) paths, as long as the number of ports is sufficient. The connection via multiple communication paths facilitates securing the reliability of the communication between the accelerator cards, as will be explained and described in more detail in the examples below.
According to one embodiment of the present disclosure, the present-unit accelerator cards located at diagonally opposite corners of the accelerator card matrix are connected by two paths. For a matrix, it may be preferable to connect the two pairs of accelerator cards at opposite corners, and for some topologies, connecting the accelerator cards at diagonal positions helps form two complete communication loops. This will be explained and described in more detail in the examples below.
More specifically, according to one embodiment of the present disclosure, at least one of the unit accelerator cards may include an external port. For example, each acceleration unit may include four present unit acceleration cards, each present unit acceleration card may include six ports, and four ports of each present unit acceleration card are internal ports for connecting with three other present unit acceleration cards; the other two ports of at least one acceleration card of the unit are external ports and are used for being connected with an external unit acceleration card.
It should be understood that four ports of the six ports of each accelerator card of the present unit may be used to connect the accelerator card of the present unit, and the two ports left free may be used to connect the accelerator cards of the other accelerator units. These spare ports may also be free ports, not connected to any external device, or connected directly or indirectly to other devices or ports.
For purposes of example and simplicity, the acceleration unit, acceleration assembly, acceleration device, and electronic apparatus are described below with each acceleration unit including four accelerator cards. It is to be understood that each acceleration unit may include a greater or lesser number of accelerator cards.
For convenience of description, the acceleration unit may include four accelerator cards, that is, a first accelerator card, a second accelerator card, a third accelerator card, and a fourth accelerator card, where each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other three accelerator cards through the internal port.
Fig. 1b is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure. The acceleration unit 100 includes four accelerator cards, namely accelerator card MC0, accelerator card MC1, accelerator card MC2, and accelerator card MC3. Each accelerator card may include an external port and an internal port. The internal port of accelerator card MC0 is connected with the internal ports of accelerator cards MC1, MC2, and MC3; the internal port of accelerator card MC1 is connected with the internal ports of accelerator cards MC2 and MC3; and the internal port of accelerator card MC2 is connected with the internal port of accelerator card MC3. That is, the internal port of each accelerator card is connected with the internal ports of the other three accelerator cards. Information interaction among the four accelerator cards can be realized through the interconnection of their internal ports. By interconnecting the four accelerator cards in the acceleration unit, the embodiment of the disclosure improves the computing power of the acceleration unit, achieves high-speed processing of mass data, and makes the path between each accelerator card and the other accelerator cards shortest with the lowest communication delay.
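As a purely illustrative aid rather than a definitive implementation, the following Python sketch models the internal wiring of fig. 1b as an adjacency set and checks that every pair of accelerator cards is reachable in a single hop.

# Illustrative sketch only: the fully-connected quad of fig. 1b as an adjacency set.
from itertools import combinations

CARDS = ["MC0", "MC1", "MC2", "MC3"]
# The internal port of each card is connected to the internal ports of the other three.
LINKS = {frozenset(pair) for pair in combinations(CARDS, 2)}   # 6 undirected links

# In a fully-connected quad, every pair of cards communicates in a single hop.
assert all(frozenset(pair) in LINKS for pair in combinations(CARDS, 2))
print(f"{len(LINKS)} internal links; every pair of accelerator cards is one hop apart")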
As described above, the number of accelerator cards in the present disclosure is not limited to four and may be another number. For example, in one embodiment, the number of accelerator cards is equal to 3, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected with the other two accelerator cards through its internal port to realize interconnection among the three accelerator cards. In another embodiment, the number of accelerator cards is equal to 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected with the other four accelerator cards through its internal port to realize interconnection among the five accelerator cards, thereby improving the computing power of the acceleration unit and realizing high-speed processing of mass data. In another embodiment, the number of accelerator cards is greater than 5, each accelerator card is provided with an internal port and an external port, and each accelerator card is connected with all other accelerator cards through its internal port, so that interconnection among the accelerator cards is realized and mass data can be processed at high speed.
Based on the acceleration unit 100 provided in fig. 1b, further, each acceleration card and at least one other acceleration card may be connected through two paths. Specifically, there may be, for example, three connection modes: the first connection mode is that each accelerator card can be connected with one of the other three accelerator cards through two paths; the second way is that each accelerator card can be connected with two accelerator cards in the other three accelerator cards through two paths; a third way is that each accelerator card can be connected to three other accelerator cards by two paths, in which case it is not excluded that there are more ports per accelerator card. To facilitate understanding of the connection manner between the two paths, the first connection manner will be taken as an example and is exemplarily described with reference to fig. 2.
Fig. 2 is a schematic structural diagram of an acceleration unit according to another embodiment of the present disclosure. In the acceleration unit 200 shown in fig. 2, each accelerator card and at least one other accelerator card may be connected by two paths, for example, the illustrated accelerator card MC0 and accelerator card MC2 may be connected by two paths, and the illustrated accelerator card MC1 and accelerator card MC3 may be connected by two paths. According to the arrangement, two links (or paths) for information interaction between the two accelerator cards can be provided, so that when one link fails, the other link is connected between the two accelerator cards, and the safety of the accelerator unit can be effectively improved.
While the foregoing describes exemplary connections among the multiple accelerator cards of an acceleration unit according to the present disclosure in conjunction with figs. 1 and 2, it will be understood by those skilled in the art that the foregoing description is exemplary rather than limiting. For example, the arrangement of the accelerator cards in the acceleration unit is not limited to the forms shown in figs. 1 and 2; in one embodiment, the four accelerator cards of the acceleration unit may be logically arranged in a quadrilateral, as will be described below in conjunction with fig. 3.
FIG. 3 is a schematic diagram of an acceleration unit according to another embodiment of the disclosure. In the acceleration unit 300 shown in FIG. 3, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically arranged in a quadrilateral, with the four accelerator cards occupying the four vertex positions of the quadrilateral. The lines among accelerator cards MC0, MC1, MC2, and MC3 form a quadrilateral, which makes the wiring clearer and easier to arrange. It should be noted that the four accelerator cards shown in FIG. 3 are arranged in a rectangular or 2 x 2 matrix, but this is a logical interconnection diagram; for convenience of description they are drawn in rectangular form, and the specific quadrilateral may be chosen freely, such as a parallelogram, trapezoid, or square. In the actual layout and wiring, the four accelerator cards may be arranged arbitrarily; for example, in an actual complete machine, the four accelerator cards may be arranged in a line in the order MC0, MC1, MC2, MC3. It should be further understood that the logical quadrilateral described in this embodiment is exemplary; in fact, the arrangement shape of multiple accelerator cards may vary widely, and the quadrilateral is only one possibility. For example, when the number of accelerator cards is five, the accelerator cards may be logically arranged in a pentagon.
Based on the connection relationship of the acceleration unit 200 provided in fig. 2, reference is further made to fig. 4, which is a schematic structural diagram of an acceleration unit in another embodiment of the present disclosure. In the acceleration unit 400 shown in FIG. 4, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically arranged in a quadrilateral, with the four accelerator cards occupying the four vertex positions of the quadrilateral. As further shown, the connection between the internal port of accelerator card MC1 and the internal port of accelerator card MC3 may consist of two paths, and the connection between the internal port of accelerator card MC0 and the internal port of accelerator card MC2 may consist of two paths. For the acceleration unit 400, therefore, not only is the wiring convenient to arrange, but the safety is also improved.
Fig. 5a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure. In the acceleration unit 500 shown in fig. 5a, the number labels on each acceleration card represent ports, and each acceleration card may include six ports, i.e., port 0, port 1, port 2, port 3, port 4, and port 5. The port 1, the port 2, the port 4 and the port 5 are internal ports, and the port 0 and the port 3 are external ports. For four accelerator cards MC0, MC1, MC2, and MC3, 2 external ports of each accelerator card may be connected to other accelerator units for interconnection among the plurality of accelerator units. The 4 internal ports of each accelerator card can be used to interconnect with the other three accelerator cards in the present accelerator unit.
As further shown in fig. 5a, four accelerator cards may be logically arranged in a quadrilateral, for example, accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship, port 2 of MC0 is connected with port 2 of MC2, and port 5 of MC0 is connected with port 5 of MC2, i.e., there may be two links between accelerator card MC0 and accelerator card MC2 for communication. The accelerator card MC1 and the accelerator card MC3 may be in a diagonal relationship, port 2 of MC1 is connected with port 2 of MC3, and port 5 of MC1 is connected with port 5 of MC3, i.e. there may be two links between the accelerator card MC1 and the accelerator card MC3 for communication.
According to this arrangement, each accelerator card is provided with two external ports and four internal ports, and in the two pairs of accelerator cards in a diagonal relationship, the two accelerator cards of each pair can be connected using two internal ports to form two links, which effectively improves the safety and stability of the acceleration unit. Moreover, since the four accelerator cards are logically arranged in a quadrilateral, the circuit layout of the whole acceleration unit is reasonable and clear, which facilitates the wiring operation within each acceleration unit. It should be further noted that, among the interconnection lines between the four accelerator cards, the connection between port 1 of accelerator card MC1 and port 1 of MC0, the connection between port 2 of accelerator card MC0 and port 2 of MC2, the connection between port 1 of accelerator card MC2 and port 1 of MC3, and the connection between port 2 of accelerator card MC3 and port 2 of MC1 form an upright figure-eight network, as shown in fig. 5b. The connection between port 4 of accelerator card MC1 and port 4 of MC2, the connection between port 5 of accelerator card MC2 and port 5 of MC0, the connection between port 4 of accelerator card MC0 and port 4 of MC3, and the connection between port 5 of accelerator card MC3 and port 5 of MC1 form a transverse figure-eight network, as shown in fig. 5c. Together, these two networks of the fully-connected quad form a double-ring structure, providing redundancy backup and enhancing system reliability.
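To make the double-ring structure concrete, the following sketch (illustration only; the port pairings are transcribed from the description of figs. 5a to 5c above) verifies that each figure-eight wiring closes into a ring over all four cards.

# Illustrative sketch only: the internal port wiring of figs. 5a-5c.
# Each entry is ((card, port), (card, port)); ports 1, 2, 4, 5 are internal.
UPRIGHT_EIGHT = [  # fig. 5b
    (("MC1", 1), ("MC0", 1)), (("MC0", 2), ("MC2", 2)),
    (("MC2", 1), ("MC3", 1)), (("MC3", 2), ("MC1", 2)),
]
TRANSVERSE_EIGHT = [  # fig. 5c
    (("MC1", 4), ("MC2", 4)), (("MC2", 5), ("MC0", 5)),
    (("MC0", 4), ("MC3", 4)), (("MC3", 5), ("MC1", 5)),
]

def is_four_card_ring(links):
    # Build card-level adjacency and walk it: a ring must visit all four cards once.
    adj = {}
    for (a, _), (b, _) in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    if len(adj) != 4 or any(len(n) != 2 for n in adj.values()):
        return False
    start = next(iter(adj))
    seen, prev, cur = {start}, None, start
    while True:
        cur, prev = next(n for n in adj[cur] if n != prev), cur
        if cur == start:
            return len(seen) == 4
        seen.add(cur)

assert is_four_card_ring(UPRIGHT_EIGHT) and is_four_card_ring(TRANSVERSE_EIGHT)
print("both figure-eight wirings close into four-card rings (the double-ring structure)")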
According to an embodiment of the present disclosure, the accelerator card of the present disclosure may be a Mezzanine Card (MC card), which may be a single circuit board. The MC card can carry an ASIC chip and the necessary peripheral control circuits. The MC card may be connected to the substrate by a snap-fit connector, through which power and control signals on the substrate are transmitted to the MC card. According to another embodiment of the present disclosure, the internal port and/or the external port described in the present disclosure may be a SerDes port. For example, in one embodiment, each MC card may provide 6 bidirectional SerDes ports, each SerDes port having 8 lanes with a data transmission rate of 56 Gbps per lane, so that the total bandwidth of each port may be as high as 400 Gbps, which can support massive data exchange between accelerator cards and helps the acceleration unit process mass data at high speed.
SerDes, as mentioned above, is a compound of Serializer and De-Serializer. A SerDes interface may be used to build a high-performance processor cluster. The main function of SerDes is to convert multiple low-speed parallel signals into a serial signal at the transmitting end, transmit the serial signal over the transmission medium, and convert the high-speed serial signal back into low-speed parallel signals at the receiving end, which makes it well suited to end-to-end, long-distance, high-speed transmission. In another embodiment, the external port of an accelerator card can be connected to a QSFP-DD interface of another acceleration unit, where the QSFP-DD interface is an optical module interface commonly used with SerDes technology and can be used, together with a cable, for interconnection with other external devices.
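Assuming the quoted 56 Gbps is the per-lane rate, so that the 400 Gbps per-port figure presumably reflects line-coding and protocol overhead on the 448 Gbps raw aggregate, the following back-of-the-envelope sketch reproduces the bandwidth arithmetic; it is illustrative only.

# Illustrative arithmetic only, based on the figures quoted above (8 lanes per
# SerDes port at an assumed 56 Gbps per lane, 6 ports per MC card).
LANES_PER_PORT = 8
LANE_RATE_GBPS = 56
PORTS_PER_CARD = 6

raw_port_gbps = LANES_PER_PORT * LANE_RATE_GBPS       # 448 Gbps raw aggregate per port
raw_card_gbps = raw_port_gbps * PORTS_PER_CARD        # 2688 Gbps raw aggregate per card
print(f"per port: {raw_port_gbps} Gbps raw (quoted usable figure: about 400 Gbps)")
print(f"per card: {raw_card_gbps} Gbps raw across {PORTS_PER_CARD} ports")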
Further, according to another embodiment of the present disclosure, 4 accelerator cards can be mounted inside one acceleration unit, and the interconnection of the 4 accelerator cards can be completed using a printed circuit board (PCB). On a high-speed board with a low dielectric constant, signal integrity can be guaranteed to the greatest extent through reasonable layout and routing, so that the communication bandwidth among the four accelerator cards approaches the theoretical value.
In the acceleration unit disclosed herein, each of the four accelerator cards is connected with the other three accelerator cards through its internal ports, so each accelerator card can communicate directly with the other three. With this fully-connected network topology (fully-connected quad) as the communication architecture, the path between each accelerator card and the other accelerator cards is shortest, the total Hop count is minimum, and the delay is minimum. The present disclosure describes the time delay of the system in terms of Hop, which represents the number of hops in the communication, i.e., the number of communications; specifically, it is the length of the shortest path that starts from an initial node, traverses all nodes in the network, and returns to the initial node. With the 4 accelerator cards interconnected, the resulting fully-connected quad network topology has the shortest delay, and the double-ring structure formed by interconnecting the two pairs of diagonal accelerator cards improves the robustness of the system, so that the service can still run normally when a single accelerator card fails. When various arithmetic and logic operations are performed, each ring in the double-ring structure can complete a part of the operations, which improves the overall operation efficiency and maximizes the utilization of the topology bandwidth.
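The Hop metric just defined can be evaluated mechanically. The following sketch, an illustration only that assumes a uniform cost of one hop per link, confirms that the shortest closed tour over the fully-connected quad is 4 hops.

# Illustrative sketch only: the Hop metric defined above (shortest closed tour that
# starts at one card, traverses all cards, and returns), evaluated on the quad.
from itertools import combinations, permutations

CARDS = ["MC0", "MC1", "MC2", "MC3"]
LINKS = {frozenset(pair) for pair in combinations(CARDS, 2)}   # fully connected

def tour_hops(order):
    closed = list(order) + [order[0]]
    if any(frozenset(closed[i:i + 2]) not in LINKS for i in range(len(order))):
        return None                       # tour uses a link that does not exist
    return len(order)                     # one hop per traversed link

best = min(h for p in permutations(CARDS) if (h := tour_hops(p)) is not None)
print("minimum closed-tour Hop count of the fully-connected quad:", best)   # 4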
Embodiments of an acceleration unit according to the present disclosure have been described above with reference to figs. 1a to 5c. Based on the above-described acceleration unit, an acceleration assembly is also disclosed that may comprise a plurality of such acceleration units, and exemplary embodiments of the acceleration assembly will be described below.
FIG. 6 is a schematic diagram of an acceleration assembly according to an embodiment of the present disclosure. As shown in fig. 6, the acceleration assembly 600 may include n acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An, where acceleration unit A1 is connected with acceleration unit A2 through an external port, and acceleration unit A2 is connected with acceleration unit A3 through an external port; that is, the acceleration units are connected with each other through their external ports. In one embodiment, the external port of accelerator card MC0 in acceleration unit A1 may be connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 may be connected to the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected via the external ports of their accelerator cards MC0.
It will be appreciated by those skilled in the art that the connections between the acceleration units in the present disclosure are not limited to connecting the external ports of accelerator cards MC0; they may also include, for example, one or more of connecting the external ports of accelerator cards MC1, connecting the external ports of accelerator cards MC2, and connecting the external ports of accelerator cards MC3. That is, in the present disclosure, the connection manner between acceleration unit A1 and acceleration unit A2 may include: the external port of MC0 in A1 connected with the external port of MC0 in A2, the external port of MC1 in A1 connected with the external port of MC1 in A2, the external port of MC2 in A1 connected with the external port of MC2 in A2, and the external port of MC3 in A1 connected with the external port of MC3 in A2. Similarly, the connection of acceleration unit A2 and acceleration unit A3 may include: the external port of MC0 in A2 connected with the external port of MC0 in A3, the external port of MC1 in A2 connected with the external port of MC1 in A3, the external port of MC2 in A2 connected with the external port of MC2 in A3, and the external port of MC3 in A2 connected with the external port of MC3 in A3; and so on, up to the connection of acceleration unit An-1 to acceleration unit An. It should be noted that the above description is exemplary; for example, the connections between different acceleration units are not limited to accelerator cards with corresponding reference numbers, and may instead be set, as required, as connections between accelerator cards with non-corresponding reference numbers.
It should be noted that fig. 6 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 and may also be set to, for example, 2 or 3. The connection relationship between two acceleration units is the same as or similar to that between acceleration units A1 and A2, and the connection relationship among three acceleration units is the same as or similar to that among acceleration units A1, A2, and A3, which will not be repeated here.
In addition, the structures of the plurality of acceleration units in the acceleration assembly may be the same or different; for convenience of illustration, fig. 6 shows the acceleration units as having the same structure, but in practice their structures may differ. For example, in some acceleration units the accelerator cards are laid out as a polygon, while in others they are laid out in a line; in some acceleration units the accelerator cards are connected by a single link, while in others they are connected by two links; some acceleration units include four accelerator cards, while others include three or five. That is, the structure of each acceleration unit can be set independently, and the structures of different acceleration units may be the same or different.
The acceleration assembly disclosed herein not only interconnects the accelerator cards inside each acceleration unit, but also interconnects the accelerator cards of different acceleration units, so that a hybrid three-dimensional network can be constructed. With this arrangement, each accelerator card can process data and share it through the interconnection among the acceleration units, and shared data can be obtained directly, which shortens the data propagation path and time and improves data processing efficiency.
FIG. 7 is a schematic view of an acceleration assembly according to another embodiment of the present disclosure. As shown in fig. 7, the acceleration assembly 700 may include n acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. Thus, through the layer-by-layer progressive configuration and combination, each accelerator card can share data over high-speed serial links while processing data at high speed, unlimited interconnection of accelerator cards is realized, customizable computing power requirements are satisfied, and flexible configuration of the hardware computing power of the processor cluster is realized. As further shown in the figure, the acceleration unit of each layer may include four accelerator cards, and each acceleration unit may be logically arranged as a quadrilateral, with the four accelerator cards placed at the four vertex positions of the quadrilateral.
It should be understood by those skilled in the art that the acceleration assembly described above in connection with fig. 7 is exemplary rather than limiting. For example, the structures of the plurality of acceleration units may be the same or different. The number of layers of the acceleration assembly can be 2, 3, 4, or more than 4, and can be set freely as needed. The number of connection paths between each two connected acceleration units can be 1, 2, 3, or 4. For ease of understanding, an exemplary description follows in conjunction with figs. 8 to 12.
FIG. 8 is a schematic view of an acceleration assembly according to yet another embodiment of the present disclosure. As shown in fig. 8, the acceleration assembly 701 may include 2 acceleration units connected through one path; specifically, the connection between acceleration unit A1 and acceleration unit A2 may be implemented, for example, by connecting an external port of accelerator card MC0 in acceleration unit A1 with an external port of accelerator card MC0 in acceleration unit A2.
As shown in fig. 9, the acceleration assembly 702 may include 2 acceleration units connected through two paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2. Thus, when one path fails, the other path still supports communication between the acceleration units, which further improves the safety of the acceleration assembly.
Referring now to fig. 10, fig. 10 is a schematic diagram of an acceleration assembly according to another embodiment of the present disclosure. As shown in fig. 10, the acceleration assembly 703 may include 2 acceleration units connected by three paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2. Thus, even if two paths fail, the remaining path supports communication between the acceleration units, further improving the safety of the acceleration assembly.
Referring now to fig. 11, fig. 11 is a schematic diagram of an acceleration assembly according to another embodiment of the present disclosure. In the acceleration assembly 704 shown in fig. 11, the 2 acceleration units may be connected by four paths: for example, the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2, the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2, and the external port of accelerator card MC3 in acceleration unit A1 is connected to the external port of accelerator card MC3 in acceleration unit A2. Thus, even if three paths fail, the remaining path supports communication between the acceleration units, further improving the safety of the acceleration assembly.
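As a purely illustrative check, not part of the disclosed embodiments, the following sketch confirms that with four parallel inter-unit paths the two acceleration units remain connected under any zero to three path failures; the card and unit names follow the figure labels.

# Illustrative sketch only: with the four parallel inter-unit paths of fig. 11,
# acceleration units A1 and A2 remain connected even if any three paths fail.
from itertools import combinations

INTER_UNIT_PATHS = [("A1.MC0", "A2.MC0"), ("A1.MC1", "A2.MC1"),
                    ("A1.MC2", "A2.MC2"), ("A1.MC3", "A2.MC3")]

def still_connected(failed_paths):
    return any(path not in failed_paths for path in INTER_UNIT_PATHS)

for failures in range(len(INTER_UNIT_PATHS)):          # 0, 1, 2 or 3 failed paths
    assert all(still_connected(set(combo))
               for combo in combinations(INTER_UNIT_PATHS, failures))
print("any zero to three failed paths still leave A1 and A2 connected")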
FIG. 12a is a schematic diagram of an acceleration assembly represented as a network topology. As shown in fig. 12a, the acceleration assembly 705 may include two acceleration units, each of which may include four accelerator cards; in each acceleration unit there may be two links between accelerator card MC1 and accelerator card MC3 and two links between accelerator card MC0 and accelerator card MC2. The acceleration assembly 705 in the left diagram of fig. 12a may be drawn in the three-dimensional form shown in the right diagram. In the right diagram of fig. 12a, circles represent accelerator cards and lines represent link connections; numeral 0 represents accelerator card MC0, numeral 1 represents accelerator card MC1, numeral 2 represents accelerator card MC2, and numeral 3 represents accelerator card MC3. The right diagram thus shows the acceleration assembly 705 expressed, in another form, as a network topology. The numbers embedded in the vertical lines in the right diagram indicate the port numbers of the connections; for example, MC0, MC1, MC2, and MC3 in the two acceleration units are connected through port 0, port 0, port 3, and port 3, respectively.
In the right diagram of fig. 12a, each acceleration unit is regarded as one node, and the two nodes together have 8 accelerator cards, i.e., the two nodes constitute a so-called 8-card interconnect. The one-machine-four-card interconnection relationship inside each node is fixed; when the two nodes are interconnected, MC0 and MC1 in the upper node (namely acceleration unit A1) are respectively connected with MC0 and MC1 of the lower node (namely acceleration unit A2) through port 0, and MC2 and MC3 of the upper node are respectively connected to MC2 and MC3 of the lower node through port 3. This node topology is called a Hybrid Cube Mesh network topology, i.e., the acceleration assembly 705 is a hybrid cube mesh.
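For illustration only, the following sketch assembles the 8-card hybrid cube mesh of fig. 12a as a set of card-to-card links; the "unit.card" naming is introduced here purely for the sketch.

# Illustrative sketch only: the 8-card Hybrid Cube Mesh of fig. 12a.
# Card names take the form "<unit>.<card>"; the inter-unit wiring follows the text
# (MC0/MC1 linked through port 0, MC2/MC3 linked through port 3).
from itertools import combinations

def internal_links(unit):
    cards = [f"{unit}.MC{i}" for i in range(4)]
    return {frozenset(pair) for pair in combinations(cards, 2)}   # fully-connected quad

links = internal_links("A1") | internal_links("A2")
links |= {frozenset({f"A1.MC{i}", f"A2.MC{i}"}) for i in range(4)}  # inter-node links

print(f"hybrid cube mesh: 8 cards, {len(links)} undirected card-to-card links")  # 16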
In the 8-card topology shown in fig. 12a, two separate rings may also be formed, which allows the topology bandwidth to be used to the maximum for reduction operations, as shown in figs. 12b and 12c.
In fig. 12b, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 5, accelerator cards MC0 and MC2 are connected through their respective internal ports 5, and accelerator cards MC2 and MC3 are connected through their respective internal ports 1; accelerator card MC1 in acceleration unit A1 and accelerator card MC1 in acceleration unit A2 are connected through their respective single external ports 0, and accelerator card MC0 in acceleration unit A1 and accelerator card MC0 in acceleration unit A2 are connected through their respective single external ports 0. Thus, one independent loop is formed among the 8 cards of fig. 12b.
In fig. 12c, accelerator cards MC1 and MC3 in acceleration unit A1 are connected through their respective internal ports 2, accelerator cards MC0 and MC2 are connected through their respective internal ports 2, and accelerator cards MC0 and MC1 are connected through their respective internal ports 1; accelerator card MC2 in acceleration unit A1 and accelerator card MC2 in acceleration unit A2 are connected through their respective single external ports 3, and accelerator card MC3 in acceleration unit A1 and accelerator card MC3 in acceleration unit A2 are connected through their respective single external ports 3. Thus, another independent loop is formed among the 8 cards of fig. 12c.
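The following sketch is an illustrative check that the wirings of figs. 12b and 12c each close into a single ring over all 8 cards. It assumes that acceleration unit A2 mirrors the internal wiring listed for acceleration unit A1, which the text does not state verbatim.

# Illustrative sketch only, checking that the wirings of figs. 12b and 12c each close
# into one ring over all 8 cards. The text lists unit A1's internal links and the two
# inter-unit links; unit A2 is assumed here to mirror A1's internal wiring.
def mirrored_ring(internal_pairs, inter_unit_cards):
    edges = []
    for unit in ("A1", "A2"):
        edges += [(f"{unit}.{a}", f"{unit}.{b}") for a, b in internal_pairs]
    edges += [(f"A1.{card}", f"A2.{card}") for card in inter_unit_cards]
    return edges

RING_12B = mirrored_ring([("MC1", "MC3"), ("MC0", "MC2"), ("MC2", "MC3")], ["MC0", "MC1"])
RING_12C = mirrored_ring([("MC1", "MC3"), ("MC0", "MC2"), ("MC0", "MC1")], ["MC2", "MC3"])

def is_single_ring(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    if any(len(neighbours) != 2 for neighbours in adj.values()):
        return False
    start = next(iter(adj))
    seen, prev, cur = {start}, None, start
    while True:
        cur, prev = next(n for n in adj[cur] if n != prev), cur
        if cur == start:
            return len(seen) == len(adj)   # the walk must have visited every card
        seen.add(cur)

assert is_single_ring(RING_12B) and is_single_ring(RING_12C)
print("figs. 12b and 12c each form an independent ring over all 8 accelerator cards")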
Only two exemplary connection schemes are shown above, but in practice the four connection paths between the two acceleration units are effectively equivalent, so any one to three of these four paths may be used to connect the two acceleration units and, together with the accelerator cards within each acceleration unit, form a loop connection. This will not be described in further detail here.
FIG. 13 is a schematic view of an acceleration device according to yet another embodiment of the disclosure. As shown in fig. 13, the acceleration device 800 may include n acceleration units, namely acceleration unit A1, acceleration unit A2, acceleration unit A3, ..., acceleration unit An. The acceleration units in the acceleration device 800 are logically arranged in a multi-layer structure (shown by dotted lines), which may include an odd or an even number of layers; each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected to the accelerator cards of another acceleration unit through external ports. Acceleration unit A1 is connected to acceleration unit A2 through external ports, acceleration unit A2 is connected to acceleration unit A3 through external ports, and acceleration unit An-1 is connected to acceleration unit An through external ports. In addition, the last acceleration unit may be connected to the first acceleration unit, so that the acceleration units are connected end to end to form a ring configuration; for example, the external port of accelerator card MC0 of acceleration unit An is shown connected to the external port of accelerator card MC0 of acceleration unit A1. Thus, through the layer-by-layer progressive configuration and combination, each accelerator card can share data over high-speed serial links while processing data at high speed, unlimited interconnection of accelerator cards is realized, customizable computing power requirements are satisfied, and flexible configuration of the hardware computing power of the processor cluster is realized.
It should be noted that there are many possible connection relationships among the acceleration units in the acceleration device of the present disclosure; these have been described in detail above, and reference may be made to the description of the connection relationships of the acceleration units in fig. 6, which will not be repeated here. In addition, there are various ways of connecting the last acceleration unit with the first acceleration unit, which may specifically include: the external port of MC0 in acceleration unit A1 connected with the external port of MC0 in An, the external port of MC1 in acceleration unit A1 connected with the external port of MC1 in An, the external port of MC2 in acceleration unit A1 connected with the external port of MC2 in An, and the external port of MC3 in acceleration unit A1 connected with the external port of MC3 in An. For ease of understanding, an exemplary description follows in conjunction with figs. 14 and 15. It will be understood by those skilled in the art that the acceleration devices shown in figs. 14 and 15 are concrete expressions of the acceleration device 800 shown in fig. 13, and thus the description of the acceleration device 800 of fig. 13 also applies to the acceleration devices of figs. 14 and 15.
Referring to fig. 14, fig. 14 is a schematic diagram of a network topology corresponding to the acceleration device in one embodiment. The acceleration device 801 shown in fig. 14 may be composed of four acceleration units; each circle represents an accelerator card, each line represents a link connection, numeral 0 in a circle represents accelerator card MC0, numeral 1 represents accelerator card MC1, numeral 2 represents accelerator card MC2, and numeral 3 represents accelerator card MC3; the numbers embedded in the vertical lines in the figure represent the port numbers of the connections. The last acceleration unit is connected with the first acceleration unit, and the total Hop count is 5. Each acceleration unit is one node; through the interconnection among the nodes, 4 nodes and 16 cards can be interconnected, and the four internally interconnected acceleration units form a small cluster, namely a supercomputing cluster (super pod). This topology is the recommended scheme for very-large-scale clusters: with high-speed SerDes ports, the total Hop count is 5 and the delay is the lowest, while the cluster is easier to manage and more robust.
Referring to fig. 15, fig. 15 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment. Fig. 15 differs from fig. 14 in that the acceleration device 802 shown in fig. 15 has a larger number of acceleration units. As can be seen from the illustration, the last acceleration unit of the acceleration device 802 is connected to the first acceleration unit. For an acceleration device configured in this way, the total Hop count is the number of nodes plus one, that is, the number of acceleration units plus one.
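The rule just stated can be written as a trivial helper; the following sketch is illustrative only and applies to the ring-of-units topology of figs. 13 to 15.

# Illustrative helper only, encoding the rule stated above for the ring-of-units
# topology of figs. 13 to 15: total Hop count = number of acceleration units + 1.
def ring_of_units_hop_count(num_units: int) -> int:
    return num_units + 1

assert ring_of_units_hop_count(4) == 5      # fig. 14: four acceleration units
print([(n, ring_of_units_hop_count(n)) for n in (4, 8, 16)])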
While the acceleration device including a plurality of acceleration units is exemplarily described above with reference to fig. 13 to 15, according to the technical solution of the present disclosure, there is also provided an acceleration device that may include a plurality of the aforementioned acceleration assemblies, which will be described in detail below with reference to a plurality of embodiments.
Fig. 16 is a schematic diagram of an acceleration device according to still another embodiment of the disclosure, in which the acceleration device 900 may include m acceleration assemblies. In addition to the external ports required to connect the acceleration units inside each acceleration assembly, each acceleration assembly may include spare external ports, and the acceleration assemblies may be connected to each other through these spare external ports: the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B1 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B2, the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B2 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B3, and so on, so that the plurality of acceleration assemblies are connected to each other. It is to be understood that the acceleration device shown in fig. 16 is exemplary rather than limiting; for example, the plurality of acceleration assemblies may have identical or different structures, and the manner of connecting different acceleration assemblies through the spare external ports is not limited to that shown in fig. 16 and may include other manners. For ease of understanding, an exemplary description follows in conjunction with figs. 17 to 25.
Based on the acceleration device provided in fig. 16, reference is further made to fig. 17, which is a schematic diagram of a network topology corresponding to the acceleration device in yet another embodiment. The acceleration device 901 may include two acceleration assemblies, acceleration assembly B1 and acceleration assembly B2, each of which may include four acceleration units; the first acceleration unit in acceleration assembly B1 is connected to the first acceleration unit in acceleration assembly B2, and the last acceleration unit in acceleration assembly B1 is connected to the last acceleration unit in acceleration assembly B2. The total Hop count of this network topology is 9. It will be understood by those skilled in the art that the network structure formed by the multiple acceleration units in each acceleration assembly in fig. 17 is logical, and the physical positions of the acceleration units can be adjusted as required in practical applications. The number of acceleration units in each acceleration assembly is not limited to the four shown in the figure and may be more or fewer, for example six or eight, as needed.
Based on the acceleration device provided in fig. 16, reference is further made to fig. 18, which is a schematic diagram of an acceleration device according to still another embodiment of the present disclosure. The acceleration device 902 may include four acceleration assemblies, namely acceleration assemblies B1, B2, B3, and B4. Each of the four acceleration assemblies may include two acceleration units A1 and A2, and each acceleration assembly may be interconnected, through one of its acceleration units A1 and A2, with one of the acceleration units A1 and A2 of another acceleration assembly. For example, acceleration unit A1 in acceleration assembly B1 is connected to acceleration unit A1 in acceleration assembly B2, acceleration unit A1 in acceleration assembly B2 is connected to acceleration unit A1 in acceleration assembly B3, and acceleration unit A1 in acceleration assembly B3 is connected to acceleration unit A1 in acceleration assembly B4, where the connections are made through the external connection ports of the acceleration units.
It should be noted that, in addition to the connection manner shown in fig. 18, the connections between the acceleration assemblies may take various forms. For example, they may specifically include: acceleration unit A1 or A2 in acceleration assembly B1 connected to acceleration unit A1 or A2 in acceleration assembly B2, acceleration unit A1 or A2 in acceleration assembly B2 connected to acceleration unit A1 or A2 in acceleration assembly B3, and acceleration unit A1 or A2 in acceleration assembly B3 connected to acceleration unit A1 or A2 in acceleration assembly B4.
Based on the acceleration device provided in fig. 18, reference is further made to fig. 19, which is a schematic view of an acceleration device in another embodiment of the present disclosure. In the acceleration device 903 shown in fig. 19, each acceleration assembly may be connected, through one of its first and second acceleration units and by two paths, to one of the first and second acceleration units of another acceleration assembly. For example, the first acceleration unit (e.g., acceleration unit A1) in acceleration assembly B1 and the first acceleration unit (e.g., acceleration unit A1) in acceleration assembly B2 may be connected by two paths, acceleration unit A1 in acceleration assembly B2 and acceleration unit A1 in acceleration assembly B3 are connected by two paths, and acceleration unit A1 in acceleration assembly B3 and acceleration unit A1 in acceleration assembly B4 are connected by two paths.
Note that fig. 19 shows connections of two paths, but in practice connections of two or more paths may be included. Besides the connections shown in fig. 19, the connections between the acceleration assemblies may include other forms; for example, acceleration unit A1 or A2 in acceleration assembly B1 may be connected to acceleration unit A1 or A2 in acceleration assembly B2 by two paths, acceleration unit A1 or A2 in acceleration assembly B2 may be connected to acceleration unit A1 or A2 in acceleration assembly B3 by two paths, and acceleration unit A1 or A2 in acceleration assembly B3 may be connected to acceleration unit A1 or A2 in acceleration assembly B4 by two paths.
Based on the acceleration device provided in fig. 16, further referring to fig. 20, fig. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 904 includes four acceleration components, i.e., acceleration components B1, B2, B3, and B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards. In each acceleration unit, MC0 and MC1 form the first pair of accelerator cards, and MC2 and MC3 form the second pair of accelerator cards. The second pair of accelerator cards of the acceleration unit a1 of the acceleration component B1 is connected to the second pair of accelerator cards of the acceleration unit a2 of the acceleration component B2; the first pair of accelerator cards of the acceleration unit a2 of the acceleration component B2 is connected to the first pair of accelerator cards of the acceleration unit a1 of the acceleration component B3; the second pair of accelerator cards of the acceleration unit a2 of the acceleration component B3 is connected to the second pair of accelerator cards of the acceleration unit a1 of the acceleration component B4; and the first pair of accelerator cards of the acceleration unit a1 of the acceleration component B4 is connected to the first pair of accelerator cards of the acceleration unit a2 of the acceleration component B1.
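As a hedged illustration of the pair-to-pair wiring of fig. 20, the following Python sketch simply lists the four links described above; "P1" and "P2" are hypothetical shorthand for the first pair (MC0, MC1) and the second pair (MC2, MC3) of accelerator cards, and are not labels used in the disclosure itself.

PAIR_LINKS = [
    ("B1.A1.P2", "B2.A2.P2"),  # second pair of B1/a1 <-> second pair of B2/a2
    ("B2.A2.P1", "B3.A1.P1"),  # first pair of B2/a2 <-> first pair of B3/a1
    ("B3.A2.P2", "B4.A1.P2"),  # second pair of B3/a2 <-> second pair of B4/a1
    ("B4.A1.P1", "B1.A2.P1"),  # first pair of B4/a1 <-> first pair of B1/a2
]

for left, right in PAIR_LINKS:
    print(f"{left} <-> {right}")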
Referring to fig. 21, fig. 21 is a schematic diagram of a network topology of another acceleration device. The acceleration device 905 shown in fig. 21 is an embodiment of the acceleration device 904 shown in fig. 20, so the above description of the acceleration device 904 also applies to the acceleration device 905 in fig. 21. As shown in fig. 21, each acceleration component of the acceleration device 905 may form a hybrid stereo network unit, and the interconnection relationship inside each hybrid stereo network unit may be as shown in the figure, implementing an interconnection of 8 nodes and 32 cards in the acceleration device 905. The four acceleration components can realize this multi-card, multi-node interconnection through QSFP-DD interfaces and cables, forming a matrix network topology.
Specifically, in this embodiment, ports 0 of the accelerator cards MC2 and MC3 of the upper node of the acceleration component B1 may be connected to the accelerator cards MC2 and MC3 of the lower node of the acceleration component B2; ports 3 of MC0 and MC1 of the lower node of the acceleration component B2 may be connected to MC0 and MC1 of the upper node of the acceleration component B3; ports 0 of MC2 and MC3 of the lower node of the acceleration component B3 may be connected to MC2 and MC3 of the upper node of the acceleration component B4; and ports 3 of MC0 and MC1 of the upper node of the acceleration component B4 may be connected to MC0 and MC1 of the lower node of the acceleration component B1. The interconnection between the hybrid stereo networks arranged in this way can form two bidirectional ring structures (as described above in conjunction with figs. 5b, 5c, 12b, and 12c), which offers good reliability and safety and is well suited to deep learning training with high operation efficiency. For the acceleration device 905, the matrix network topology composed of 8 nodes has a total hop count of 11.
Further, as shown in fig. 21, the first pair of accelerator cards and the second pair of accelerator cards in different acceleration units of the same acceleration component may be indirectly connected. For example, the accelerator cards MC0 and MC1 of the upper acceleration unit in the acceleration component B1 are indirectly connected with the accelerator cards MC2 and MC3 of the lower acceleration unit.
Based on the network topology of fig. 21, and taking the matrix network topology as a basic unit, the network can be further expanded into a larger topology; fig. 22 is a schematic diagram of an acceleration device based on unlimited expansion of the matrix network topology. As shown in fig. 22, the acceleration device 906 may include a plurality of acceleration components, each acceleration component (shown as a block in the figure) may include a plurality of acceleration units (internal details are not shown; refer to the acceleration component structure in fig. 21), and each acceleration unit may include, for example, four interconnected accelerator cards, so that the matrix network topology can in theory be expanded without limit.
Based on the acceleration device provided in fig. 16, further referring to fig. 23, fig. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 908 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components may be connected in a ring. The acceleration unit An of the acceleration component B1 may be connected to the acceleration unit A1 of the acceleration component B2, the acceleration unit An of the acceleration component B2 may be connected to the acceleration unit A1 of the acceleration component B3, and so on up to the acceleration component Bm, whose acceleration unit An may be connected to the acceleration unit A1 of the acceleration component B1, so that the m acceleration components are connected end to end in a ring.
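The ring connection of the m acceleration components can be sketched as follows. This is a minimal illustration only, assuming the components are labeled B1..Bm and the units A1..An as in the text; the function name is hypothetical.

def ring_links(m: int, n: int):
    """List the end-to-end links of the ring: An of Bi connects to A1 of B(i+1), wrapping."""
    assert m >= 2 and n >= 2
    return [(f"B{i}.A{n}", f"B{i % m + 1}.A1") for i in range(1, m + 1)]

print(ring_links(4, 2))
# [('B1.A2', 'B2.A1'), ('B2.A2', 'B3.A1'), ('B3.A2', 'B4.A1'), ('B4.A2', 'B1.A1')]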
Referring to fig. 24 on the basis of fig. 23, fig. 24 is a schematic diagram of a network topology of another acceleration device. The acceleration device 909 may include 6 acceleration components, each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, so as to form an interconnection of 12 nodes and 48 cards, forming a larger matrix network topology whose total hop count is 13.
Referring to fig. 25 on the basis of fig. 24, fig. 25 is a schematic diagram of a network topology of another acceleration device. The acceleration device 910 includes 8 acceleration components, each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, so as to form an interconnection of 16 nodes and 64 cards, forming a larger matrix network topology whose total hop count is 17.
On the basis of fig. 25, the network can be extended longitudinally without limit to form ultra-large-scale matrix networks such as 20 nodes with 80 cards, 24 nodes with 96 cards, and so on. In theory the extension is unlimited, and the total hop count equals the number of nodes plus one. By optimizing the interconnection mode among the nodes, the latency of the whole system can be minimized, which best meets the system's real-time requirements while processing mass data.
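A small sketch of the hop-count relation stated for these longitudinally extended matrix networks (total hop count = number of nodes + 1); the helper name is illustrative only.

def total_hops(num_nodes: int) -> int:
    """Total hop count of the longitudinally extended matrix network."""
    return num_nodes + 1

assert total_hops(12) == 13  # 12 nodes, 48 cards (fig. 24)
assert total_hops(16) == 17  # 16 nodes, 64 cards (fig. 25)
print(total_hops(20), total_hops(24))  # 21 and 25 hops for 20- and 24-node networks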
While the above description of an acceleration device including a plurality of acceleration components is provided in connection with figs. 16-25, it will be understood by those skilled in the art that the description is illustrative and not limiting; for example, the number and configuration of the acceleration components and the connection relationships between them may be adjusted as desired. One skilled in the art may also combine the above-described embodiments as desired to form an acceleration device, which remains within the scope of the present disclosure.
In addition, it should be noted that the accelerator card matrix, the fully-connected square network (topology), the hybrid stereo network (topology), the matrix network (topology), and the like described in the present disclosure are all logical, and the specific layout form may be adjusted as needed.
The topologies disclosed in the present disclosure may also perform data reduction operations. A reduction operation can be performed in each accelerator card, in each acceleration unit, and in the acceleration device. The specific procedure may be as follows.
Taking the reduce-sum operation as an example, the process of performing a reduce-sum within an acceleration unit may include: transferring the data stored in a first accelerator card to a second accelerator card, and adding, in the second accelerator card, the data originally stored there to the data received from the first accelerator card; the addition result in the second accelerator card is then transferred to a third accelerator card and added again, and so on until the data stored in all accelerator cards have been added and each accelerator card has received the final result.
Taking the acceleration unit shown in fig. 4 as an example, the accelerator card MC0 stores data (0,0), the accelerator card MC1 stores data (1,2), the accelerator card MC2 stores data (3,1), and the accelerator card MC3 stores data (2,4). The data (0,0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1,2) is obtained after the addition; next, the result (1,2) is passed to the accelerator card MC2, yielding the next result (4,3); this result (4,3) is then passed to the accelerator card MC3 to obtain the final result (6,7).
Thereafter, in the reduction operation of the present disclosure, the final result (6,7) continues to be passed to each of the accelerator cards MC0, MC1, MC2, and MC3, so that all of the accelerator cards store the data (6,7), thereby completing the reduction operation within one acceleration unit.
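The reduce-sum walk-through above can be simulated with a few lines of Python. This is a minimal sketch under the assumption that the four accelerator cards are visited in the order MC0, MC1, MC2, MC3; it is not the actual device code.

cards = {"MC0": (0, 0), "MC1": (1, 2), "MC2": (3, 1), "MC3": (2, 4)}
order = ["MC0", "MC1", "MC2", "MC3"]

# Reduce step: each card adds the partial sum received from its predecessor.
partial = cards[order[0]]
for name in order[1:]:
    partial = tuple(a + b for a, b in zip(partial, cards[name]))
# partial is now (6, 7), held by the last card MC3

# Broadcast step: the final result is passed back so every card stores it.
reduced = {name: partial for name in order}
print(reduced)  # every card holds (6, 7)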
The acceleration unit shown in fig. 4 can form two independent rings, and each ring can complete the reduction operation on half of the data, thereby increasing the operation speed and improving the operation efficiency.
In addition, when the acceleration unit performs a reduction operation, it can also compute concurrently on multiple accelerator cards, thereby speeding up the operation. For example, the accelerator card MC0 stores data (0,0), the accelerator card MC1 stores data (1,2), the accelerator card MC2 stores data (3,1), and the accelerator card MC3 stores data (2,4). Part of the data (0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and a result (1) is obtained after addition; at the same time, part of the data (2) in the accelerator card MC1 can be transferred to the accelerator card MC2, and a result (3) is obtained after addition, so that the accelerator cards MC1 and MC2 operate concurrently; and so on until the whole reduction operation is completed.
The concurrent computation may further include performing addition within groups of accelerator cards and then performing a reduction operation on the result of one group and the result of another group. For example, the accelerator card MC0 stores data (0,0), the accelerator card MC1 stores data (1,2), the accelerator card MC2 stores data (3,1), and the accelerator card MC3 stores data (2,4). The data in the accelerator card MC0 can be transferred to the accelerator card MC1 and added to obtain a first set of results (1,2); synchronously or asynchronously, the data in the accelerator card MC2 may be passed into the accelerator card MC3 and added to obtain a second set of results (5,5). The first set of results and the second set of results are then combined to obtain the final reduction result (6,7).
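A minimal sketch of this grouped variant, reproducing the numbers used in the example; the grouping and helper names are illustrative assumptions rather than part of the disclosure.

def vec_add(x, y):
    return tuple(a + b for a, b in zip(x, y))

mc0, mc1, mc2, mc3 = (0, 0), (1, 2), (3, 1), (2, 4)

group1 = vec_add(mc0, mc1)  # (1, 2): MC0 passed into MC1
group2 = vec_add(mc2, mc3)  # (5, 5): MC2 passed into MC3, possibly concurrently

final = vec_add(group1, group2)
print(final)  # (6, 7), matching the final reduction result in the text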
Similarly, in addition to being performed within one acceleration unit, the reduction operation may also be performed within an acceleration component or the acceleration device. It should be understood that the acceleration device may also be regarded as acceleration components connected end to end.
When the reduction operation is performed in an acceleration component or the acceleration device, the method may include: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the multiple acceleration units to obtain a second reduction result.
Again taking the reduce-sum operation as an example, the first step is as described above: for an acceleration device including a plurality of acceleration units, a local reduction operation may first be performed in each acceleration unit, and after the reduction operation in each acceleration unit is completed, the accelerator cards in the same acceleration unit obtain the result of that local reduction operation, which is referred to as the first reduction result.
Next, the first reduction results of all acceleration units may be passed to and added in adjacent acceleration units. Thus, similarly to the reduction operation performed within one acceleration unit, the first acceleration unit passes its first reduction result to the second acceleration unit; after the addition is performed in the accelerator cards of the second acceleration unit, the result is passed on and added again. After the last addition, the final result is conducted back to each acceleration unit.
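The two-level procedure can be sketched as follows. The unit count and data values are hypothetical; this is an illustration of the described flow, not the actual device code.

def vec_add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def reduce_cards(cards):
    """First reduction: combine the data of all cards inside one acceleration unit."""
    result = cards[0]
    for c in cards[1:]:
        result = vec_add(result, c)
    return result

# Two hypothetical acceleration units, four accelerator cards each
unit_a = [(0, 0), (1, 2), (3, 1), (2, 4)]
unit_b = [(1, 1), (0, 2), (2, 0), (1, 3)]

first_results = [reduce_cards(unit_a), reduce_cards(unit_b)]  # per-unit first reduction results
second_result = reduce_cards(first_results)                   # second reduction across units
print(second_result)  # final value conducted back to every acceleration unit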
It should be noted that, since the acceleration components above are not necessarily connected end to end, when the final result is conducted to each acceleration unit it may be propagated in reverse rather than around a ring, as would be possible if the acceleration units were connected end to end. The technical solution of the present disclosure does not specifically limit how the final result is conducted.
Still further, according to an embodiment of the present disclosure, the acceleration device may be further configured to perform a reduction operation that includes: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results of the multiple acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results of the multiple acceleration components to obtain a second reduction result.
In this embodiment, the reduction operation may first be performed within the same acceleration unit, which has already been described above and is not repeated here.
Then, a reduction operation can be performed within each acceleration component, so that each accelerator card in that acceleration component obtains the local reduction result of the component; afterwards, taking the acceleration component as the unit, a reduction operation is performed across the multiple acceleration components, so that each accelerator card obtains the global reduction result of the acceleration device.
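A hedged three-level illustration in Python, matching the unit → component → device order described above; the nested data layout and values are hypothetical.

from functools import reduce

def vec_add(x, y):
    return tuple(a + b for a, b in zip(x, y))

def reduce_level(groups):
    """Reduce each group (list of vectors) to a single partial result."""
    return [reduce(vec_add, g) for g in groups]

# device -> components -> units -> cards (each card holds a 2-vector)
device = [
    [  # component B1
        [(0, 0), (1, 2)],  # unit A1
        [(3, 1), (2, 4)],  # unit A2
    ],
    [  # component B2
        [(1, 1), (0, 2)],  # unit A1
        [(2, 0), (1, 3)],  # unit A2
    ],
]

first = [reduce_level(component) for component in device]  # first reduction: per-unit results
intermediate = reduce_level(first)                         # intermediate reduction: per-component results
second = reduce(vec_add, intermediate)                     # second reduction: global result
print(second)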
It is to be understood that the transfer order described above is merely for convenience of description and need not be followed exactly.

Fig. 26 is a schematic structural diagram of a combined processing device in an embodiment of the present disclosure. As shown in the figure, the combined processing device 2600 may include an acceleration unit 2601, which may be the acceleration unit shown in figs. 1 to 5. In addition, the combined processing device may also include an interconnection interface 2602 and other processing devices 2603. The acceleration unit 2601 according to the present disclosure may interact with the other processing devices 2603 through the interconnection interface 2602 to jointly complete the operations designated by the user.
According to aspects of the present disclosure, the other processing devices may include one or more types of processors, for example, a microcontroller unit (MCU), a baseboard management controller (BMC), a central processing unit (CPU), and the like; their number is not limited and may be determined according to actual needs. In one or more embodiments, the other processing devices may serve as the interface between the acceleration unit of the present disclosure and external data and control, performing basic control including, but not limited to, data transfer and starting and stopping the acceleration unit; the other processing devices may also cooperate with the acceleration unit to complete an operation task.
Optionally, the combined processing device 2600 may further include a storage device 2604, which may be connected with the acceleration unit 2601, the interconnection interface 2602, and the other processing devices 2603, respectively. In one or more embodiments, the storage device 2604 may be used to store data of the acceleration unit 2601 and the other processing devices 2603, particularly data that cannot be entirely held in the internal or on-chip storage of the acceleration unit 2601 or the other processing devices 2603.
In some application scenarios, the combined processing device 2600 of the present disclosure may be used in, for example, a large-scale data center, a supercomputing center, a cloud computing center, etc., and can construct a high-performance processor cluster, thereby implementing real-time processing of mass data.
In some embodiments, the present disclosure also discloses a circuit board, which may include the above acceleration unit. Referring to fig. 27, an exemplary circuit board 2700 is provided. In addition to the one or more acceleration units 2706 (two are shown as an example), the circuit board 2700 may include other components, including but not limited to: a memory device 2701, an interface device 2707, and a control device 2705.
The memory device 2701 may be connected to the acceleration unit 2706 via a bus for storing data. The memory device 2701 may include multiple sets of memory units 2702, and each set of memory units 2702 may be connected to the acceleration unit 2706 via a bus. It is understood that each set of memory units 2702 may be at least one of DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory), HBM (High Bandwidth Memory), and the like.
DDR can double the speed of SDRAM without increasing the clock frequency, because DDR allows data to be read out on both the rising and falling edges of the clock pulse; DDR is therefore twice as fast as standard SDRAM. In one embodiment, the memory device 2701 may include 4 sets of memory units 2702, and each set of memory units 2702 may include multiple DDR4 chips (granules). In one embodiment, the chip may internally include 4 72-bit DDR4 controllers, where 64 bits of each 72-bit controller are used for data transmission and 8 bits are used for ECC checking.
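A quick arithmetic check of the controller description above; the constant names are illustrative only.

DATA_BITS = 64   # data bits per 72-bit DDR4 controller
ECC_BITS = 8     # ECC bits per controller
CONTROLLERS = 4  # controllers in this embodiment

assert DATA_BITS + ECC_BITS == 72
print(CONTROLLERS * DATA_BITS)  # 256 data bits in total across the 4 controllers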
In one embodiment, each set of memory units 2702 may include a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is provided in the acceleration unit 2706 and is used to control the data transfer and data storage of each memory unit.

The interface device 2707 may be connected to the acceleration unit 2706. The interface device 2707 is used to enable data transmission between the acceleration unit 2706 and an external device 2708, such as a server or a computer. For example, in one embodiment, the interface device 2707 may be a standard PCIE interface: the data to be processed is transmitted from the server to the acceleration unit 2706 through the standard PCIE interface to implement the data transfer. In another embodiment, the interface device 2707 may be another interface; the present disclosure does not limit the specific form of such an interface, as long as it can implement the switching function. In addition, the calculation result of the acceleration unit 2706 can be transmitted back to an external device (e.g., a server) through the interface device 2707.

The control device 2705 may be connected to the acceleration unit 2706 and may be used to monitor the state of the acceleration unit 2706. Specifically, the acceleration unit 2706 and the control device 2705 may be electrically connected through an SPI interface. The control device 2705 may include a single-chip microcomputer (MCU).
In some embodiments, the present disclosure also discloses an electronic device or apparatus including the acceleration unit. In some embodiments, the present disclosure also discloses yet another electronic device or apparatus that includes the acceleration assembly described above. In some embodiments, the present disclosure also discloses another electronic device or apparatus, which includes the acceleration apparatus. In some embodiments, the present disclosure also discloses yet another electronic device or apparatus that includes the above circuit board.
According to different application scenarios, the electronic device or apparatus may include, for example, a data processing apparatus, a data center, a super computing center, a cloud computing center, a server, a cloud server, and the like.
In the above embodiments of the present disclosure, the descriptions of the respective embodiments each have their own emphasis; for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments. The technical features of the above embodiments may be combined arbitrarily; for the sake of brevity, not all possible combinations of the technical features are described, but as long as there is no contradiction among them, such combinations should be considered to be within the scope of this specification.
The foregoing may be better understood in light of the following clauses:
Clause 1. An acceleration unit comprising M accelerator cards of the present unit, each of which includes an internal connection port and is connected to other accelerator cards of the present unit through the internal connection port, wherein
the M accelerator cards of the present unit logically form an accelerator card matrix of size L × N, where L and N are integers not less than 2.
Clause 2. The acceleration unit of clause 1, wherein the M accelerator cards of the present unit are logically formed as a 2 × 2, 3 × 3, or 4 × 4 accelerator card matrix.
Clause 3. The acceleration unit of clause 1 or 2, wherein each accelerator card of the present unit is connected to at least one other accelerator card of the present unit via two paths.
Clause 4. The acceleration unit of any of clauses 1-3, wherein the accelerator cards of the present unit located at diagonally opposite corners of the accelerator card matrix are connected by two paths.
Clause 5. The acceleration unit of any of clauses 1-4, wherein at least one of the M accelerator cards of the present unit includes an external port.
Clause 6. The acceleration unit of any of clauses 1-5, wherein, when the acceleration unit includes four accelerator cards of the present unit, each accelerator card of the present unit includes six ports, of which four are internal connection ports for connecting with the other three accelerator cards of the present unit, and the other two ports of at least one accelerator card of the present unit are external ports for connecting with an accelerator card of an external unit.
Clause 7. The acceleration unit of any of clauses 1-6, wherein the internal connection ports and the external ports are SerDes ports.
Clause 8. The acceleration unit of any of clauses 1-7, wherein the acceleration unit is configured to perform a reduction operation on data in the accelerator cards of the acceleration unit to obtain a reduction result.
Clause 9. the acceleration unit of clause 4, wherein the accelerator card matrix includes two independent rings, each ring performing a portion of the operations.
Clause 10. an electronic device including the acceleration unit of any one of clauses 1-9.
It should be understood that the terms "first," "second," "third," and "fourth," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Meanwhile, a person skilled in the art may, according to the idea of the present disclosure, make changes or modifications to the embodiments and application scope of the present disclosure. In view of the above, this description should not be taken as limiting the present disclosure.

Claims (10)

1. An acceleration unit, comprising M accelerator cards of the present unit, each of which includes an internal connection port and is connected with other accelerator cards of the present unit through the internal connection port, wherein
the M accelerator cards of the present unit logically form an accelerator card matrix of size L × N, where L and N are integers not less than 2.
2. The acceleration unit of claim 1, wherein the M accelerator cards of the present unit are logically formed as a 2 × 2, 3 × 3, or 4 × 4 accelerator card matrix.
3. The acceleration unit of claim 1 or 2, wherein each accelerator card of the present unit is connected to at least one other accelerator card of the present unit via two paths.
4. The acceleration unit of any one of claims 1 to 3, wherein the accelerator cards of the present unit located at diagonally opposite corners of the accelerator card matrix are connected by two paths.
5. The acceleration unit of any one of claims 1 to 4, wherein at least one of the M accelerator cards of the present unit includes an external port.
6. The acceleration unit of any one of claims 1 to 5, wherein, when the acceleration unit includes four accelerator cards of the present unit, each accelerator card of the present unit includes six ports, of which four are internal connection ports for connecting with the other three accelerator cards of the present unit, and the other two ports of at least one accelerator card of the present unit are external ports for connecting with an accelerator card of an external unit.
7. The acceleration unit of any one of claims 1 to 6, wherein the internal connection ports and the external ports are SerDes ports.
8. The acceleration unit of any one of claims 1 to 7, wherein the acceleration unit is configured to perform a reduction operation on data in the accelerator cards of the acceleration unit to obtain a reduction result.
9. The acceleration unit of claim 4, wherein the accelerator card matrix comprises two independent rings, each ring performing a portion of the operations.
10. An electronic device comprising an acceleration unit as claimed in any of the claims 1-9.