CN114764488A - Method and equipment for multiplying data matrix in accelerator card matrix


Info

Publication number
CN114764488A
Application number
CN202110055601.2A
Authority
CN (China)
Prior art keywords
acceleration, accelerator, matrix, card, sub
Legal status
Pending
Other languages
Chinese (zh)
Inventor
Not disclosed (不公告发明人)
Current Assignee
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Application filed by Shanghai Cambricon Information Technology Co Ltd
Priority to CN202110055601.2A
Publication of CN114764488A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52: Multiplying; Dividing
    • G06F 7/523: Multiplying only

Abstract

The present disclosure provides a method and apparatus for performing multiplication operations of a data matrix in an accelerator card matrix, which may be implemented in a computing device, wherein the computing device may be included in a combined processing device, which may also include a universal interconnect interface and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further comprise a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices.

Description

Method and equipment for multiplying data matrix in accelerator card matrix
Technical Field
The present disclosure relates to the field of computers, and more particularly, to the field of computer communications.
Background
Currently, with the rapid development of Artificial Intelligence (AI) and machine learning, the demand for ultra-high-performance processors will keep growing, and the big data era likewise places ever higher demands on data processing. High-performance processors and clusters need to process massive data in real time and complete the training and inference of complex models within a specified time. An ASIC (Application-Specific Integrated Circuit) is a dedicated acceleration chip that can be used to train deep neural networks. An ASIC can complete such work in a shorter time while using far less data-center infrastructure than non-parallel-processing supercomputers.
However, when a large amount of data is encountered, even a powerful single ASIC inevitably cannot complete the work alone; to obtain greater computing power, a common scheme employs multiple ASIC acceleration chips. However, for multi-card networks formed by interconnecting a plurality of ASICs, the ultra-high data throughput poses a significant challenge to the data transmission bandwidth of the ASICs. Therefore, how to design an interconnection scheme among the chips to improve the computing power of the whole system and achieve efficient processing of massive data becomes a key technical problem in constructing a high-performance processor cluster.
Disclosure of Invention
In order to solve the above technical problem, the present disclosure provides a method and apparatus capable of improving efficiency of matrix multiplication operation.
According to a first aspect of the present disclosure, there is provided a method of performing multiplication of a data matrix in an accelerator card matrix, wherein the accelerator card matrix includes M accelerator cards logically formed as an accelerator card matrix of L × N scale, L and N are integers not less than 2, adjacent accelerator cards are communicably connected, and the data matrix includes a first data matrix and a second data matrix. The method comprises the following steps: splitting the first data matrix into a plurality of first sub-data matrices, and storing the plurality of first sub-data matrices in the plurality of accelerator cards, respectively; splitting the second data matrix into a plurality of second sub-data matrices, and storing the plurality of second sub-data matrices in the plurality of accelerator cards, respectively; and performing the multiplication operation of the data matrices by passing the first sub-data matrices and the second sub-data matrices within the accelerator card matrix.
According to a second aspect of the present disclosure, there is provided an electronic device comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium comprising computer-executable instructions which, when executed by one or more processors, perform the method as described above.
In the scheme of the present disclosure, the accelerator card matrix is composed of a plurality of accelerator cards, and each accelerator card is connected to the other accelerator cards through its internal ports to realize interconnection among the accelerator cards, so that the computing capacity of the acceleration unit can be effectively improved, which is conducive to increasing the speed of processing massive data. In addition, for the acceleration assembly and the acceleration device, the interconnection among the acceleration units can minimize the time delay of the whole system and satisfy, to the greatest extent, the system's real-time requirements while processing massive data, which helps improve the computing capacity of the whole system and enables it to process massive data at high speed. More specifically, efficient matrix multiplication can be realized.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar or corresponding parts and in which:
FIG. 1a is a schematic diagram illustrating a structure of an acceleration unit according to an embodiment of the present disclosure;
FIGS. 1b, 2, 3, 4 and 5a-5c are schematic structural diagrams of an acceleration unit according to embodiments of the disclosure;
FIGS. 6-11 are various schematic views of an acceleration assembly according to embodiments of the present disclosure;
FIGS. 12a-12c are schematic diagrams of acceleration assemblies represented as a network topology;
FIG. 13 is a schematic view of an acceleration device including multiple acceleration units according to an embodiment of the present disclosure;
FIG. 14 is a diagram illustrating a network topology corresponding to an acceleration device in one embodiment;
FIG. 15 is a schematic diagram of a network topology corresponding to an acceleration device in another embodiment;
FIGS. 16-20 are schematic diagrams of acceleration devices including acceleration assemblies according to embodiments of the present disclosure;
FIG. 21 is a schematic diagram of a network topology of another accelerating device;
FIG. 22 is a schematic diagram of a matrix network topology based on unlimited expansion of an acceleration device;
FIG. 23 is a schematic view of an acceleration device according to yet another embodiment of the disclosure;
FIG. 24 is a schematic diagram of a network topology of yet another acceleration device;
FIG. 25 is a schematic diagram of a network topology of yet another acceleration device;
FIG. 26 is a schematic view of a combination device according to an embodiment of the present disclosure;
fig. 27 is a schematic structural diagram of a board card in an embodiment of the disclosure;
FIG. 28 is a schematic diagram of an exemplary matrix multiplication;
FIG. 29 illustrates a flow diagram of a method of performing a multiplication operation of a matrix of data in an accelerator card matrix according to one embodiment of the disclosure;
FIG. 30 illustrates an example of a multiplication operation performed by a first data matrix and a second data matrix according to one embodiment of the present disclosure;
FIG. 31 is a schematic diagram of an exemplary 3 × 3 accelerator card matrix;
FIG. 32a is a schematic diagram illustrating the transfer of sub-data matrices on a logical level;
FIG. 32b is a schematic diagram of the first sub-data matrix Aij and the second sub-data matrix Bij being transferred in the accelerator card matrix;
FIGS. 33a and 33b are diagrams illustrating the transfer of sub-data matrices on a logical level;
FIG. 33c is a diagram illustrating the transmission of the first sub-data matrix Aij and the second sub-data matrix Bij in the accelerator card matrix;
FIGS. 34a and 34b are schematic diagrams illustrating a first data matrix A, a second data matrix B and an accelerator card matrix P;
FIGS. 35a-35d are schematic diagrams of the transfer of sub-data matrices at the logical level according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, not all embodiments of the present disclosure. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 28 shows an exemplary matrix multiplication diagram.
As shown in fig. 28, a first matrix A and a second matrix B are shown, both of size 5 × 5, which are multiplied to obtain the result matrix C.
The mathematical representation of the matrix multiplication is:
C[i][j] = sum(A[i][k] × B[k][j]), where k = 0, 1, …, n-1 for n × n matrices
From the above mathematical representation, and as shown in fig. 28, summing the products of the elements of the 1st row of the first matrix A with the corresponding elements of the 1st column of the second matrix B yields the element in the 1st row and 1st column of the result matrix C, as highlighted in gray in fig. 28. Matrix multiplication is a common operation in the field of matrix computation and will not be described in detail here.
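For concreteness, the definition above can be written out as a short reference routine. The sketch below is plain Python with no external dependencies; it is illustrative only and not part of the claimed method.

```python
def matmul(A, B):
    """Reference matrix product: C[i][j] = sum over k of A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            # dot product of row i of A and column j of B
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(m))
    return C
```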
Fig. 29 illustrates a flow diagram of a method of performing multiplication operations of data matrices in an accelerator card matrix according to an embodiment of the disclosure, wherein the accelerator card matrix includes M accelerator cards logically formed as an accelerator card matrix of L × N scale, L and N are integers not less than 2, and adjacent accelerator cards are communicably connected with each other; the data matrices include a first data matrix and a second data matrix. The method comprises the following operations: in operation S2910, splitting the first data matrix into a plurality of first sub-data matrices, and storing the plurality of first sub-data matrices in the plurality of accelerator cards, respectively; in operation S2920, splitting the second data matrix into a plurality of second sub-data matrices, and storing the plurality of second sub-data matrices in the plurality of accelerator cards, respectively; in operation S2930, performing the multiplication operation of the data matrices by passing the first sub-data matrices and the second sub-data matrices in the accelerator card matrix.
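The following sketch illustrates operations S2910 and S2920 under stated assumptions: matrix dimensions divide evenly by L and N, card storage is modeled as a plain dictionary keyed by logical position (i, j), and one block of each matrix is assigned per card. A real implementation would copy each block into the corresponding accelerator card's device memory; the function names are hypothetical.

```python
def split_into_blocks(M, L, N):
    """Split matrix M (list of lists) into an L x N grid of sub-matrices."""
    rows, cols = len(M), len(M[0])
    rh, cw = rows // L, cols // N  # assumes rows % L == 0 and cols % N == 0
    return {(i, j): [row[j * cw:(j + 1) * cw] for row in M[i * rh:(i + 1) * rh]]
            for i in range(L) for j in range(N)}

def distribute(A, B, L, N):
    """Store the sub-matrix pair (Aij, Bij) on the card at position (i, j)."""
    a_blocks = split_into_blocks(A, L, N)  # operation S2910
    b_blocks = split_into_blocks(B, L, N)  # operation S2920
    return {pos: {"A": a_blocks[pos], "B": b_blocks[pos]} for pos in a_blocks}
```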
First, various embodiments of an accelerator card matrix will be described in detail below with reference to the accompanying drawings. The accelerator card matrix is composed of a plurality of accelerator cards capable of communicating with each other; the accelerator cards can be communicably connected through different communication paths, and starting from one accelerator card, another accelerator card can be reached via these communication paths, thereby forming different communication topologies. It should be understood that the connection in the following description refers to a communication connection, i.e., the accelerator cards can communicate with each other and transmit data.
Furthermore, the accelerator card matrix described above may be formed as an acceleration unit, an acceleration assembly, an acceleration device, or the like. It should be understood that in this context, although different terms are used depending on the particular scenario, they are essentially systems that include multiple accelerator cards.
FIG. 1a is a schematic diagram illustrating a structure of an acceleration unit according to an embodiment of the present disclosure. According to an embodiment of the present disclosure, the accelerator card matrix may include one acceleration unit, and the acceleration unit may include M local accelerator cards, each local accelerator card including internal connection ports and each being connected to the other local accelerator cards through the internal connection ports, where the M local accelerator cards are logically formed as an accelerator card matrix of L × N scale, and L and N are integers not less than 2.
As shown in fig. 1a, an accelerator card matrix may be formed by a plurality of accelerator cards, which are connected to each other to enable data or command transfer and communication. For example, accelerator cards MC00 through MC0N form row 0 of the accelerator card matrix, accelerator cards MC10 through MC1N form row 1 of the accelerator card matrix, and so on, accelerator cards MCL0 through MCLN form row L of the accelerator card matrix.
It is to be understood that, for ease of reading, an accelerator card in the same acceleration unit is referred to as a "present-unit accelerator card", and an accelerator card in another acceleration unit is referred to as an "external-unit accelerator card". Such designations are merely for convenience of description and do not limit the technical aspects of the present disclosure.
Each accelerator card may have a plurality of ports, and these ports may be connected to present-unit accelerator cards or to external-unit accelerator cards. In the present disclosure, a port connecting present-unit accelerator cards may be referred to as an internal port, and a port connecting a present-unit accelerator card with an external-unit accelerator card may be referred to as an external port. It is to be understood that the terms external port and internal port are merely for convenience of description, and the same port may serve both purposes. This will be described below.
It is to be understood that M may be any integer, and that M accelerator cards may be formed into a 1 × M or M × 1 matrix, or into other matrix shapes. The acceleration unit of the present disclosure is not limited to a specific matrix size or form.
Furthermore, accelerator cards, whether two present-unit accelerator cards, or a present-unit accelerator card and an external-unit accelerator card, can be connected through a single communication path or through multiple communication paths. This will be described in detail later.
It should also be understood that, in the context of the present disclosure, although the positions of the accelerator cards are all described in terms of a rectangular network, the cards need not be physically arranged as a matrix and may in fact occupy any positions; for example, the accelerator cards may form a straight line or be irregularly arranged. The matrix described above is only logical, as long as the connections between the accelerator cards form a matrix relationship.
According to one embodiment of the present disclosure, M may be 4, whereby 4 present-unit accelerator cards may be logically formed as a 2 × 2 accelerator card matrix; M may be 9, whereby 9 present-unit accelerator cards may be logically formed as a 3 × 3 accelerator card matrix; M may be 16, whereby 16 present-unit accelerator cards may be logically formed as a 4 × 4 accelerator card matrix. M may also be 6, whereby 6 present-unit accelerator cards may be logically formed as a 2 × 3 or 3 × 2 accelerator card matrix; M may also be 8, whereby 8 present-unit accelerator cards may be logically formed as a 2 × 4 or 4 × 2 accelerator card matrix.
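The admissible shapes for a given card count M follow directly from the constraint that L and N are integers not less than 2. The small helper below is hypothetical and for illustration only; note that it enumerates all valid factorizations, of which the embodiment above names only some (e.g. 16 also admits 2 × 8).

```python
def matrix_shapes(m):
    """All logical L x N arrangements of m cards with L, N >= 2."""
    return [(l, m // l) for l in range(2, m) if m % l == 0 and m // l >= 2]

print(matrix_shapes(4))   # [(2, 2)]
print(matrix_shapes(6))   # [(2, 3), (3, 2)]
print(matrix_shapes(16))  # [(2, 8), (4, 4), (8, 2)]
```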
According to one embodiment of the present disclosure, each of the unit accelerator cards is connected to at least one other of the unit accelerator cards via two paths.
In the topology described in this disclosure, two local unit accelerator cards may be connected via a single communication path, or may be connected via multiple (e.g., two) paths, as long as the number of ports is sufficient. The connection through the plurality of communication paths is beneficial to ensuring the reliability of communication between the accelerator cards and is beneficial to forming different topological structures. This will be explained and described in more detail in the examples below.
According to one embodiment of the present disclosure, the present-unit accelerator cards at diagonally opposite corners of the accelerator card matrix are connected by two paths. For a matrix, it may be preferable to connect the two pairs of accelerator cards located at opposite corners, and for some topologies, connecting the accelerator cards at diagonal positions helps form two complete communication loops. This will be explained and described in more detail in the examples below.
More specifically, according to one embodiment of the present disclosure, at least one of the unit accelerator cards may include an external port. For example, each acceleration unit may include four present unit acceleration cards, each present unit acceleration card may include six ports, and four ports of each present unit acceleration card are internal ports for connecting with three other present unit acceleration cards; the other two ports of at least one acceleration card of the unit are external ports and are used for being connected with an external unit acceleration card.
It should be understood that four of the six ports of each accelerator card of the present unit may be used to connect the accelerator card of the present unit, and the two ports that are left vacant may be used to connect the accelerator cards in the other accelerator units. These spare ports may also be free ports, not connected to any external device, or connected directly or indirectly to other devices or ports.
For purposes of example and simplicity, the acceleration unit, acceleration assembly, acceleration device, and electronic device are described below with each acceleration unit including four accelerator cards. It should be understood that each acceleration unit may include a greater or lesser number of accelerator cards.
For convenience of description, the acceleration unit may include four accelerator cards, that is, a first accelerator card, a second accelerator card, a third accelerator card, and a fourth accelerator card, where each accelerator card is provided with an internal port and an external port, and each accelerator card is connected to the other three accelerator cards through the internal port.
FIG. 1b is a schematic diagram of an acceleration unit according to an embodiment of the present disclosure. The acceleration unit 100 includes four accelerator cards, namely accelerator card MC0, accelerator card MC1, accelerator card MC2, and accelerator card MC3. Each accelerator card may include an external port and internal ports; the internal port of accelerator card MC0 is connected to the internal ports of accelerator cards MC1, MC2 and MC3, the internal port of accelerator card MC1 is connected to the internal ports of accelerator cards MC2 and MC3, and the internal port of accelerator card MC2 is connected to the internal port of accelerator card MC3; that is, the internal ports of each accelerator card are connected to the internal ports of the other three accelerator cards. Information interaction among the four accelerator cards can be realized through the interconnection of their internal ports. By exploiting the interconnection among the four accelerator cards in the acceleration unit, the embodiment of the disclosure can improve the computing power of the acceleration unit and achieve high-speed processing of massive data, while making the path between each accelerator card and the others shortest and the communication delay lowest.
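The full interconnection just described can be recorded as a simple adjacency table. The sketch below (card names MC0-MC3 follow FIG. 1b) merely expresses that every card is one hop from every other card; it is an illustrative representation, not part of the claimed apparatus.

```python
# Fully-connected quad of FIG. 1b: each card reaches the other three directly.
FULLY_CONNECTED_QUAD = {
    "MC0": {"MC1", "MC2", "MC3"},
    "MC1": {"MC0", "MC2", "MC3"},
    "MC2": {"MC0", "MC1", "MC3"},
    "MC3": {"MC0", "MC1", "MC2"},
}
assert all(len(peers) == 3 for peers in FULLY_CONNECTED_QUAD.values())
```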
As described above, the number of accelerator cards in the present disclosure is not limited to four and may be other numbers. For example, in one embodiment, the number M of accelerator cards equals 3: each accelerator card is provided with internal ports and external ports, and each accelerator card is connected with the other two accelerator cards through internal ports, realizing interconnection among the three accelerator cards. In another embodiment, the number M of accelerator cards equals 5: each accelerator card is connected with the other four accelerator cards through internal ports, realizing interconnection among the five accelerator cards, thereby improving the computing power of the acceleration unit and enabling high-speed processing of massive data. In yet another embodiment, the number M of accelerator cards is greater than 5: each accelerator card is connected with all other accelerator cards through internal ports, realizing interconnection among the M accelerator cards and high-speed processing of massive data.
Based on the acceleration unit 100 provided in fig. 1b, each accelerator card may further be connected to at least one other accelerator card through two paths. Specifically, there may be, for example, three connection modes: in the first mode, each accelerator card is connected with one of the other three accelerator cards through two paths; in the second mode, each accelerator card is connected with two of the other three accelerator cards through two paths; in the third mode, each accelerator card is connected with all three other accelerator cards through two paths, in which case each accelerator card may have more ports. To facilitate understanding of connection via two paths, the first connection mode will be taken as an example and described in conjunction with fig. 2.
Fig. 2 is a schematic diagram of an acceleration unit according to another embodiment of the disclosure. In the acceleration unit 200 shown in fig. 2, each accelerator card and at least one other accelerator card may be connected by two paths; for example, the illustrated accelerator card MC0 and accelerator card MC2 may be connected by two paths, and the illustrated accelerator card MC1 and accelerator card MC3 may be connected by two paths. With this arrangement, two links (or paths) for information interaction are provided between the two accelerator cards, so that when one link fails, the other link still connects the two accelerator cards, which can effectively improve the safety of the acceleration unit.
While the connections among the plurality of accelerator cards in the acceleration unit are described above with reference to figs. 1b and 2, it will be understood by those skilled in the art that the above description is exemplary and not limiting; for example, the arrangement of the accelerator cards in the acceleration unit is not limited to the forms shown in figs. 1b and 2, and in one embodiment, the four accelerator cards of the acceleration unit may be logically arranged in a quadrilateral, as described below with reference to fig. 3.
Fig. 3 is a schematic structural diagram of an acceleration unit according to another embodiment of the present disclosure. In the acceleration unit 300 shown in FIG. 3, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically arranged in a quadrilateral, with the four accelerator cards occupying its four vertex positions. The lines among accelerator cards MC0, MC1, MC2 and MC3 form a quadrilateral, which makes the wiring clearer and easier to arrange. It should be noted that the four accelerator cards shown in fig. 3 are drawn as a rectangle, or a 2 × 2 matrix, but this is a logical interconnection diagram drawn in rectangular form for convenience of description; the specific quadrilateral may be laid out freely, such as a parallelogram, trapezoid, or square. In the actual layout and wiring, the four accelerator cards may be arranged arbitrarily; for example, in an actual whole machine, the four accelerator cards may be arranged in parallel along a straight line, in the order MC0, MC1, MC2, MC3. It should also be understood that the logical quadrilateral described in this embodiment is exemplary; in fact, the arrangement shape of multiple accelerator cards may vary, the quadrilateral being only one option. For example, when the number of accelerator cards is five, the accelerator cards may be logically arranged in a pentagon.
Based on the connection relationship of the acceleration unit 200 provided in fig. 2, further reference is made to fig. 4, which is a schematic structural diagram of an acceleration unit in another embodiment of the present disclosure. In the acceleration unit 400 shown in FIG. 4, the four accelerator cards MC0, MC1, MC2, and MC3 may be logically arranged in a quadrilateral, with the four accelerator cards occupying the four vertex positions of the quadrilateral. As further shown, the internal port of accelerator card MC1 may be connected to the internal port of accelerator card MC3 by two paths, and the internal port of accelerator card MC0 may be connected to the internal port of accelerator card MC2 by two paths. Thus, for the acceleration unit 400, not only is the wiring convenient to arrange, but safety is also improved.
Fig. 5a is a schematic structural diagram of an acceleration unit according to an embodiment of the present disclosure. In the acceleration unit 500 shown in fig. 5a, the number labels on each acceleration card represent ports, and each acceleration card may include six ports, i.e., port 0, port 1, port 2, port 3, port 4, and port 5. The port 1, the port 2, the port 4 and the port 5 are internal ports, and the port 0 and the port 3 are external ports. For the four accelerator cards MC0, MC1, MC2 and MC3, 2 external ports of each accelerator card can be connected with other accelerator units for interconnection among the plurality of accelerator units. The 4 internal ports of each accelerator card can be used to interconnect with the other three accelerator cards in the present accelerator unit.
As further shown in fig. 5a, four accelerator cards may be logically arranged in a quadrilateral, for example, accelerator card MC0 and accelerator card MC2 may be in a diagonal relationship, port 2 of MC0 is connected with port 2 of MC2, and port 5 of MC0 is connected with port 5 of MC2, i.e., there may be two links between accelerator card MC0 and accelerator card MC2 for communication. Accelerator card MC1 and accelerator card MC3 may be in a diagonal relationship, with port 2 of MC1 connected to port 2 of MC3, and port 5 of MC1 connected to port 5 of MC3, i.e., there may be two links between accelerator card MC1 and accelerator card MC3 for communication.
According to this arrangement, each accelerator card is provided with two external ports and four internal ports, and in the two pairs of accelerator cards in a diagonal relationship, the two accelerator cards of each pair can be connected through two internal ports to form two links, which can effectively improve the safety and stability of the acceleration unit. Moreover, since the four accelerator cards are logically arranged in a quadrilateral, the circuit layout of the whole acceleration unit is reasonable and clear, facilitating the wiring operation within each acceleration unit. It should further be noted that, among the interconnection lines between the four accelerator cards, the connection between port 1 of accelerator card MC1 and port 1 of MC0, the connection between port 2 of accelerator card MC0 and port 2 of MC2, the connection between port 1 of accelerator card MC2 and port 1 of MC3, and the connection between port 2 of accelerator card MC3 and port 2 of MC1 form an upright figure-8 network, as shown in fig. 5b. The connection between port 4 of accelerator card MC1 and port 4 of MC2, the connection between port 5 of accelerator card MC2 and port 5 of MC0, the connection between port 4 of accelerator card MC0 and port 4 of MC3, and the connection between port 5 of accelerator card MC3 and port 5 of MC1 form a transverse figure-8 network, as shown in fig. 5c. These two fully-connected quad networks form a double-ring structure, providing redundancy backup and enhancing system reliability.
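For reference, the port-level wiring of the two figure-8 rings can be written out as data. The sketch below simply transcribes the links named above (card names and port numbers are taken from FIGS. 5a-5c); it is illustrative, not part of the claimed apparatus.

```python
# Each link joins two (card, internal-port) endpoints.
UPRIGHT_EIGHT = [          # FIG. 5b: ports 1 and 2
    (("MC1", 1), ("MC0", 1)),
    (("MC0", 2), ("MC2", 2)),
    (("MC2", 1), ("MC3", 1)),
    (("MC3", 2), ("MC1", 2)),
]
TRANSVERSE_EIGHT = [       # FIG. 5c: ports 4 and 5
    (("MC1", 4), ("MC2", 4)),
    (("MC2", 5), ("MC0", 5)),
    (("MC0", 4), ("MC3", 4)),
    (("MC3", 5), ("MC1", 5)),
]

def cards_on_ring(ring):
    """Set of cards touched by a ring's links."""
    return {card for link in ring for card, _ in link}

# Each ring passes through all four cards, so the two rings give the unit
# two independent communication loops.
assert cards_on_ring(UPRIGHT_EIGHT) == cards_on_ring(TRANSVERSE_EIGHT) == {
    "MC0", "MC1", "MC2", "MC3"}
```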
According to an embodiment of the present disclosure, the accelerator card of the present disclosure may be a Mezzanine Card (MC card for short), which may be a single circuit board. The MC card can carry an ASIC chip and some necessary peripheral control circuits, and may be connected to the substrate by a snap connector, through which power and control signals on the substrate can be transmitted to the MC card. According to another embodiment of the present disclosure, the internal port and/or the external port described in the present disclosure may be a SerDes port. For example, in one embodiment, each MC card may provide 6 bidirectional SerDes ports, each SerDes port having 8 lanes and a data transmission rate of 56 Gbps, so that the total bandwidth of each port may be as high as 400 Gbps, which can support massive data exchange between accelerator cards and facilitate high-speed processing of massive data by the acceleration unit.
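A quick arithmetic check of the quoted figures, under the assumption (implied by the numbers but not stated in the text) that 56 Gbps is the per-lane rate: 8 lanes give 448 Gbps of raw aggregate bandwidth per port, and the "up to 400 Gbps" figure would then reflect encoding and protocol overhead, which the text does not break down.

```python
lanes = 8
lane_rate_gbps = 56                 # assumed per-lane rate
raw_gbps = lanes * lane_rate_gbps   # 448 Gbps raw aggregate
print(raw_gbps)                     # quoted usable figure: ~400 Gbps
```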
SerDes, as mentioned above, is a compound of the words Serializer and De-Serializer. A SerDes interface may be used to build a high-performance processor cluster. The main function of SerDes is to convert multiple low-speed parallel signals into a serial signal at the sending end, transmit it through a transmission medium, and finally convert the high-speed serial signal back into low-speed parallel signals at the receiving end, which makes SerDes very suitable for end-to-end, long-distance, high-speed transmission. In another embodiment, the external port of an accelerator card can be connected to a QSFP-DD interface of another acceleration unit, where the QSFP-DD interface is an optical module interface commonly used with SerDes technology and can be used, together with a cable, for interconnection with other external devices.
Further, according to another embodiment of the present disclosure, 4 accelerator cards may be mounted inside one acceleration unit, and the interconnection of the 4 accelerator cards may be completed using PCB traces. On a high-speed board with a low dielectric constant, signal integrity can be guaranteed to the greatest extent through reasonable layout and wiring, thereby ensuring that the communication bandwidth among the four accelerator cards approaches the theoretical value.
In the acceleration unit of the present disclosure, each of the four accelerator cards is connected with the other three through its internal ports, so each accelerator card can communicate directly with the other three. Adopting this fully-connected network topology (fully-connected quad) as the communication architecture makes the path between each accelerator card and the others shortest, the total Hop count minimal, and the delay minimal. The present disclosure describes the time delay of the system in terms of Hop, which represents the number of hops in the communication, i.e., the number of communications; specifically, Hop denotes the length of the shortest path that starts from an initial node, traverses all nodes in the network, and returns to the initial node. The 4 interconnected accelerator cards form a fully-connected quad topology with the shortest delay, and the double-ring structure formed by interconnecting the two diagonal pairs of accelerator cards improves the robustness of the system, so that services can still run normally when a single accelerator card fails. When various arithmetic and logic operations are carried out, each ring in the double-ring structure can complete a part of the operations, improving overall operation efficiency and maximizing use of the topology bandwidth.
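As an illustration of the Hop metric defined above, the sketch below brute-forces the shortest closed tour over a topology given as an adjacency table. It is a simplification that does not consider revisiting nodes, which suffices for the fully-connected quad; the function name and representation are assumptions for illustration.

```python
from itertools import permutations

def hop_count(adj, start):
    """Length of the shortest closed tour from `start` visiting every node."""
    best = None
    others = [n for n in adj if n != start]
    for order in permutations(others):
        tour = (start,) + tuple(order) + (start,)
        if all(b in adj[a] for a, b in zip(tour, tour[1:])):
            hops = len(tour) - 1
            best = hops if best is None else min(best, hops)
    return best

quad = {c: {d for d in ("MC0", "MC1", "MC2", "MC3") if d != c}
        for c in ("MC0", "MC1", "MC2", "MC3")}
print(hop_count(quad, "MC0"))  # 4: any ring tour over 4 cards takes 4 hops
```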
While various embodiments of the acceleration unit according to the present disclosure have been described above with reference to fig. 1a to 5c, the present disclosure also discloses an acceleration assembly that may include a plurality of the above-described acceleration units, and will be described below with reference to various embodiments of the acceleration assembly.
FIG. 6 is a schematic diagram of an acceleration assembly according to an embodiment of the present disclosure. As shown in fig. 6, the acceleration assembly 600 may include n acceleration units; in other words, the accelerator card matrix may be embodied as an acceleration assembly including a plurality of acceleration units, i.e., acceleration unit A1, acceleration unit A2, acceleration unit A3, …, acceleration unit An, wherein acceleration unit A1 and acceleration unit A2 are connected through external ports, and acceleration unit A2 and acceleration unit A3 are connected through external ports; that is, the acceleration units are connected through the external ports of their accelerator cards. In one embodiment, the external port of accelerator card MC0 in acceleration unit A1 may be connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC0 in acceleration unit A2 may be connected to the external port of accelerator card MC0 in acceleration unit A3; that is, the acceleration units are connected via the external ports of accelerator card MC0.
It will be appreciated by those skilled in the art that the connection between acceleration units in the present disclosure is not limited to connecting the external ports of accelerator card MC0, but may also include, for example, one or more of: connecting the external ports of accelerator card MC1, connecting the external ports of accelerator card MC2, and connecting the external ports of accelerator card MC3. That is, in the present disclosure, the connection of acceleration unit A1 and acceleration unit A2 may include: the external port of MC0 in A1 connected with the external port of MC0 in A2, the external port of MC1 in A1 connected with the external port of MC1 in A2, the external port of MC2 in A1 connected with the external port of MC2 in A2, and the external port of MC3 in A1 connected with the external port of MC3 in A2. Similarly, the connection of acceleration unit A2 and acceleration unit A3 may include: the external port of MC0 in A2 connected with the external port of MC0 in A3, the external port of MC1 in A2 connected with the external port of MC1 in A3, the external port of MC2 in A2 connected with the external port of MC2 in A3, and the external port of MC3 in A2 connected with the external port of MC3 in A3; and so on, up to the connection of acceleration unit An-1 to acceleration unit An. It should be noted that the above description is exemplary; for example, the connection between different acceleration units need not be limited to accelerator cards with corresponding reference numbers, and may also be set between accelerator cards with non-corresponding reference numbers as required.
It should be noted that fig. 6 shows n acceleration units with n greater than 3, but the number of acceleration units is not limited to being greater than 3 and may also be set to, for example, 2 or 3; the connection relationship between two acceleration units is the same as or similar to that between acceleration units A1 and A2, and the connection relationship among three acceleration units is the same as or similar to that among acceleration units A1, A2 and A3, which are not described again here.
In addition, the structures of the plurality of acceleration units in the acceleration assembly may be the same or different; for convenience of illustration, fig. 6 shows them as identical, but in practice they may differ. For example, in some acceleration units the accelerator cards are laid out as a polygon, while in others they are laid out in a line; in some acceleration units the accelerator cards are connected by a single link, while in others by two links; some acceleration units include four accelerator cards, while others include three or five. That is, the structure of each acceleration unit can be set independently, and the structures of different acceleration units may be the same or different.
The acceleration assembly of the present disclosure can interconnect not only the accelerator cards within each acceleration unit but also the accelerator cards of different acceleration units, so that a hybrid three-dimensional network can be constructed. With this arrangement, each accelerator card can process data and share data through the interconnection among the acceleration units, and shared data can be acquired directly, which shortens the data propagation path and time and improves data processing efficiency.
FIG. 7 is a schematic view of an acceleration assembly according to another embodiment of the present disclosure. As shown in fig. 7, the acceleration assembly 700 may include n acceleration units, i.e., acceleration unit A1, acceleration unit A2, acceleration unit A3, …, acceleration unit An, and the plurality of acceleration units in the acceleration assembly 700 may logically form a multi-layer structure (shown by dotted lines in the figure), where each layer may include one acceleration unit and the accelerator cards of each acceleration unit are connected to the accelerator cards of another acceleration unit through external ports. Thus, through layer-by-layer configuration and combination, each accelerator card can share data through high-speed serial links while processing data at high speed, realizing unlimited interconnection of accelerator cards, satisfying customizable computing power requirements, and enabling flexible configuration of processor cluster hardware computing power. As further shown in the figure, the acceleration unit of each layer may include four accelerator cards, logically arranged in a quadrilateral with the four accelerator cards at its four vertex positions.
It should be understood by those skilled in the art that the acceleration assembly described above in connection with fig. 7 is exemplary and not limiting. For example, the structures of the plurality of acceleration units may be the same or different. The number of layers of the acceleration assembly can be 2, 3, 4 or more, set freely as needed. For each two connected acceleration units, the number of connection paths between them can be 1, 2, 3 or 4. For ease of understanding, an exemplary description follows in conjunction with figs. 8-12.
FIG. 8 is a schematic structural diagram of an acceleration assembly according to yet another embodiment of the present disclosure. As shown in fig. 8, the number of acceleration units in the acceleration assembly 701 may be 2, and the two acceleration units are connected through one path; specifically, the connection between acceleration unit A1 and acceleration unit A2 may be implemented, for example, by connecting the external port of accelerator card MC0 in acceleration unit A1 with the external port of accelerator card MC0 in acceleration unit A2.
As shown in fig. 9, the number of acceleration units in the acceleration assembly 702 may be 2, and the two acceleration units are connected through two paths: the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, and the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2. Therefore, when one path fails, the other path supports communication between the acceleration units, further improving the safety of the acceleration assembly.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an acceleration assembly according to another embodiment of the disclosure. As shown in fig. 10, in the acceleration assembly 703, the number of acceleration units may be 2, and the two acceleration units are connected through three paths: the external port of accelerator card MC0 in acceleration unit A1 is connected with the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected with the external port of accelerator card MC1 in acceleration unit A2, and the external port of accelerator card MC2 in acceleration unit A1 is connected with the external port of accelerator card MC2 in acceleration unit A2. Thus, even when two paths fail, the remaining path supports communication between the acceleration units, further improving the safety of the acceleration assembly.
Referring to fig. 11, fig. 11 is a schematic structural diagram of an acceleration assembly according to another embodiment of the disclosure. In the acceleration assembly 704 shown in fig. 11, the number of acceleration units may be 2, and the two acceleration units may be connected through four paths: for example, the external port of accelerator card MC0 in acceleration unit A1 is connected to the external port of accelerator card MC0 in acceleration unit A2, the external port of accelerator card MC1 in acceleration unit A1 is connected to the external port of accelerator card MC1 in acceleration unit A2, the external port of accelerator card MC2 in acceleration unit A1 is connected to the external port of accelerator card MC2 in acceleration unit A2, and the external port of accelerator card MC3 in acceleration unit A1 is connected to the external port of accelerator card MC3 in acceleration unit A2. Thus, even when three paths fail, the remaining path supports communication between the acceleration units, further improving the safety of the acceleration assembly.
FIG. 12a is a schematic diagram of an acceleration assembly represented as a network topology. As shown in fig. 12a, the acceleration assembly 705 may include two acceleration units, each of which may include four accelerator cards; in each acceleration unit there may be two links between accelerator card MC1 and accelerator card MC3, and two links between accelerator card MC0 and accelerator card MC2. The acceleration assembly 705 in the left diagram of fig. 12a may be drawn in the three-dimensional representation shown in the right diagram. In the right diagram of fig. 12a, circles represent accelerator cards and lines represent link connections; numeral 0 represents accelerator card MC0, numeral 1 represents accelerator card MC1, numeral 2 represents accelerator card MC2, and numeral 3 represents accelerator card MC3. The right diagram shows the same acceleration assembly 705, expressed as a network topology. The numbers embedded in the vertical lines in the right diagram indicate the port numbers of the connections; for example, MC0 and MC1 in the two acceleration units are connected through port 0, and MC2 and MC3 through port 3, respectively.
In the right diagram of fig. 12a, each acceleration unit is considered one node, and the two nodes together have 8 accelerator cards, i.e., the two nodes constitute a so-called 8-card interconnect. The four-card interconnection inside each node is fixed; when the two nodes are interconnected, MC0 and MC1 in the upper node (i.e., acceleration unit A1) are connected with MC0 and MC1 of the lower node (i.e., acceleration unit A2) through port 0, respectively, while MC2 and MC3 of the upper node are connected to MC2 and MC3 of the lower node through port 3, respectively. This node topology is called a Hybrid Cube Mesh network topology; that is, the acceleration assembly 705 is a hybrid cube mesh network topology.
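The inter-node links of this 8-card hybrid cube mesh can be listed explicitly. In the sketch below, the naming "A1.MC0" (card MC0 of unit A1) is an assumption introduced for illustration; the port numbers follow the description above.

```python
# (card in upper node, card in lower node, external port used on both ends)
HYBRID_CUBE_LINKS = [
    ("A1.MC0", "A2.MC0", 0),
    ("A1.MC1", "A2.MC1", 0),
    ("A1.MC2", "A2.MC2", 3),
    ("A1.MC3", "A2.MC3", 3),
]
```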
In the 8-card topology shown in fig. 12a, two separate rings may also be formed, which allows the topology bandwidth to be used maximally for reduction operations, as shown in figs. 12b and 12c.
In fig. 12b, accelerator cards MC1 and MC3 in acceleration unit A1 are connected via their respective internal ports 5, accelerator cards MC0 and MC2 via their respective internal ports 5, and accelerator cards MC2 and MC3 via their respective internal ports 1; accelerator card MC1 in acceleration unit A1 and accelerator card MC1 in acceleration unit A2 are connected via their respective external ports 0, and accelerator card MC0 in acceleration unit A1 and accelerator card MC0 in acceleration unit A2 are connected via their respective external ports 0. Thus, one independent loop is formed among the 8 cards in fig. 12b.
In fig. 12c, accelerator cards MC1 and MC3 in acceleration unit A1 are connected via their respective internal ports 2, accelerator cards MC0 and MC2 via their respective internal ports 2, and accelerator cards MC0 and MC1 via their respective internal ports 1; accelerator card MC2 in acceleration unit A1 and accelerator card MC2 in acceleration unit A2 are connected via their respective external ports 3, and accelerator card MC3 in acceleration unit A1 and accelerator card MC3 in acceleration unit A2 are connected via their respective external ports 3. Thus, another independent loop is formed among the 8 cards in fig. 12c.
Only two exemplary connection ways are shown above, but in practice the four connection paths between the two acceleration units are effectively equivalent, so any one to three of these four paths may be used to connect the two acceleration units and form a loop together with the accelerator cards within each acceleration unit. This will not be described in further detail here.
FIG. 13 is a schematic view of an acceleration device according to yet another embodiment of the disclosure. As shown in fig. 13, the acceleration device 800 may include n acceleration units, i.e., acceleration unit A1, acceleration unit A2, acceleration unit A3, …, acceleration unit An, and the plurality of acceleration units in the acceleration device 800 logically form a multi-layer structure (shown by dotted lines), which may include an odd or even number of layers; each layer may include one acceleration unit, and the accelerator cards of each acceleration unit are connected to the accelerator cards of another acceleration unit through external ports, wherein acceleration unit A1 and acceleration unit A2 are connected through external ports, acceleration unit A2 and acceleration unit A3 are connected through external ports, and acceleration unit An-1 and acceleration unit An are connected through external ports. Moreover, the last acceleration unit may be connected to the first acceleration unit so that the acceleration units are connected end-to-end in a ring configuration; for example, the external port of accelerator card MC0 of acceleration unit An is shown connected to the external port of accelerator card MC0 of acceleration unit A1. Therefore, through layer-by-layer configuration and combination, each accelerator card can share data through high-speed serial links while processing data at high speed, realizing unlimited interconnection of accelerator cards, satisfying customizable computing power requirements, and enabling flexible configuration of processor cluster hardware computing power.
It should be noted that there are many possible connection relationships among the acceleration units in the acceleration device of the present disclosure, as detailed above; reference may be made to the description of the connection relationships of the acceleration units in fig. 6, which is not repeated here. In addition, there are various ways of connecting the last acceleration unit with the first acceleration unit, which may specifically include: the external port of MC0 in acceleration unit A1 connected with the external port of MC0 in An, the external port of MC1 in acceleration unit A1 connected with the external port of MC1 in An, the external port of MC2 in acceleration unit A1 connected with the external port of MC2 in An, and the external port of MC3 in acceleration unit A1 connected with the external port of MC3 in An. For ease of understanding, an exemplary description follows in conjunction with figs. 14 and 15, and it will be understood by those skilled in the art that the acceleration devices shown in figs. 14 and 15 are embodiments of the acceleration device 800 shown in fig. 13, so the description of the acceleration device 800 of fig. 13 also applies to the acceleration devices of figs. 14 and 15.
Referring to fig. 14, fig. 14 is a schematic diagram of a network topology corresponding to the acceleration device in one embodiment. The acceleration device 801 shown in fig. 14 may be composed of four acceleration units; each circle represents an accelerator card and each line represents a link connection, where numeral 0 in a circle represents accelerator card MC0, numeral 1 represents accelerator card MC1, numeral 2 represents accelerator card MC2, and numeral 3 represents accelerator card MC3; the numbers embedded in the vertical lines represent the port numbers of the connections. The last acceleration unit is connected with the first acceleration unit, and the total hop count is 5. Each acceleration unit is a node; through the interconnection among nodes, 4 nodes and 16 cards can be interconnected, and the four acceleration units form a small, internally interconnected cluster, namely a supercomputing cluster (super pod). This topology is a primary mode for very-large-scale clusters: it adopts high-speed SerDes ports, the total Hop count is 5, and the delay is the lowest. The cluster is also more manageable and more robust.
Referring to fig. 15, fig. 15 is a schematic diagram of a network topology corresponding to the acceleration device in another embodiment. Fig. 15 differs from fig. 14 in that the acceleration device 802 shown in fig. 15 has a larger number of acceleration units. As can be seen from the illustration, the last acceleration unit of the acceleration device 802 is connected to the first acceleration unit. For an acceleration device thus configured, the total hop count is the number of nodes plus one, that is, the number of acceleration units plus one.
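This ring-of-units rule can be stated as a one-line formula; the helper below is a hypothetical illustration of it, checked against the 4-unit cluster of fig. 14.

```python
def ring_total_hops(num_units: int) -> int:
    """Total Hop count for a ring of acceleration units: units + 1."""
    return num_units + 1

assert ring_total_hops(4) == 5  # matches the 16-card cluster of fig. 14
```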
The acceleration device including a plurality of acceleration units is exemplarily described above with reference to fig. 13 to 15, and according to the technical solution of the present disclosure, an acceleration device that may include a plurality of the aforementioned acceleration assemblies is also provided, which will be described in detail below with reference to a plurality of embodiments.
Fig. 16 is a schematic view of an acceleration device according to still another embodiment of the present disclosure; the acceleration system of the present disclosure may be implemented as one acceleration device. The acceleration device 900 may include m acceleration assemblies, where each acceleration assembly includes spare external ports in addition to the external ports required to connect the acceleration units inside the assembly, and the acceleration assemblies are connected to each other through these spare external ports. For example, the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B1 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B2, the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B2 may be connected to the external port of accelerator card MC1 of acceleration unit A1 in acceleration assembly B3, and so on, interconnecting the multiple acceleration assemblies. It is to be understood that the acceleration device shown in fig. 16 is exemplary and not limiting; for example, the plurality of acceleration assemblies may be identical or different in structure. Likewise, the manner of connection between different acceleration assemblies through the spare external ports is not limited to that shown in fig. 16 and may include other manners. For ease of understanding, an exemplary description follows in conjunction with figs. 17-25.
Based on the acceleration device provided in fig. 16, and referring further to fig. 17, fig. 17 is a schematic diagram of a network topology corresponding to the acceleration device in yet another embodiment. The acceleration device 901 may include two acceleration components; the acceleration component B1 may include four acceleration units, and the acceleration component B2 may include four acceleration units. The first acceleration unit in the acceleration component B1 is connected to the first acceleration unit in the acceleration component B2, and the last acceleration unit in the acceleration component B1 is connected to the last acceleration unit in the acceleration component B2. The total hop count for this network topology is 9. It will be understood by those skilled in the art that the network structure formed by the acceleration units in each acceleration component in fig. 17 is logical, and the physical arrangement of the acceleration units in practical applications can be adjusted as required. The number of acceleration units in each acceleration component is not limited to the four shown in the figure and may be more or fewer as needed, for example six, eight, etc.
Based on the acceleration device provided in fig. 16, and referring further to fig. 18, fig. 18 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 902 may include four acceleration components, i.e., acceleration components B1, B2, B3, and B4. Each of the four acceleration components may include two acceleration units A1 and A2, and each acceleration component may be interconnected with one of the acceleration units A1 and A2 of another acceleration component through one of its own acceleration units A1 and A2. For example, the acceleration unit A1 in the acceleration component B1 is connected to the acceleration unit A1 in the acceleration component B2, the acceleration unit A1 in the acceleration component B2 is connected to the acceleration unit A1 in the acceleration component B3, and the acceleration unit A1 in the acceleration component B3 is connected to the acceleration unit A1 in the acceleration component B4, the connections being made through the external ports of the acceleration units.
It should be noted that the acceleration components may be connected in various manners other than the one shown in fig. 18. For example, the connection between the acceleration components may specifically include: the acceleration unit A1 or A2 in the acceleration component B1 is connected to the acceleration unit A1 or A2 in the acceleration component B2, the acceleration unit A1 or A2 in the acceleration component B2 is connected to the acceleration unit A1 or A2 in the acceleration component B3, and the acceleration unit A1 or A2 in the acceleration component B3 is connected to the acceleration unit A1 or A2 in the acceleration component B4.
Based on the acceleration device provided in fig. 18, and referring further to fig. 19, fig. 19 is a schematic diagram of an acceleration device according to another embodiment of the present disclosure. In the acceleration device 903 shown in fig. 19, each acceleration component may be connected, through one of its first and second acceleration units and over two paths, to one of the first and second acceleration units of another acceleration component. For example, the first acceleration unit (e.g., acceleration unit A1) of the illustrated acceleration component B1 and the first acceleration unit (e.g., acceleration unit A1) of the acceleration component B2 may be connected by two paths, the acceleration unit A1 of the acceleration component B2 and the acceleration unit A1 of the acceleration component B3 are connected by two paths, and the acceleration unit A1 of the acceleration component B3 and the acceleration unit A1 of the acceleration component B4 are connected by two paths.
Note that although fig. 19 shows two connecting paths, two or more paths may actually be included. The connection between the acceleration components may also take forms other than the one shown in fig. 19; for example, the acceleration unit A1 or A2 in the acceleration component B1 may be connected to the acceleration unit A1 or A2 in the acceleration component B2 by two paths, the acceleration unit A1 or A2 in the acceleration component B2 may be connected to the acceleration unit A1 or A2 in the acceleration component B3 by two paths, and the acceleration unit A1 or A2 in the acceleration component B3 may be connected to the acceleration unit A1 or A2 in the acceleration component B4 by two paths.
Based on the acceleration device provided in fig. 16, and referring further to fig. 20, fig. 20 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 904 includes four acceleration components, i.e., acceleration components B1, B2, B3, and B4; each acceleration component includes two acceleration units, and each acceleration unit includes two pairs of accelerator cards. In each acceleration unit, MC0 and MC1 form the first pair of accelerator cards, and MC2 and MC3 form the second pair. The second pair of accelerator cards of the acceleration unit A1 of the acceleration component B1 is connected to the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B2; the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B2 is connected to the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B3; the second pair of accelerator cards of the acceleration unit A2 of the acceleration component B3 is connected to the second pair of accelerator cards of the acceleration unit A1 of the acceleration component B4; and the first pair of accelerator cards of the acceleration unit A1 of the acceleration component B4 is connected to the first pair of accelerator cards of the acceleration unit A2 of the acceleration component B1.
Referring to fig. 21, fig. 21 is a schematic diagram of a network topology of another acceleration device. The acceleration device 905 shown in fig. 21 is an embodiment of the acceleration device 904 shown in fig. 20, so the description above regarding the acceleration device 904 also applies to the acceleration device 905 in fig. 21. As shown in fig. 21, each acceleration component of the acceleration device 905 may form a hybrid three-dimensional network unit, and the interconnection relationship inside each hybrid three-dimensional network unit may be as shown in the figure, implementing an interconnection of 8 nodes and 32 cards in the acceleration device 905. The four acceleration components can be interconnected across multiple cards and multiple nodes through QSFP-DD interfaces and cables to form a matrix network topology.
Specifically, in this embodiment, ports 0 of the accelerator cards MC2 and MC3 of the upper node of the acceleration component B1 may be connected to the accelerator cards MC2 and MC3 of the lower node of the acceleration component B2; ports 3 of MC0 and MC1 of the lower node of the acceleration component B2 may be connected to MC0 and MC1 of the upper node of the acceleration component B3; ports 0 of MC2 and MC3 of the lower node of the acceleration component B3 may be connected to MC2 and MC3 of the upper node of the acceleration component B4; and ports 3 of MC0 and MC1 of the upper node of the acceleration component B4 may be connected to MC0 and MC1 of the lower node of the acceleration component B1, respectively. The interconnection between the hybrid three-dimensional networks arranged in this way can form two bidirectional ring structures (as described above in conjunction with figs. 5b, 5c, 12b, and 12c), which offers advantages such as better reliability and safety, is suitable for deep learning training, and has high operational efficiency. For the acceleration device 905, the matrix network topology composed of 8 nodes has a total hop count of 11.
Further, as shown in fig. 21, the first pair of accelerator cards and the second pair of accelerator cards in different acceleration units of the same acceleration component may be indirectly connected. For example, the accelerator cards MC0 and MC1 of the upper acceleration unit in the acceleration component B1 are indirectly connected with the accelerator cards MC2 and MC3 of the lower acceleration unit.
Based on the network topology of fig. 21, the matrix network topology can be taken as a basic unit and further expanded into a larger network topology; fig. 22 is a schematic diagram of the unlimited expansion of the matrix network topology of the acceleration device. As shown in fig. 22, the acceleration device 906 may include a plurality of acceleration components, each acceleration component (shown as a block in the figure) may include a plurality of acceleration units (the internal structure is not shown; reference may be made to the acceleration component structure in fig. 21), and each acceleration unit may include, for example, four accelerator cards as in the figure. With these interconnections, the matrix network topology can in theory be expanded without limit.
Based on the acceleration device provided in fig. 16, and referring further to fig. 23, fig. 23 is a schematic diagram of an acceleration device in yet another embodiment of the present disclosure. The acceleration device 908 may include m (m ≥ 2) acceleration components, each acceleration component may include n (n ≥ 2) acceleration units, and the m acceleration components may be connected in a ring. The acceleration unit An of the acceleration component B1 may be connected to the acceleration unit A1 of the acceleration component B2, the acceleration unit An of the acceleration component B2 may be connected to the acceleration unit A1 of the acceleration component B3, and so on; the acceleration unit An of the acceleration component Bm may be connected to the acceleration unit A1 of the acceleration component B1, so that the m acceleration components are connected end to end in a ring.
Referring to fig. 24 on the basis of fig. 23, fig. 24 is a schematic diagram of a network topology of another acceleration device. The acceleration device 909 may include 6 acceleration components, each acceleration component may include two acceleration units, and the second acceleration unit of each acceleration component may be connected to the first acceleration unit of the next acceleration component, thereby forming an interconnection of 12 nodes and 48 cards and a larger matrix network topology, in which the total hop count is 13.
Referring to fig. 25 on the basis of fig. 24, fig. 25 is a schematic diagram of a network topology of another acceleration device. The acceleration device 910 includes 8 acceleration components, each acceleration component includes two acceleration units, and the second acceleration unit of each acceleration component can be connected to the first acceleration unit of the next acceleration component, thereby forming an interconnection of 16 nodes and 64 cards and a still larger matrix network topology, in which the total hop count is 17.
On the basis of fig. 25, the topology can be extended longitudinally without limit, forming very large scale matrix networks such as 20 nodes with 80 cards, 24 nodes with 96 cards, and so on. In theory the extension is unbounded, and the total hop count is the number of nodes plus one. By optimizing the interconnection among the nodes, the latency of the whole system can be minimized, and the system's real-time requirements when processing massive data can be met to the greatest extent.
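For orientation only, this scaling rule can be captured in a few lines of Python. The sketch below is purely illustrative; the helper name and the assumption of four accelerator cards per node are ours, not part of the disclosure:

```python
def matrix_topology_stats(num_nodes, cards_per_node=4):
    """Illustrative only: the scaling rule stated above, assuming each
    node (acceleration unit) holds four accelerator cards and the total
    hop count of the extended matrix network is the node count plus one."""
    return {
        "cards": num_nodes * cards_per_node,
        "total_hops": num_nodes + 1,
    }

# 12 nodes -> 48 cards, 13 hops; 16 nodes -> 64 cards, 17 hops; etc.
for n in (12, 16, 20, 24):
    print(n, matrix_topology_stats(n))
```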
While the acceleration device including a plurality of acceleration components has been described above in connection with figs. 16-25, it will be understood by those skilled in the art that the above description is illustrative and not limiting; for example, the number and structure of the acceleration components, and the connection relationships between them, may be adjusted as desired. One skilled in the art may also combine the above-described embodiments as desired to form an acceleration device, which remains within the scope of the present disclosure.
In addition, it should be noted that the accelerator card matrix, the fully-connected square network (topology), the hybrid three-dimensional network (topology), the matrix network (topology), and the like described in the present disclosure are all logical, and the specific layout form may be adjusted as needed.
The topologies disclosed in this disclosure may also be used to perform reduction operations on data. A reduction operation can be performed among the accelerator cards within each acceleration unit, within each acceleration component, and across the acceleration device. The specific procedure may be as follows.
Taking the reduce-sum operation as an example, the process performed in an acceleration unit may include: transferring the data stored in the first accelerator card to a second accelerator card, and adding, in the second accelerator card, the data originally stored there to the data received from the first accelerator card; the addition result in the second accelerator card is then transferred to a third accelerator card, where another addition is performed, and so on, until the data stored in all accelerator cards have been added together and each accelerator card has received the final result.
Taking the acceleration unit shown in fig. 4 as an example, the accelerator card MC0 stores data (0,0), the accelerator card MC1 stores data (1,2), the accelerator card MC2 stores data (3,1), and the accelerator card MC3 stores data (2,4). The data (0,0) in the accelerator card MC0 can be transferred to the accelerator card MC1, and the result (1,2) is obtained after addition; next, the result (1,2) is passed to the accelerator card MC2, yielding the next result (4,3); the result (4,3) is then passed to the accelerator card MC3 to obtain the final result (6,7).
Thereafter, in the reduction operations of the present disclosure, the final result (6,7) continues to be passed to each of the accelerator cards MC0, MC1, MC2, and MC3, so that the data (6,7) is stored in all accelerator cards, thereby completing the reduction operation within one acceleration unit.
The acceleration unit shown in fig. 4 can form two independent rings, and each ring can perform the reduction operation on half of the data, thereby increasing the operation speed and improving efficiency.
In addition, when performing the reduction operation, the acceleration unit can also carry out concurrent computation across multiple accelerator cards, further increasing the operation speed. For example, the accelerator card MC0 stores data (0,0), the accelerator card MC1 stores data (1,2), the accelerator card MC2 stores data (3,1), and the accelerator card MC3 stores data (2,4). Part of the data (0) in the accelerator card MC0 can be transferred to the accelerator card MC1 to obtain the result (1) after addition, while part of the data (2) in the accelerator card MC1 is synchronously transferred to the accelerator card MC2 to obtain the result (3) after addition, so that the accelerator cards MC1 and MC2 operate concurrently; continuing in this way, the whole reduction operation is completed.
The concurrent computation may further include performing additions in groups of acceleration units and then reducing the operation result of one group with that of another group. For example, with the accelerator card MC0 storing data (0,0), the accelerator card MC1 storing data (1,2), the accelerator card MC2 storing data (3,1), and the accelerator card MC3 storing data (2,4), the data in the accelerator card MC0 may be passed to the accelerator card MC1 to obtain a first group result (1,2); synchronously or asynchronously, the data in the accelerator card MC2 may be passed to the accelerator card MC3 to obtain a second group result (5,5). The first group result and the second group result are then combined to obtain the final reduction result (6,7).
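The pass-and-add procedure described above can be illustrated with a minimal Python simulation. This is only a sketch of the logic on plain lists, not the disclosed hardware mechanism; the function name and data layout are assumptions made for illustration:

```python
# Minimal simulation of the reduce-sum pass described above, assuming a
# list of per-card data vectors ordered MC0..MC3 along one ring.
def ring_reduce_sum(card_data):
    n = len(card_data)
    acc = list(card_data[0])              # running result, starts at MC0
    for i in range(1, n):                 # pass to MC1, MC2, ..., add locally
        acc = [a + b for a, b in zip(acc, card_data[i])]
    # broadcast phase: every card ends up holding the final result
    return [list(acc) for _ in range(n)]

cards = [(0, 0), (1, 2), (3, 1), (2, 4)]  # MC0..MC3 as in the example
print(ring_reduce_sum(cards))             # every card holds [6, 7]
```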
Similarly, in addition to being performed within one acceleration unit, the reduction operation may also be performed within an acceleration component or across the acceleration device. It should be understood that the acceleration device may itself be regarded as a set of acceleration components connected end to end.
When the reduction operation is performed within an acceleration component or the acceleration device, the following steps may be included: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result in each acceleration unit; and performing a second reduction operation on the first reduction results of the multiple acceleration units to obtain a second reduction result.
Again taking the reduce-sum operation as an example, the first step has been described above: for an acceleration device including a plurality of acceleration units, a local reduction operation may first be performed within each acceleration unit; after the reduction within each acceleration unit is completed, the accelerator cards in the same acceleration unit hold the result of this local reduction, referred to as the first reduction result.
Next, the first reduction results of all acceleration units may be passed and added between adjacent acceleration units. Similarly to the reduction performed within one acceleration unit, the first acceleration unit passes its first reduction result to the second acceleration unit; after the addition is performed in the accelerator cards of the second acceleration unit, the result is passed on and added again. After the last addition, the final result is propagated to each acceleration unit.
It should be noted that, since the acceleration components above are not necessarily connected end to end, when propagating the final result to each acceleration unit, the result may be propagated in reverse rather than in the circular fashion that would be possible if the acceleration units were connected end to end. The technical solution of the present disclosure places no particular limitation on how the final result is propagated.
Still further, according to an embodiment of the present disclosure, the acceleration device may be further configured to perform a reduction operation comprising: performing a first reduction operation on the data in the accelerator cards of the same acceleration unit to obtain a first reduction result; performing an intermediate reduction operation on the first reduction results of the acceleration units of the same acceleration component to obtain an intermediate reduction result; and performing a second reduction operation on the intermediate reduction results of the plurality of acceleration components to obtain a second reduction result.
In this embodiment, the reduction operation may first be performed within each acceleration unit, which has already been described above and will not be repeated here.
Then, a reduction operation can be performed within each acceleration component, so that each accelerator card in each acceleration component obtains the local reduction result of that component; finally, taking the acceleration component as the unit, a reduction operation is performed across the multiple acceleration components, so that each accelerator card obtains the global reduction result of the acceleration device.
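The three-level procedure (unit, then component, then device) can likewise be sketched in Python. The nesting of plain lists below, standing for cards, units and components, is our own illustrative assumption, not the disclosed data layout:

```python
# Hedged sketch of the three-level reduction: the device is modeled as
# components -> units -> cards; each level sums the results of the level
# below, mirroring the first, intermediate and second reduction operations.
def reduce_level(parts):
    total = list(parts[0])
    for p in parts[1:]:
        total = [a + b for a, b in zip(total, p)]
    return total

def hierarchical_reduce(device):
    unit_results = [[reduce_level(unit) for unit in comp] for comp in device]
    comp_results = [reduce_level(units) for units in unit_results]
    return reduce_level(comp_results)   # result then broadcast to all cards

# two components, each with two units of two cards holding 2-vectors
device = [[[[1, 0], [0, 1]], [[2, 2], [1, 1]]],
          [[[0, 3], [3, 0]], [[1, 1], [2, 2]]]]
print(hierarchical_reduce(device))      # [10, 10]
```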
Various embodiments of the accelerator card matrix are described above, and more specific methods of performing matrix multiplication operations based on the accelerator card matrix are described below.
Operations S2910 and S2920 in fig. 29 need not be performed in any particular order. Each sub-data matrix may be a small matrix including a plurality of elements or may be a single data element; the term sub-data matrix should therefore be understood broadly and not merely as a matrix form.
After the first data matrix is split into first sub-data matrices and the second data matrix is split into second sub-data matrices, the sub-data matrices may be operated on by the accelerator cards of the accelerator card matrix. It should be understood that the grid of sub-data matrices should be no larger than the accelerator card matrix. For example, sub-data matrices arranged on a 3 × 3 grid (i.e., a large data matrix split into 3 × 3 sub-data matrices) may be run on accelerator card matrices of size 3 × 3, 3 × 4, 4 × 3, 4 × 4, etc.; that is, the sub-data matrices need not occupy all of the accelerator cards in the accelerator card matrix, and some accelerator cards may be used for purposes other than matrix multiplication.
Preferably, the accelerator card matrix is a square accelerator card matrix, and each accelerator card in the square accelerator card matrix may store one first sub-data matrix and one second sub-data matrix. In other words, the grid of sub-data matrices may preferably match the size of the accelerator card matrix; for example, for an accelerator card matrix of size 3 × 3, a data matrix may be split into a 3 × 3 grid of sub-data matrices. In this way, no accelerator card resources are wasted, the operation time of each accelerator card is reduced, and the overall operation efficiency of the accelerator card matrix is improved. Conversely, if a data matrix has already been split into, for example, 3 × 3 sub-data matrices, an accelerator card matrix of corresponding size can be used for the matrix multiplication. Since accelerator cards can be selected and combined freely, a suitable topology can be chosen according to the split of the sub-data matrices to meet the requirements of the matrix multiplication.
Furthermore, after the first sub-data matrices and the second sub-data matrices are stored in the corresponding accelerator cards, multiple rounds of operation may be performed on the accelerator cards. Each round produces a multiplication result, and the multiplication results of the rounds are added so as to obtain the overall multiplication result of the first data matrix and the second data matrix. Between rounds, the sub-data matrices in the accelerator cards can be shifted or transferred, so that the multiplication results between different sub-data matrices are obtained according to the rules of matrix multiplication.
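As a sketch of the splitting step, and assuming NumPy arrays with evenly divisible dimensions (both assumptions are ours, not requirements of the disclosure), a data matrix can be cut into a grid of sub-data matrices as follows:

```python
import numpy as np

# Hedged sketch: split a data matrix into a grid of sub-data matrices,
# one per accelerator card of an (L x N) logical accelerator card matrix.
def split_into_blocks(matrix, grid_rows, grid_cols):
    # vsplit/hsplit require evenly divisible dimensions in this sketch
    return [np.hsplit(row_band, grid_cols)
            for row_band in np.vsplit(matrix, grid_rows)]

A = np.arange(36).reshape(6, 6)
blocks = split_into_blocks(A, 3, 3)   # 3 x 3 grid of (2, 2) sub-matrices
print(blocks[1][2])                   # block A12 in the notation used below
```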
Fig. 30 illustrates an example of a multiplication operation performed by a first data matrix and a second data matrix according to an embodiment of the present disclosure.
As shown in fig. 30, the first data matrix is denoted by A, where the element Aij represents the position of each sub-data matrix after the first data matrix A is split; it is described herein as a 3 × 3 matrix. The second data matrix is denoted by B, where the element Bij represents the position of each sub-data matrix after the second data matrix B is split; the second data matrix is also of size 3 × 3. The result matrix of the multiplication of the first data matrix A and the second data matrix B is denoted by C, where the element Cij denotes the position of each element in the result matrix.
According to the rule of matrix multiplication, the result matrix C is expressed as follows:
C00 C01 C02
C10 C11 C12
C20 C21 C22
matrix 1
Matrix 1 may be expanded, by the rule of matrix multiplication, as the following matrix 2:
A00*B00+A01*B10+A02*B20 A00*B01+A01*B11+A02*B21 A00*B02+A01*B12+A02*B22
A10*B00+A11*B10+A12*B20 A10*B01+A11*B11+A12*B21 A10*B02+A11*B12+A12*B22
A20*B00+A21*B10+A22*B20 A20*B01+A21*B11+A22*B21 A20*B02+A21*B12+A22*B22
Matrix 2
The matrix 2 may in turn be represented as the sum of the following matrices 3-5
A00*B00 A00*B01 A00*B02
A10*B00 A10*B01 A10*B02
A20*B00 A20*B01 A20*B02
Matrix 3
A01*B10 A01*B11 A01*B12
A11*B10 A11*B11 A11*B12
A21*B10 A21*B11 A21*B12
Matrix 4
A02*B20 A02*B21 A02*B22
A12*B20 A12*B21 A12*B22
A22*B20 A22*B21 A22*B22
Matrix 5
Therefore, after the results shown in matrices 3-5 are calculated in the accelerator card matrix, they are added to obtain the result matrix C.
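This decomposition can be checked numerically. The following sketch, which assumes NumPy and our own choice of random 6 × 6 matrices split into 3 × 3 grids of 2 × 2 blocks, verifies that the sum of the three partial products (matrix 3 + matrix 4 + matrix 5) equals the full product:

```python
import numpy as np

# Quick numerical check of the decomposition above: with A and B split
# into 3 x 3 grids of blocks, C equals the sum of the partial products
# Aik @ Bkj over k = 0, 1, 2.
rng = np.random.default_rng(0)
A, B = rng.integers(0, 5, (6, 6)), rng.integers(0, 5, (6, 6))
Ab = [np.hsplit(r, 3) for r in np.vsplit(A, 3)]
Bb = [np.hsplit(r, 3) for r in np.vsplit(B, 3)]

C = np.zeros((6, 6), dtype=A.dtype)
for i in range(3):
    for j in range(3):
        block = sum(Ab[i][k] @ Bb[k][j] for k in range(3))
        C[2*i:2*i+2, 2*j:2*j+2] = block

assert np.array_equal(C, A @ B)   # block result matches the full product
```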
According to the description above, a 3 × 3 accelerator card matrix may be used. Fig. 31 shows a schematic diagram of an exemplary 3 × 3 accelerator card matrix. In fig. 31, row 0 includes the accelerator cards P00-P02; row 1 includes the accelerator cards P10-P12; row 2 includes the accelerator cards P20-P22. The 9 sub-data matrices of the data matrix A may be stored in the 9 accelerator cards respectively, and the 9 sub-data matrices of the data matrix B may likewise be stored in the 9 accelerator cards respectively; each accelerator card performs its own multiplication, and the results of the successive multiplications are added to obtain the final matrix multiplication result.
According to one embodiment of the present disclosure, the accelerator cards of each row of the accelerator card matrix are communicably connected end to end, and the accelerator cards of each column of the accelerator card matrix are communicably connected end to end. It should be understood that the end-to-end communicative connection of the accelerator cards in the accelerator card matrix is not strictly necessary; its advantage is that data can be transferred directly between the end accelerator cards without passing through other accelerator cards, thereby improving the efficiency of data transfer. In fact, the topology of the accelerator card matrix shown in fig. 31 is only logical; the actual physical layout need not be as shown in fig. 31, and the position of each accelerator card can be adjusted according to physical space limitations.
According to an embodiment of the present disclosure, performing the multiplication of the data matrices by passing the first sub-data matrices and the second sub-data matrices in the accelerator card matrix may include: on each pass, multiplying, at each accelerator card, the first sub-data matrix and the second sub-data matrix stored there to obtain a local multiplication result; and adding the local multiplication results obtained over the multiple passes to obtain a global multiplication result.
To enable the accelerator card matrix to perform matrix multiplication, the sub-data matrices in the accelerator card matrix need to be shifted, that is, transferred from one accelerator card to another. Shifting the sub-data matrices causes different input sub-data matrices (that is, different first and second sub-data matrices) to appear at each accelerator card, and operating on these different inputs yields different multiplication results, thereby realizing the required matrix multiplication. For convenience of description, the result of each multiplication performed by an accelerator card is referred to as a "local multiplication result"; the final matrix multiplication result, obtained by adding the local multiplication results, is referred to herein as the "global multiplication result".
FIG. 32a shows a schematic diagram of the sub-data matrix passing on the logical level.
According to an embodiment of the present disclosure, multiplying the stored first sub-data matrices Aij and second sub-data matrices Bij to obtain the local multiplication results includes: passing each first sub-data matrix Aij along its row of accelerator cards to obtain a first passed sub-data matrix; passing each second sub-data matrix Bij along its column of accelerator cards to obtain a second passed sub-data matrix; and, on each pass, multiplying the first passed sub-data matrix and the second passed sub-data matrix at each accelerator card to obtain a local multiplication result.
To facilitate showing the position change of each sub-data matrix, the first sub-data matrices A00, A10 and A20 and the second sub-data matrices B00, B01 and B02 are highlighted.
As shown in fig. 32a, a first data transfer operation is performed on the first sub-data matrices Aij. Through this first transfer, the accelerator cards storing the first sub-data matrices Aij change as follows:
in the first pass, each first sub-data matrix Aij is transferred, within its row, from the current accelerator card to a target accelerator card, wherein, in the chosen direction, the difference between the bit number of the target accelerator card and that of the current accelerator card equals the row number of the first sub-data matrix Aij.
Taking the first sub-data matrices Aij shown in fig. 32a as an example, the first sub-data matrices A00, A01 and A02 are in row 0 of the first data matrix, so these three sub-matrices remain unchanged during the first pass, i.e. the first sub-data matrices A00, A01 and A02 remain stored in the accelerator cards P00, P01 and P02, respectively.
The first sub-matrices A10, A11 and A12 are in row 1 of the first data matrix, so according to the transfer rule of the present disclosure they need to move by 1 accelerator card. The first sub-data matrices A10, A11 and A12 therefore each move one accelerator card to the left, and after this first shift the first sub-data matrices A11, A12 and A10 are stored in the accelerator cards P10, P11 and P12, respectively.
The first sub-matrices A20, A21 and A22 are in row 2 of the first data matrix, so according to the transfer rule of the present disclosure they need to move by 2 accelerator cards. The first sub-matrices A20, A21 and A22 therefore each move two accelerator cards to the left, and after this first shift the first sub-matrices A22, A20 and A21 are stored in the accelerator cards P20, P21 and P22, respectively.
As further shown in fig. 32a, a first data transfer operation is performed on the second sub-data matrices Bij. Through this first transfer, the accelerator cards storing the second sub-data matrices Bij change as follows:
in the first pass, each second sub-data matrix Bij is transferred, within its column, from the current accelerator card to a target accelerator card, wherein, in the chosen direction, the difference between the bit number of the target accelerator card and that of the current accelerator card equals the column number of the second sub-data matrix Bij.
Taking the second sub-data matrices Bij shown in fig. 32a as an example, the second sub-data matrices B00, B10 and B20 are in column 0 of the second data matrix, so these three sub-matrices remain unchanged during the first pass, i.e. the second sub-data matrices B00, B10 and B20 remain stored in the accelerator cards P00, P10 and P20, respectively.
The second sub-matrices B01, B11 and B21 are in column 1 of the second data matrix, so according to the transfer rule of the present disclosure they need to move by 1 accelerator card. The second sub-matrices B01, B11 and B21 therefore each move one accelerator card upwards, and after this first shift the second sub-matrices B11, B21 and B01 are stored in the accelerator cards P01, P11 and P21, respectively.
The second sub-matrices B02, B12 and B22 are in column 2 of the second data matrix, so according to the transfer rule of the present disclosure they need to move by 2 accelerator cards. The second sub-matrices B02, B12 and B22 therefore each move two accelerator cards upwards, and after this first shift the second sub-matrices B22, B02 and B12 are stored in the accelerator cards P02, P12 and P22, respectively.
After the first pass, the first sub-matrix Aij and the second sub-matrix Bij stored in each accelerator card may be multiplied, so as to obtain the first local multiplication results shown in fig. 32a.
It should be understood that the "number of bits" of the sub-data transfer is from the sub-data matrix itself, but for the hardware accelerator card, it does not necessarily mean the corresponding number of transfers, and it can realize such transfer "number of bits" through the shortest path.
Fig. 32b shows a schematic diagram of the transmission of the first sub-data matrix Aij and the second sub-data matrix Bij in the accelerator card matrix. In fig. 32b, solid lines indicate the transfer routes of the first sub-data matrix Aij, and dotted lines indicate the transfer routes of the second sub-data matrix Bij.
As shown in fig. 32b, for the first sub-data matrices, no data transfer occurs in the row direction for the accelerator cards P00, P01 and P02. The accelerator cards P10, P11 and P12 transfer data in the row direction: the first sub-matrix A10 is transferred from the accelerator card P10 to the accelerator card P12, the first sub-matrix A11 from the accelerator card P11 to the accelerator card P10, and the first sub-matrix A12 from the accelerator card P12 to the accelerator card P11. Data transfers also occur in the row direction for the accelerator cards P20, P21 and P22: the first sub-matrix A20 is transferred from the accelerator card P20 to the accelerator card P21, the first sub-matrix A21 from the accelerator card P21 to the accelerator card P22, and the first sub-matrix A22 from the accelerator card P22 to the accelerator card P20. For the data in the row 2 accelerator cards there are two equivalent ways of transferring, namely two "bits" to the left or one "bit" to the right; in the present disclosure it is preferred that the data be transferred through the shortest communication path, which helps to improve the efficiency of data transfer.
As also shown in fig. 32b, for the second sub-data matrices, no data transfer occurs in the column direction for the accelerator cards P00, P10 and P20. The accelerator cards P01, P11 and P21 transfer data in the column direction: the second sub-data matrix B01 is transferred from the accelerator card P01 to the accelerator card P21, the second sub-data matrix B11 from the accelerator card P11 to the accelerator card P01, and the second sub-data matrix B21 from the accelerator card P21 to the accelerator card P11. Data transfers also occur in the column direction for the accelerator cards P02, P12 and P22: the second sub-data matrix B02 is transferred from the accelerator card P02 to the accelerator card P12, the second sub-data matrix B12 from the accelerator card P12 to the accelerator card P22, and the second sub-data matrix B22 from the accelerator card P22 to the accelerator card P02. For the data in the column 2 accelerator cards there are two equivalent ways of transferring, namely two "bits" up or one "bit" down; again, in the present disclosure it is preferred that the data be transferred through the shortest communication path.
Therefore, the difference in bit number between the target accelerator card and the current accelerator card described above refers to the logical relationship between the accelerator card where the sub-data matrix originally resides and the target accelerator card; at the actual hardware and physical level, the sub-data matrix can be transferred through any appropriate or shortest communication path.
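At the logical level, the first pass described above is the initial alignment of a Cannon-style block algorithm: row i of the A blocks rotates left by i cards, and column j of the B blocks rotates up by j cards. The following sketch (list rotations standing in for card-to-card transfers, an assumption made purely for illustration) reproduces the placements of figs. 32a and 32b:

```python
# Hedged sketch of the first pass described above (a Cannon-style initial
# alignment): row i of the A blocks rotates left by i positions, and
# column j of the B blocks rotates up by j positions.
def skew(blocks, by_rows):
    n = len(blocks)
    if by_rows:   # rotate each row i left by i
        return [row[i:] + row[:i] for i, row in enumerate(blocks)]
    # rotate each column j up by j
    return [[blocks[(i + j) % n][j] for j in range(n)] for i in range(n)]

A = [["A00", "A01", "A02"], ["A10", "A11", "A12"], ["A20", "A21", "A22"]]
B = [["B00", "B01", "B02"], ["B10", "B11", "B12"], ["B20", "B21", "B22"]]
print(skew(A, by_rows=True))   # row 1 holds A11 A12 A10; row 2 holds A22 A20 A21
print(skew(B, by_rows=False))  # column 1 holds B11 B21 B01; column 2 B22 B02 B12
```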
After the first pass is completed, each subsequent pass moves the sub-data matrices by one bit at a time.
FIGS. 33a and 33b are diagrams illustrating the transfer of sub-data matrices on a logical level; fig. 33c shows a schematic diagram illustrating the transmission of the first sub-data matrix Aij and the second sub-data matrix Bij in the accelerator card matrix.
As shown in figs. 33a and 33c, the first sub-data matrices and the second sub-data matrices are passed again. Each first sub-data matrix moves one bit to the left, so that the first sub-data matrices A01, A02 and A00 are stored in the accelerator cards P00, P01 and P02, respectively, the first sub-data matrices A12, A10 and A11 are stored in the accelerator cards P10, P11 and P12, respectively, and the first sub-data matrices A20, A21 and A22 are stored in the accelerator cards P20, P21 and P22, respectively. Each second sub-data matrix moves one bit upwards, so that the second sub-data matrices B10, B20 and B00 are stored in the accelerator cards P00, P10 and P20, respectively, the second sub-data matrices B21, B01 and B11 are stored in the accelerator cards P01, P11 and P21, respectively, and the second sub-data matrices B02, B12 and B22 are stored in the accelerator cards P02, P12 and P22, respectively. After this second pass, the sub-data matrices stored in each accelerator card are multiplied, yielding the second local multiplication results.
As shown in figs. 33b and 33c, the first sub-data matrices and the second sub-data matrices are passed once more. Each first sub-data matrix moves one bit to the left, so that the first sub-data matrices A02, A00 and A01 are stored in the accelerator cards P00, P01 and P02, respectively, the first sub-data matrices A10, A11 and A12 are stored in the accelerator cards P10, P11 and P12, respectively, and the first sub-data matrices A21, A22 and A20 are stored in the accelerator cards P20, P21 and P22, respectively. Each second sub-data matrix moves one bit upwards, so that the second sub-data matrices B20, B00 and B10 are stored in the accelerator cards P00, P10 and P20, respectively, the second sub-data matrices B01, B11 and B21 are stored in the accelerator cards P01, P11 and P21, respectively, and the second sub-data matrices B12, B22 and B02 are stored in the accelerator cards P02, P12 and P22, respectively. After this third pass, the sub-data matrices stored in each accelerator card are multiplied, yielding the third local multiplication results.
The first, second and third local multiplication results obtained in this way are added to obtain the final global multiplication result.
Regarding the addition of these local multiplication results, according to one embodiment of the present disclosure, adding the local multiplication results obtained over multiple passes to obtain the global multiplication result may include: after each pass, adding the local multiplication result obtained in the current pass to the accumulated result of the previous passes, until the passes are finished.
In this embodiment, each time a local multiplication result is calculated, it is immediately added to the previously accumulated result; there is no need to wait until all local multiplication results have been calculated before adding them, which helps to reduce the use of storage space in the accelerator cards and improves storage utilization. Of course, according to another embodiment of the present disclosure, the local multiplication results obtained in each pass may instead be stored, with the addition performed after all of them have been obtained.
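Putting the initial alignment, the per-pass multiplications and the running accumulation together, the whole scheme can be simulated end to end. The sketch below is again illustrative only: NumPy arrays stand in for accelerator cards, and list rotations stand in for the card-to-card transfers; it checks itself against the full matrix product:

```python
import numpy as np

# Hedged end-to-end sketch of the transfer-and-multiply scheme above
# (Cannon-style): after the initial alignment, every pass multiplies the
# resident blocks, adds into a running local result, then rotates the A
# blocks one card left and the B blocks one card up.
def block_matrix_multiply(Ab, Bb):
    n = len(Ab)
    # initial alignment: row i of A left by i, column j of B up by j
    Ab = [row[i:] + row[:i] for i, row in enumerate(Ab)]
    Bb = [[Bb[(i + j) % n][j] for j in range(n)] for i in range(n)]
    C = [[Ab[i][j] @ Bb[i][j] for j in range(n)] for i in range(n)]
    for _ in range(n - 1):
        Ab = [row[1:] + row[:1] for row in Ab]                           # left by 1
        Bb = [[Bb[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # up by 1
        for i in range(n):
            for j in range(n):
                C[i][j] = C[i][j] + Ab[i][j] @ Bb[i][j]  # accumulate per pass
    return np.block(C)

rng = np.random.default_rng(1)
A, B = rng.standard_normal((6, 6)), rng.standard_normal((6, 6))
Ab = [np.hsplit(band, 3) for band in np.vsplit(A, 3)]
Bb = [np.hsplit(band, 3) for band in np.vsplit(B, 3)]
assert np.allclose(block_matrix_multiply(Ab, Bb), A @ B)
```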
The above description has been given by taking a data matrix of 3 × 3 scale as an example, and hereinafter, an exemplary description will be given by taking a data matrix of 4 × 4 scale as an example.
Figs. 34a and 34b show schematic diagrams of the first data matrix A, the second data matrix B and the accelerator card matrix P. To facilitate illustrating the position changes of the sub-data matrices, the first sub-data matrices A00, A10, A20 and A30 and the second sub-data matrices B00, B01, B02 and B03 are highlighted. Initially, the first sub-data matrices Aij and the second sub-data matrices Bij may be stored in the corresponding accelerator cards Pij.
Fig. 35a to 35d show schematic diagrams of the transfer of sub-data matrix at the logical level according to an embodiment of the present disclosure.
As shown in figs. 35a and 34b, for the first sub-data matrices, no data transfer occurs in the row direction for the accelerator cards P00, P01, P02 and P03. The accelerator cards P10, P11, P12 and P13 transfer data in the row direction with a shift of 1: the first sub-data matrix A10 is transferred from the accelerator card P10 to the accelerator card P13, A11 from P11 to P10, A12 from P12 to P11, and A13 from P13 to P12. The accelerator cards P20, P21, P22 and P23 transfer data in the row direction with a shift of 2: the first sub-matrix A20 is transferred from the accelerator card P20 to the accelerator card P22, A21 from P21 to P23, A22 from P22 to P20, and A23 from P23 to P21. The accelerator cards P30, P31, P32 and P33 transfer data in the row direction with a shift of 3: the first sub-data matrix A30 is transferred from the accelerator card P30 to the accelerator card P31, A31 from P31 to P32, A32 from P32 to P33, and A33 from P33 to P30.
As shown in figs. 35a and 34b, for the second sub-data matrices, no data transfer occurs in the column direction for the accelerator cards P00, P10, P20 and P30. The accelerator cards P01, P11, P21 and P31 transfer data in the column direction with a shift of 1: the second sub-data matrix B01 is transferred from the accelerator card P01 to the accelerator card P31, B11 from P11 to P01, B21 from P21 to P11, and B31 from P31 to P21. The accelerator cards P02, P12, P22 and P32 transfer data in the column direction with a shift of 2: the second sub-data matrix B02 is transferred from the accelerator card P02 to the accelerator card P22, B12 from P12 to P32, B22 from P22 to P02, and B32 from P32 to P12. The accelerator cards P03, P13, P23 and P33 transfer data in the column direction with a shift of 3: the second sub-data matrix B03 is transferred from the accelerator card P03 to the accelerator card P13, B13 from P13 to P23, B23 from P23 to P33, and B33 from P33 to P03.
It should be understood that the above-mentioned shifts are defined with respect to a specific direction: to the left for the first sub-data matrices and upwards for the second sub-data matrices. In addition, although fig. 34b does not show direct connections between the row 0 and row 2 accelerator cards, between the row 1 and row 3 accelerator cards, between the column 0 and column 2 accelerator cards, or between the column 1 and column 3 accelerator cards, in the technical solution of the present disclosure there may also be direct connections between these accelerator cards, so that data can, for example, be transferred from a row 0 accelerator card to a row 2 accelerator card in a single transfer. One skilled in the art may reduce the hop count or number of data transfers with any suitable topology.
In the examples shown in figs. 35b to 35d, the passes have a shift of 1; such passes have been described in detail above in connection with figs. 33a and 33b and are not described again here.
Therefore, in the technical solution of the present disclosure, multiple accelerator cards can be used to perform matrix multiplication. This mode of operation makes full use of the strong computing power of the accelerator cards, enables high-performance data operations, and can efficiently process massive data.
The present disclosure also provides an electronic device, comprising: one or more processors; and a memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described above.
The present disclosure also provides a computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, perform the method as described above.
The technical solution disclosed herein can be applied in the field of artificial intelligence and implemented in an artificial intelligence chip. The chip may exist alone or may be included in a computing device.
Fig. 26 is a schematic structural diagram of a combined processing device according to an embodiment of the present disclosure. As shown in the figure, the combined processing device 2600 includes the computing device 2602 described above, an interconnection interface 2604, and other processing devices 2606. The computing device according to the present disclosure interacts with the other processing devices to jointly complete the operations specified by the user.
The other processing devices include one or more general-purpose or special-purpose processors such as central processing units (CPUs), graphics processing units (GPUs), neural network processors, and the like; the number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the machine learning computation device and external data and control, performing data transfers and basic control such as starting and stopping the machine learning computation device; they may also cooperate with the machine learning computation device to complete computational tasks.
The interconnection interface is used to transfer data and control instructions between the computing device (including, for example, a machine learning computation device) and the other processing devices. The computing device obtains the required input data from the other processing devices and writes it to the storage on the computing device chip; it can obtain control instructions from the other processing devices and write them to a control cache on the computing device chip; it can also read the data in the storage module of the computing device and transmit it to the other processing devices.
Optionally, the architecture may further include a storage device 2608 connected to the computing device and the other processing devices, respectively. The storage device is used to store data of the computing device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of the computing device or the other processing devices.
The combined processing device can serve as an SoC (system on chip) for equipment such as mobile phones, robots, unmanned aerial vehicles, and video surveillance equipment, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption. In this case, the interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, a display, a mouse, a keyboard, a network card, or a WiFi interface.
In some embodiments, the present disclosure also discloses a chip packaging structure, which includes the above chip.
In some embodiments, the disclosure further discloses a board card comprising the above chip packaging structure. Referring to fig. 27, an exemplary board card is provided that may include, in addition to the chip 2702, other supporting components, including but not limited to: a memory device 2704, an interface device 2706 and a control device 2708.
The memory device is connected to the chip in the chip packaging structure through a bus and is used for storing data. The memory device may include multiple groups of memory cells 2710, each group connected to the chip through a bus. It is understood that each group of memory cells may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency: it allows data to be read on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of memory cells, and each group may include a plurality of DDR4 particles (chips). In one embodiment, the chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. In one embodiment, each group of memory cells includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel; DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each memory cell.
The interface device is electrically connected to the chip in the chip packaging structure and is used to implement data transfer between the chip and an external device 2712 (e.g., a server or a computer). For example, in one embodiment the interface device may be a standard PCIe interface, with the data to be processed transmitted from the server to the chip through the standard PCIe interface to complete the transfer. In another embodiment, the interface device may be another interface; the present disclosure does not limit the specific form of such other interfaces, so long as the interface unit can implement the transfer function. In addition, the computation results of the chip are transmitted back to the external device (e.g., a server) by the interface device.
The control device is electrically connected to the chip and is used to monitor the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The chip may include multiple processing chips, multiple processing cores, or multiple processing circuits and may drive multiple loads, so it can be in different working states such as multi-load and light-load. The control device can regulate the working states of the processing chips, processing cores and/or processing circuits in the chip.
In some embodiments, the present disclosure also discloses an electronic device or apparatus, which includes the above board card.
Electronic devices or apparatuses include data processing apparatuses, robots, computers, printers, scanners, tablets, smart terminals, cell phones, automobile data recorders, navigators, sensors, cameras, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance apparatus, a B-ultrasonic apparatus and/or an electrocardiograph.
It is noted that while for simplicity of explanation, the foregoing method embodiments have been described as a series of acts or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders and concurrently. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that acts and modules referred to are not necessarily required by the disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, optical, acoustic, magnetic or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software program module.
The integrated units, if implemented in the form of software program modules and sold or used as stand-alone products, may be stored in a computer readable memory. Based on such understanding, when the technical solution of the present disclosure can be embodied in the form of a software product, the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned memory comprises: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The above embodiments of the present disclosure are described in detail, and specific examples are applied herein to explain the principles and implementations of the present disclosure, and the description of the above embodiments is only used to help understand the method and its core idea of the present disclosure; meanwhile, for a person skilled in the art, based on the idea of the present disclosure, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present disclosure should not be construed as a limitation to the present disclosure.

Claims (10)

1. A method of performing a multiplication operation on data matrices in an accelerator card matrix, wherein the accelerator card matrix comprises M accelerator cards logically arranged as an accelerator card matrix of size L × N, L and N being integers not less than 2, adjacent accelerator cards being communicatively connected, and the data matrices comprising a first data matrix and a second data matrix; the method comprising:
splitting the first data matrix into a plurality of first sub-data matrices, and storing the plurality of first sub-data matrices in the plurality of accelerator cards respectively;
splitting the second data matrix into a plurality of second sub-data matrices, and storing the plurality of second sub-data matrices in the plurality of accelerator cards respectively; and
performing the multiplication operation on the data matrices by passing the first sub-data matrices and the second sub-data matrices within the accelerator card matrix.
2. The method of claim 1, wherein the accelerator cards of each row of the accelerator card matrix are communicatively connected end-to-end and the accelerator cards of each column of the accelerator card matrix are communicatively connected end-to-end.
3. The method of claim 1 or 2, wherein the accelerator card matrix is a square accelerator card matrix, and each accelerator card in the square accelerator card matrix stores one first sub-data matrix and one second sub-data matrix.
4. The method of any of claims 1-3, wherein performing the multiplication operation on the data matrices by passing the first and second sub-data matrices within the accelerator card matrix comprises:
in each pass, multiplying the stored first sub-data matrix and second sub-data matrix at each accelerator card to obtain a local multiplication result; and
adding the plurality of local multiplication results obtained over the multiple passes to obtain a global multiplication result.
5. The method of claim 4, wherein, in each pass, multiplying the stored first sub-data matrix and second sub-data matrix at each accelerator card to obtain a local multiplication result comprises:
passing each first sub-data matrix along the accelerator cards of its row to obtain a first passed sub-data matrix;
passing each second sub-data matrix along the accelerator cards of its column to obtain a second passed sub-data matrix; and
in each pass, multiplying the obtained first passed sub-data matrix and second passed sub-data matrix at each accelerator card to obtain the local multiplication result.
6. The method of claim 5, wherein passing each first sub-data matrix along the accelerator cards of its row to obtain a first passed sub-data matrix comprises:
in the first pass, passing each first sub-data matrix from its current accelerator card to a target accelerator card in the same row, wherein, in the same direction, the difference between the position number of the target accelerator card and the position number of the current accelerator card equals the number of the row in which the first sub-data matrix is located; and
in each subsequent pass, passing each first sub-data matrix along the accelerator cards by one position.
7. The method of claim 5 or 6, wherein passing each second sub-data matrix along the accelerator cards of its column to obtain a second passed sub-data matrix comprises:
in the first pass, passing each second sub-data matrix from its current accelerator card to a target accelerator card in the same column, wherein, in the same direction, the difference between the position number of the target accelerator card and the position number of the current accelerator card equals the number of the column in which the second sub-data matrix is located; and
in each subsequent pass, passing each second sub-data matrix along the accelerator cards by one position.
8. The method of any of claims 4-7, wherein adding the plurality of local multiplication results obtained over the multiple passes to obtain a global multiplication result comprises:
in each pass, adding the local multiplication result obtained in the current pass to the local multiplication results accumulated from the previous passes, until the passing is finished.
9. An electronic device, comprising:
one or more processors; and
memory having stored therein computer-executable instructions that, when executed by the one or more processors, cause the electronic device to perform the method of any of claims 1-8.
10. A computer-readable storage medium comprising computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1-8.
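For orientation, the passing scheme recited in claims 1-8 has the structure of Cannon's algorithm for distributed matrix multiplication on a 2-D processor grid. The following is a minimal single-process sketch, not taken from the patent text: it simulates a hypothetical P × P square accelerator-card matrix (the claim-3 case) using NumPy arrays as stand-ins for the cards' memories, and the names cannon_multiply, P, a, b, c are illustrative assumptions, not the patent's terminology.

import numpy as np

def cannon_multiply(A, B, P):
    """Multiply A @ B by splitting both into a P x P grid of sub-blocks."""
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % P == 0
    s = n // P  # side length of one sub-data matrix

    # Claims 1-3: split each data matrix into P*P sub-data matrices;
    # "card" (i, j) stores sub-block (i, j) of A and of B.
    a = [[A[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(P)] for i in range(P)]
    b = [[B[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(P)] for i in range(P)]
    c = [[np.zeros((s, s)) for _ in range(P)] for _ in range(P)]

    # First pass (claims 6-7): skew row i of A left by i positions and
    # column j of B up by j positions; the end-to-end row/column links of
    # claim 2 make these shifts circular.
    a = [[a[i][(j + i) % P] for j in range(P)] for i in range(P)]
    b = [[b[(i + j) % P][j] for j in range(P)] for i in range(P)]

    for _ in range(P):
        # Claim 4: each card multiplies the sub-blocks it currently holds;
        # claim 8: the local result is accumulated into the running sum.
        for i in range(P):
            for j in range(P):
                c[i][j] += a[i][j] @ b[i][j]
        # Subsequent passes (claims 6-7): shift A by one card along each
        # row ring and B by one card along each column ring.
        a = [[a[i][(j + 1) % P] for j in range(P)] for i in range(P)]
        b = [[b[(i + 1) % P][j] for j in range(P)] for i in range(P)]

    return np.block(c)  # assemble the global multiplication result

# Quick self-check against NumPy's dense product:
rng = np.random.default_rng(0)
A, B = rng.random((6, 6)), rng.random((6, 6))
assert np.allclose(cannon_multiply(A, B, P=3), A @ B)

In this sketch the initial skew (the first pass of claims 6 and 7) aligns the sub-blocks so that every card holds a matching pair, and each subsequent single-position shift around the row and column rings brings the next pair into place; after P passes, each card's accumulated sum is its block of the global multiplication result.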
CN202110055601.2A 2021-01-15 2021-01-15 Method and equipment for multiplying data matrix in accelerator card matrix Pending CN114764488A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110055601.2A CN114764488A (en) 2021-01-15 2021-01-15 Method and equipment for multiplying data matrix in accelerator card matrix

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110055601.2A CN114764488A (en) 2021-01-15 2021-01-15 Method and equipment for multiplying data matrix in accelerator card matrix

Publications (1)

Publication Number Publication Date
CN114764488A (en) 2022-07-19

Family

ID=82364170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110055601.2A Pending CN114764488A (en) 2021-01-15 2021-01-15 Method and equipment for multiplying data matrix in accelerator card matrix

Country Status (1)

Country Link
CN (1) CN114764488A (en)

Similar Documents

Publication Publication Date Title
CN112686379B (en) Integrated circuit device, electronic apparatus, board and computing method
CN110059797B (en) Computing device and related product
CN112799726A (en) Data processing device, method and related product
CN116842998A (en) Distributed optimization-based multi-FPGA collaborative training neural network method
CN114764374A (en) Method and equipment for executing communication task in accelerator card system
CN110059809B (en) Computing device and related product
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN113902111A (en) Multi-chip interconnection system and neural network accelerated processing method
CN114764488A (en) Method and equipment for multiplying data matrix in accelerator card matrix
CN111382847B (en) Data processing device and related product
CN212846786U (en) Accelerating unit and electronic equipment
CN212846785U (en) Acceleration assembly, acceleration device and electronic equipment
CN114185831A (en) Accelerating unit and electronic equipment
CN114185832A (en) Acceleration assembly, acceleration device and electronic equipment
US20240028553A1 (en) Acceleration unit, acceleration assembly, acceleration device, and electronic device
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN113553286A (en) Method and apparatus for constructing communication topology based on multi-processing nodes
WO2022143194A1 (en) Method for executing asynchronous task, device, and computer program product
WO2022143799A1 (en) Integrated circuit apparatus for matrix multiplication operation, computing device, system, and method
WO2023087814A1 (en) Computing apparatus, method for implementing convolution operation by using computing apparatus, and related product
CN111210011B (en) Data processing device and related product
CN113792867B (en) Arithmetic circuit, chip and board card
CN113791996B (en) Integrated circuit device, electronic apparatus, board and computing method
CN112232498B (en) Data processing device, integrated circuit chip, electronic equipment, board card and method
WO2022001454A1 (en) Integrated computing apparatus, integrated circuit chip, board card, and computing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination