CN115002050B

CN115002050B - Workload proving chip

Info

Publication number: CN115002050B
Application number: CN202210838519.1A
Authority: CN
Inventors: 刘明; 蔡凯; 田佩佳; 张雨生
Original assignee: Sunlune Technology Beijing Co Ltd
Current assignee: Shenglong Singapore Pte Ltd
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-09-30
Anticipated expiration: 2042-07-18
Also published as: CN115002050A; WO2024016661A1

Abstract

A workload proving chip comprises at least one group of computing units, at least one group of storage units, at least two first data nodes and at least two second data nodes, wherein each data node comprises a plurality of groups of first node inlets and a plurality of groups of first node outlets; wherein: the computing unit is configured to send a message requesting data to the storage unit through the first data node when performing workload proof computation; the storage unit is configured to store a data set used in the workload proof computation and, in response to the message from the computing unit, to send a message containing the data to the computing unit through the second data node.

Description

Workload proving chip

Technical Field

The disclosed embodiments relate to, but are not limited to, the field of computer application technologies, and more particularly, to a workload proving chip.

Background

In blockchain technology, generating a block requires completing a Proof of Work (POW). The POW is essentially a hash function computation that can be solved using a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), or the like. During the solving process, a large data set must be accessed at random addresses, and the entire data set is generally stored in main memory or video memory. In blockchain workload proof applications, computing power is directly proportional to data bandwidth, so very high on-chip bandwidth is needed, but conventional CPU, GPU, or FPGA architectures cannot solve this problem well.

Disclosure of Invention

The embodiments of the present disclosure provide a workload proving chip that improves on-chip bandwidth.

An embodiment of the present disclosure provides a workload proving chip, which comprises at least one group of computing units, at least one group of storage units, at least two first data nodes and at least two second data nodes, wherein each data node comprises a plurality of groups of first node inlets and a plurality of groups of first node outlets; wherein:

the computing unit is configured to send a message requesting data to the storage unit through the first data node when performing workload proof computation;

the storage unit is configured to store a data set used in the workload proof computation and, in response to the message from the computing unit, to send a message containing the data to the computing unit through the second data node;

the first data node is configured to forward to the storage unit the message requesting data sent by the computing unit;

the second data node is configured to forward to the computing unit the message containing the data sent by the storage unit.

With the workload proving chip provided by the embodiments of the present disclosure, messages are exchanged between the computing units and the storage units through the data nodes; the structure is simple, the efficiency is high, and the on-chip bandwidth is high.

Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the disclosure. Other advantages of the disclosure may be realized and attained by the instrumentalities and methods described in the specification, claims, and drawings.

Drawings

The accompanying drawings are included to provide an understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification. They illustrate embodiments of the disclosure and, together with the embodiments, serve to explain the principles of the disclosure and not to limit the disclosure. The shapes and sizes of the various elements in the drawings do not reflect true proportions and are merely intended to illustrate the present disclosure.

Fig. 1 is a schematic structural diagram of a workload proving chip according to an embodiment of the present disclosure;

Fig. 2 is a schematic structural diagram of another data node provided in the embodiment of the present disclosure;

Fig. 3 is a schematic diagram of a compression unit and a decompression unit in accordance with an embodiment of the present disclosure;

Fig. 4 is a schematic structural diagram of a data exchange subunit according to an embodiment of the present disclosure;

Fig. 5a is a schematic diagram of a node including 4 first data nodes according to an embodiment of the present disclosure;

Fig. 5b is a schematic diagram of a node including 4 second data nodes according to an embodiment of the present disclosure;

Fig. 6 is a schematic diagram of a connection of a data exchange unit including 6 first data nodes (or second data nodes) according to an embodiment of the present disclosure;

Fig. 7 is a schematic diagram of an internal structure of a data subunit in the data exchange unit shown in Fig. 6;

Fig. 8 is a schematic diagram of a connection of a data exchange unit including 9 first data nodes (or second data nodes) according to an embodiment of the present disclosure;

Fig. 9 is a schematic diagram of a node entry of an internal structure of a data subunit in the data exchange unit shown in Fig. 8.

Detailed Description

The present disclosure describes embodiments, but the description is illustrative rather than limiting and it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the embodiments described in the present disclosure. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are possible. Any feature or element of any embodiment may be used in combination with or instead of any other feature or element in any other embodiment, unless expressly limited otherwise.

The present disclosure includes and contemplates combinations of features and elements known to those of ordinary skill in the art. The embodiments, features and elements of the present disclosure that have been disclosed may also be combined with any conventional features or elements to form unique inventive aspects as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive aspects to form yet another unique inventive aspect, as defined by the claims. Thus, it should be understood that any features shown and/or discussed in this disclosure may be implemented individually or in any suitable combination. Accordingly, the embodiments are not limited except as by the appended claims and their equivalents. Furthermore, various modifications and changes may be made within the scope of the appended claims.

Further, in describing representative embodiments, the specification may have presented the method and/or process as a particular sequence of steps. However, to the extent that the method or process does not rely on the particular order of steps set forth herein, the method or process should not be limited to the particular sequence of steps described. Other orders of steps are possible as will be understood by those of ordinary skill in the art. Therefore, the particular order of the steps set forth in the specification should not be construed as limitations on the claims. Furthermore, the claims directed to the method and/or process should not be limited to the performance of their steps in the order written, and one skilled in the art can readily appreciate that the sequences may be varied and still remain within the spirit and scope of the embodiments of the present disclosure.

Fig. 1 is a schematic structural diagram of a workload proving chip according to an embodiment of the present disclosure. The workload proving chip includes at least one group of computing units, at least one group of storage units, at least two first data nodes, and at least two second data nodes. Each data node (first or second) includes multiple groups of first node inlets and multiple groups of first node outlets. A group of computing units is connected to a group of first node inlets of the first data nodes in one-to-one correspondence, and a group of first node outlets of the first data nodes is connected to a group of storage units in one-to-one correspondence. A group of storage units is connected to a group of first node inlets of the second data nodes in one-to-one correspondence. That is, the group of computing units is connected to the group of storage units through at least two first data nodes, and the group of storage units is connected to the group of computing units through at least two second data nodes.

In this embodiment, the first data node network is used to implement message transmission from the computing unit to the storage unit, and the second data node network is used to implement message transmission from the storage unit to the computing unit.

In an exemplary embodiment, the first data nodes may be interconnected in a mesh, and the second data nodes may likewise be interconnected in a mesh. Each data node (first or second) includes a plurality of routing units, a plurality of arbitration units, a data exchange unit, an interconnection unit, a plurality of first node entries, a plurality of first node exits, one or more second node entries, and one or more second node exits. The input of each routing unit is connected to one first node entry; the first output of each routing unit is connected to the first input of one arbitration unit in one-to-one correspondence; and the second output of each routing unit is connected to a first input of the data exchange unit. The first output of the data exchange unit is connected to the second node exit, the second input of the data exchange unit is connected to the second node entry, and the second output of the data exchange unit is connected to the second input of each arbitration unit. The outputs of the arbitration units are connected to the inputs of the interconnection unit in one-to-one correspondence, and the outputs of the interconnection unit are connected to the first node exits in one-to-one correspondence. The second node entries and the second node exits are configured to connect to other data nodes, wherein:

the routing unit is configured to receive a message from a first node entry and send the message to the arbitration unit and/or the data exchange unit;

the data exchange unit is configured to receive a message from a second node entry and send it to the arbitration unit, and to receive a message sent by the routing unit and output it through a second node exit;

the arbitration unit is configured to receive the message sent by the routing unit and/or the data exchange unit and send the message to a first node exit through the interconnection unit.

The interconnect unit may be adapted to send the message sent by the arbitration unit to any of the first node outlets.

A message comprises a request or data. For any data node, if the first node inlets of the data node are connected to computing units, that is, the data node is a first data node, then the messages it transmits are messages requesting data, and its second node inlets and second node outlets are connected to other first data nodes. If the first node inlets of the data node are connected to storage units, that is, the data node is a second data node, then the messages it transmits are messages containing data, and its second node inlets and second node outlets are connected to other second data nodes. "A plurality" herein means two or more.

The following describes the above data node structure in detail. Each data node includes: a plurality of routing units, a plurality of arbitration units, a data exchange unit, an interconnection unit, a plurality of first node entries, a plurality of first node exits, at least one second node entry, at least one second node exit, wherein:

the routing unit comprises input ports, first output ports and second output ports, each input port is connected with a first node inlet, the first output port is connected with the input of the arbitration unit, the second output port is connected with the input of the data exchange unit, and the routing unit is used for receiving messages input by the first node inlets and forwarding the messages to the arbitration unit or the data exchange unit; in this example, each node entry is connected with an independent routing unit, each routing unit corresponds to an independent arbitration unit, and the routing unit can route the message to the arbitration unit or the data exchange unit according to the destination contained in the message;

the data exchange unit comprises a plurality of first input ports, a plurality of first output ports, a plurality of second input ports and a plurality of second output ports, each first input port is connected with the second output end of the routing unit, each first output port is connected with a second node outlet, each second input port is connected with a second node inlet, each second output port is connected with an input of an arbitration unit, and the data exchange unit is used for forwarding messages to other data nodes or receiving messages sent by other data nodes;

the arbitration unit comprises a first input port connected with the first output port of the routing unit, a second input port connected with the second output port of the data exchange unit and an output port connected with the interconnection unit, and is used for receiving messages sent by the routing unit and/or the data exchange unit and sending the messages to the interconnection unit through the arbitration output port;

the interconnection unit comprises a plurality of input ports and a plurality of output ports, each input port is connected with one output port of the arbitration unit, each interconnection output port is connected with one first node outlet, and the interconnection unit is used for sending the message output by the arbitration unit to any one first node outlet.
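As an informal illustration of the message path inside one data node described above, the following Python sketch models only the routing decision and the crossbar-style delivery. The `Message` fields, the integer node and outlet indices, and the function names are hypothetical; the disclosure states only that the routing unit routes according to the destination contained in the message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    dest_node: int    # hypothetical: index of the data node owning the target outlet
    dest_outlet: int  # hypothetical: index of the first node outlet on that node
    payload: str      # a request or data


def route(msg: Message, local_node: int) -> str:
    """Routing unit decision: messages for this node go to the arbitration
    unit; all others go to the data exchange unit for forwarding."""
    if msg.dest_node == local_node:
        return "arbitration_unit"
    return "data_exchange_unit"


def deliver_local(msg: Message, num_outlets: int) -> int:
    """Interconnection unit: any input can reach any first node outlet,
    so delivery reduces to selecting the outlet named in the message."""
    if not 0 <= msg.dest_outlet < num_outlets:
        raise ValueError("outlet index out of range")
    return msg.dest_outlet
```

In this sketch a message whose destination lies on another data node never reaches the local interconnection unit; it leaves through a second node exit and is arbitrated on the node that owns the destination outlet.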

With the workload proving chip provided by the embodiments of the present disclosure, messages are exchanged between the computing units and the storage units through the data nodes. Because the data nodes adopt a mesh topology, connect to other data nodes through the data exchange unit, and deliver output through the interconnection unit, the workload proving chip implemented with these data nodes has a simple structure, high efficiency, and high on-chip bandwidth.

The numbers of first node entries and first node exits may be the same or different; that is, the number of computing units connected to each first data node and the number of storage units connected to each second data node may be the same or different. The number of first node entries and of first node exits may range from 2 to 16348; that is, a group of computing units contains 2 to 16348 units, and a group of storage units contains 2 to 16348 units. The number of data nodes of each kind may be 2 or more, for example 4, 6, 9, or even more, which is not limited in the present application. The first data nodes may adopt a mesh interconnection topology, and the second data nodes may also adopt a mesh interconnection topology; the first data nodes are not connected to the second data nodes. The data nodes are arranged in a regular grid, and each data node is connected only to adjacent data nodes in the same row or column.

The following describes the internal units of the data nodes in detail, and the internal units of the first data node and the second data node have the same composition.

In an exemplary embodiment, the arbitration unit may be an arbitration structure with backpressure and caching. The arbitration unit can cache a certain number of messages and send them to the corresponding interconnection unit when the interconnection unit is able to receive them. When the cache is full, the arbitration unit applies backpressure to the preceding-stage unit, so that messages sent by the preceding-stage unit are not lost for lack of a receiver; when the cache is no longer full, the backpressure is released. Similarly, the routing unit may also be a routing structure with backpressure and caching.
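The caching and backpressure behaviour described above can be sketched in software as a bounded queue. This is only an illustrative model under assumed names (`BackpressureBuffer`, `push`, `pop`, `downstream_ready`); the disclosure does not specify the cache depth or the signalling used in hardware.

```python
from collections import deque


class BackpressureBuffer:
    """Bounded message cache that refuses new messages while it is full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._queue = deque()

    @property
    def backpressure(self) -> bool:
        # Asserted towards the preceding stage while the cache is full.
        return len(self._queue) >= self.capacity

    def push(self, msg) -> bool:
        """Accept a message unless backpressure is asserted."""
        if self.backpressure:
            return False  # the sender must hold the message and retry
        self._queue.append(msg)
        return True

    def pop(self, downstream_ready: bool):
        """Release the oldest message only when the next stage can take it."""
        if downstream_ready and self._queue:
            return self._queue.popleft()
        return None
```

Because `push` returns `False` rather than discarding the message, the sender keeps ownership of it until space is available, which mirrors the stated goal of preventing message loss.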

In an exemplary embodiment, the arbitration unit is further configured to set a weight for each of its input ports, where the weight of an input port represents the number of messages that the input port can process consecutively. The weight ratio of the ports can be designed according to the data volume at each input port; it determines the proportion of messages passed by each port. When this proportion matches the proportion of requests or data that actually need to pass, the efficiency of the whole system is improved.

In addition, the arbitration unit may set different priorities for its input ports. When processing messages, the arbitration unit selects the input port that has the highest priority among those with pending messages; after the messages of that input port have been processed, the priorities of the input ports are readjusted. For example, after the messages of the highest-priority input port with pending messages have been processed, the priority of that input port is set to the lowest.

The arbitration unit is described below taking weighted round-robin arbitration with a weight ratio of 1:3 as an example. Suppose the arbitration unit includes two input ports S1 and S2, the default priority of the two input ports is S1 > S2, the weight of S1 is 3, and the weight of S2 is 1, where port S1 may be the port connected to the data exchange unit and port S2 may be the port connected to the routing unit. In this example, the weight determines the number of messages that may be sent: a weight of 3 means that at most 3x messages can be sent consecutively, and a weight of 1 means that at most x messages can be sent consecutively, where x is an integer greater than or equal to 1. If the weight is 0, the port is considered closed and no message is allowed to pass. In this example, the priority adjustment principle is to set a port's priority to the lowest once the port has finished sending its messages or has no message to send.

The weighted round-robin arbitration process of the arbitration unit is illustrated as follows. Assume that port S1 receives requests and currently has the highest priority. Since the weight of port S1 is 3, port S1 can send at most 3x requests consecutively. When port S1 has sent 3x messages consecutively, or has no more messages, the arbitration unit adjusts the priority order to S2 > S1. If port S2 has requests at this time, then since S2 now has the highest priority and its weight is 1, port S2 can send at most x messages consecutively. When port S2 has sent x messages consecutively, or has no more messages, the arbitration unit adjusts the priority order back to S1 > S2. Weighted round-robin arbitration can improve the processing efficiency of the arbitration unit, with a pronounced effect under heavy data load. In other embodiments, a fixed-weight round-robin arbitration scheme (e.g., a fixed 1:1 weight ratio across ports) or a fixed-priority arbitration scheme may be employed.
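The arbitration process just described can be mirrored by a small software model. The sketch below is an assumption-laden illustration, not the chip's logic: the function name, the dictionary-based port queues, and the use of insertion order as the initial priority are all hypothetical.

```python
from collections import deque


def weighted_round_robin(queues, weights, x=1):
    """Drain `queues` (dict: port name -> list of messages) and return the
    grant order as (port, message) pairs.

    `weights[port] * x` is the longest burst a port may send before its
    priority drops to the lowest; a weight of 0 closes the port.
    The initial priority order is the insertion order of `queues`.
    """
    pending = {p: deque(msgs) for p, msgs in queues.items() if weights[p] > 0}
    priority = deque(pending)  # leftmost entry has the highest priority
    grants = []
    while any(pending.values()):
        port = priority[0]
        burst = weights[port] * x
        while burst > 0 and pending[port]:
            grants.append((port, pending[port].popleft()))
            burst -= 1
        priority.rotate(-1)  # the served (or empty) port becomes lowest
    return grants
```

With weights S1 = 3 and S2 = 1, x = 1, four messages queued on S1 and two on S2, this model grants three messages from S1, one from S2, the remaining one from S1, and then the last one from S2.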

The interconnection unit comprises a plurality of input ports and a plurality of output ports, and data input by any input port can be output through any output port, that is, the interconnection unit can send a message to any first node outlet according to the destination of the message. The number of the input ports and the output ports can be the same or different, and the specific number can be set according to the chip requirements, for example, the number can be set to 128 or 4096, and the like. The interconnection unit may be implemented, for example, by using a full crossbar (or a fully associative crossbar), which is a multi-input multi-output structure, and data may enter from any input to reach any output.

In an exemplary embodiment, the chip may further include a compression unit and a decompression unit, each routing unit is connected with the data exchange unit through the compression unit, and the data exchange unit is connected with each arbitration unit through the decompression unit. Fig. 2 is a schematic structural diagram of another data node provided in the embodiment of the present disclosure, in this example, a second output port of each routing unit is connected to an input port of a compression unit, and an output port of the compression unit is connected to a first input port of a data switching unit. The second output port of the data exchange unit is connected with the input port of the decompression unit, and the output port of the decompression unit is connected with the second input port of each arbitration unit.

The compression unit compresses the number of buses: it compresses the m buses input from the m routing units into n buses output to the data exchange unit. For example, the compression unit comprises m input ports and n output ports, where m and n are positive integers and m > n. By reducing the number of buses connected to the routing units from m to n, the compression unit reduces the complexity of the data exchange unit. The number of buses can be compressed because, when messages entering from the first node entries pass through the routing units, part of them are routed to the arbitration units, so the load on the buses routed to the compression unit is necessarily lower and fewer buses suffice to carry those messages. Taking a workload proving chip containing 4 first data nodes as an example, the buses can be compressed at a ratio of 4:3, i.e., m:n = 4:3, because a message from a first node entry, after passing through the routing unit, goes to the arbitration unit with probability 1/4 and to the compression unit with probability 3/4. The compressed buses (still multiple groups of buses) are connected to the data exchange unit.
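The 4:3 figure can be reproduced with a short calculation. The sketch below assumes, as the 1/4 and 3/4 probabilities above suggest, that each message is equally likely to target any data node; the function name and the rounding-up rule for non-integer results are assumptions made for illustration.

```python
from fractions import Fraction
from math import ceil


def compressed_bus_count(m: int, num_nodes: int) -> int:
    """Estimate n, the number of buses after compression, for m input buses.

    Assumes uniformly distributed destinations, so a fraction
    (num_nodes - 1) / num_nodes of the traffic leaves the node through
    the compression unit and the rest stays local.
    """
    if m < 1 or num_nodes < 2:
        raise ValueError("need at least one bus and two data nodes")
    remote_fraction = Fraction(num_nodes - 1, num_nodes)
    return ceil(m * remote_fraction)
```

For m = 4 buses and 4 data nodes this yields n = 3, matching the 4:3 ratio in the text.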

The decompression unit performs the inverse function of the compression unit: it restores the number of buses to match the number of arbitration units. It comprises n input ports and m output ports, restores the n buses input from the data exchange unit into m buses, and feeds the m buses to the m arbitration units respectively. Decompressing the number of buses from n to m facilitates arbitration on the buses.

Fig. 3 shows an example of a compression unit and a decompression unit. In this example, which compresses and decompresses 4 groups of buses, the compression unit compresses 4 groups of buses into 3 groups and the decompression unit restores 3 groups of buses into 4 groups, so that fewer buses can be used to transmit data without affecting chip function. In the figure, S00, S01, S02, and S03 are data sources, connected to buses S10, S11, S12, and S13 respectively. The buses S10, S11, S12, and S13 are connected to the compression unit S2: buses S10, S11, and S12 are connected to the arbitration units S220, S221, and S222 in the compression unit S2, respectively, and bus S13 is connected to the routing unit S20. The arbitration units S220, S221, and S222 are preferably weighted round-robin arbiters; in some examples, they may also be ordinary arbiters or round-robin arbiters. The routing unit S20 is connected to the cache units S210, S211, and S212; the cache units S210, S211, and S212 are connected to the arbitration units S220, S221, and S222, respectively; the arbitration units S220, S221, and S222 are connected to the compressed buses S30, S31, and S32, respectively. The buses S30, S31, and S32 are connected to the decompression unit S4, and to the routing units S400, S401, and S402 in S4, respectively. The routing units S400, S401, and S402 are connected to the restored buses S50, S51, and S52, respectively, and are all also connected to the arbitration unit S41. The arbitration unit S41 is preferably a round-robin arbiter, but an ordinary arbiter may also be used. The arbitration unit S41 is connected to the restored bus S53. The buses S50, S51, S52, and S53 are connected to data destinations S60, S61, S62, and S63, respectively.

The data compression work flow is as follows:

the data sources S00, S01, S02, S03 send data to the buses S10, S11, S12, S13, respectively; wherein: the data of the bus S13 is divided into 3 parts by the routing unit S20, and is respectively cached in the cache units S210, S211 and S212; the data of the cache unit S210 and the data of the bus S10 are generated into the data of the bus S30 by the arbitration unit S220; the data of the buffer unit S211 and the data of the bus S11 are generated into the data of the bus S31 through the arbitration unit S221; the data of the cache unit S212 and the data of the bus S12 are generated into the data of the bus S32 by the arbitration unit S222; completing data compression;

The data decompression workflow is as follows:

the buses S30, S31, S32 transfer data to the decompression unit S4; the routing unit S400 receives the data of the bus S30, separates the data of the bus S10 and sends the separated data to the bus S50, completes the restoration of the data of the bus S10, and sends the separated data of the bus S13 to the arbitration unit S41; the routing unit S401 receives the data of the bus S31, separates out the data of the bus S11 and sends the data to the bus S51, completes the restoration of the data of the bus S11, and sends the separated data of the bus S13 to the arbitration unit S41; the routing unit S402 receives the data of the bus S32, separates the data of the bus S12 and sends the separated data to the bus S52, completes the restoration of the data of the bus S12, and sends the separated data of the bus S13 to the arbitration unit S41; the arbitration unit S41 receives the data of the routing units S400, S401 and S402, sends the data to the bus S53 and completes the data restoration of the bus S13; the buses S50, S51, S52, S53 send data to data destinations S60, S61, S62, S63, respectively.
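The two workflows above can be traced with a toy model. The sketch below is illustrative only: tagging each message with its source bus index is a hypothetical stand-in for whatever destination information real messages carry, the round-robin split in the routing unit S20 is an assumption, and plain alternation stands in for the arbiters S220 to S222.

```python
from itertools import zip_longest


def compress(buses):
    """4 -> 3 compression in the spirit of Fig. 3.

    `buses` is a list of message lists (S10..S13). The last bus is split
    across the others; every message is tagged with its source bus index.
    """
    *direct, shared = buses
    n = len(direct)
    # Routing unit S20: split the last bus into n parts (round-robin assumed).
    caches = [shared[i::n] for i in range(n)]
    compressed = []
    for i in range(n):
        own = [(i, msg) for msg in direct[i]]
        extra = [(n, msg) for msg in caches[i]]
        # Arbitration unit S22i: alternate between the bus and its cache.
        merged = [t for pair in zip_longest(own, extra) for t in pair if t is not None]
        compressed.append(merged)
    return compressed


def decompress(compressed):
    """3 -> 4 restoration: routing units S400..S402 split each bus by tag,
    and the arbiter S41 collects the pieces of the split bus."""
    n = len(compressed)
    restored = [[] for _ in range(n + 1)]
    for bus in compressed:
        for src, msg in bus:
            restored[src].append(msg)
    return restored
```

In this model the three directly carried buses are restored in their original order, while the messages of the split bus all arrive but may be reordered, since they travel over three different compressed buses.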

The data exchange unit in each data node may include k data exchange subunits, where k is a positive integer greater than or equal to 2, and the value of k depends on the number of routing units or on the compression ratio of the compression unit. Specifically:

if the data exchange unit is directly connected with the routing unit (the structure shown in fig. 1), the number of the data exchange subunits is the same as that of the routing unit, each data exchange subunit comprises a group of input/output ports, namely a first input port and a second output port, for connecting with the routing unit and the arbitration unit, and one or more groups of input/output ports, namely a second input port and a first output port, for connecting with a second node inlet and a second node outlet, wherein the first input port is connected with the routing unit, one first output port is connected with one second node outlet, one second input port is connected with one second node inlet, and the second output port is connected with the arbitration unit.

If the data exchange unit is connected with the compression unit and the decompression unit respectively (the structure is shown in fig. 2), the number of the data exchange sub-units is the same as the number of the output ports of the compression unit. Each data exchange subunit comprises a group of input and output ports, namely a first input port and a second output port, which are used for being connected with the compression unit and the decompression unit, and one or more groups of input and output ports, namely a second input port and a first output port, which are used for being connected with a second node inlet and a second node outlet, wherein the first input port is connected with the compression unit, one first output port is connected with one second node outlet, one second input port is connected with one second node inlet, and the second output port is connected with the decompression unit. Therefore, after the compression unit and the decompression unit are connected, the complexity of the data exchange unit can be reduced due to the reduction of the number of buses.

Each data exchange subunit comprises multiple groups of routing subunits and arbitration subunits. The number of routing subunits equals the number of arbitration subunits, and the routing subunits and arbitration subunits are interconnected pairwise. Their number depends on the number of data nodes adjacent to the data node in which the data exchange unit is located; specifically, it may be the number of adjacent data nodes plus 1. For example, if the current data node has 2 adjacent nodes, there are 2 + 1 = 3 routing subunits and 3 arbitration subunits. In each data exchange subunit, the first input port is connected to one routing subunit, each first output port is connected to one arbitration subunit, each second input port is connected to one routing subunit, and the second output port is connected to one arbitration subunit.

Taking a data node with two adjacent data nodes as an example, the structure of a data exchange subunit connected to one group of buses (one input and one output) is shown in Fig. 4. In the figure, the data exchange subunit is a pairwise-interconnected structure comprising three groups of routing subunits and arbitration subunits. One group of routing subunit and arbitration subunit is connected to the compression unit (or to a routing unit when no compression unit is provided) and to the decompression unit (or to an arbitration unit when no decompression unit is provided), respectively. The other two groups are connected to the data exchange units of the two adjacent data nodes, respectively: each routing subunit is connected to an arbitration subunit of the data exchange subunit of the adjacent node, and each arbitration subunit is connected to a routing subunit of the data exchange subunit of the adjacent node. The k data exchange subunits together form the data exchange unit.
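The "adjacent data nodes plus 1" rule can be checked for different mesh positions with a few lines of code. The sketch assumes a rectangular rows x cols mesh in which each node connects only to its neighbors in the same row or column; the function name and the (row, column) coordinates are hypothetical.

```python
def subunit_pairs(rows: int, cols: int, r: int, c: int) -> int:
    """Number of routing/arbitration subunit pairs in each data exchange
    subunit of the node at (r, c): the number of mesh neighbors plus 1."""
    if not (0 <= r < rows and 0 <= c < cols):
        raise ValueError("node position outside the mesh")
    neighbors = sum(
        0 <= r + dr < rows and 0 <= c + dc < cols
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
    )
    return neighbors + 1
```

In a 2 x 2 mesh every node has two neighbors and therefore three pairs, as in Fig. 4. If nine data nodes were laid out 3 x 3 (an arrangement assumed here purely for illustration), corner, edge, and center nodes would need three, four, and five pairs respectively.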

The arbitration subunits within the data exchange subunit may employ weighted round-robin arbitration. Weighted round-robin arbitration requires a weight to be configured for each input port, and the weight ratio represents the ratio of message volume passing through each input port. Take as an example a workload proving chip that includes 4 first data nodes, with the data exchange units in the 4 first data nodes arranged 2 × 2 and messages routed horizontally first and then vertically (that is, a message for a diagonal node is first routed to the horizontally adjacent node and then to its destination). In each data exchange subunit, in the arbitration subunit connected to the decompression unit, the weight ratio of the entry connected to the horizontal node to the entry connected to the vertical node is 1:2; the entry weight ratio of the other arbitration subunits in the data exchange subunit is 1:1. The weighted round-robin arbitration process is described above and is not repeated here; using it can improve the efficiency of the data exchange unit. In other embodiments, the arbitration subunits within the data exchange subunit may also use round-robin arbitration or fixed-priority arbitration.
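The 1:2 weight ratio can be derived by counting last hops under horizontal-then-vertical routing. The sketch below assumes every other node sends the same amount of traffic to a given destination; the function names and the (row, column) coordinates are hypothetical.

```python
from collections import Counter


def xy_route(src, dst):
    """Return the nodes visited from src to dst, moving along the row
    (horizontal) first and then along the column (vertical)."""
    (r, c), (dr, dc) = src, dst
    path = [(r, c)]
    while c != dc:
        c += 1 if dc > c else -1
        path.append((r, c))
    while r != dr:
        r += 1 if dr > r else -1
        path.append((r, c))
    return path


def arrival_mix(rows, cols, dst):
    """For messages delivered at dst, count whether the last hop came from
    a horizontal or a vertical neighbor, with one message per other node."""
    mix = Counter()
    for r in range(rows):
        for c in range(cols):
            if (r, c) == dst:
                continue
            prev = xy_route((r, c), dst)[-2]
            mix["horizontal" if prev[0] == dst[0] else "vertical"] += 1
    return mix
```

In a 2 x 2 mesh the horizontal neighbor delivers only its own traffic, while the vertical neighbor delivers its own traffic plus that of the diagonal node, which gives the 1:2 ratio stated above for the arbitration subunit that feeds the decompression unit.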

Fig. 5 is a schematic structural diagram of a workload certification chip provided in an embodiment of the present disclosure, where fig. 5a shows the connection relationships of the 4 first data nodes and fig. 5b shows the connection relationships of the 4 second data nodes. In this example, each data node includes a compression unit and a decompression unit, all data nodes in the figures have the same structure, the first data nodes and the second data nodes both adopt a 2 × 2 mesh topology, and the structure of the data exchange subunits included in each data exchange unit is as shown in fig. 4. Assume that the computing unit A11 starts a workload proving computation and needs to request data in the storage unit B41; this request is denoted request 1. As shown in fig. 5a, request 1 is first sent to the routing unit of the first data node 1 that is connected to the computing unit A11, and is cached by that routing unit. When the routing unit processes the cached request, request 1 is sent through the compression unit to the data exchange unit of the first data node 1, is forwarded to the data exchange unit of the first data node 4, and is then sent to the storage unit B41 through the decompression unit, the arbitration unit and the interconnection unit of the first data node 4. Request 1 accesses the storage unit B41 to obtain the requested data, denoted data 1. As shown in fig. 5b, data 1 is sent to the computing unit A11 via the second data node 4 and the second data node 1 in sequence; the process is similar to that of request 1 and is not described again here. The computing unit A11 has thus completed its request for data located in the storage unit B41.
Any computing unit can obtain data required by workload certification from any storage unit by referring to the above process, and perform workload certification computation.
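The horizontal-then-vertical routing mode used between data exchange units can be sketched in Python as follows. Nodes are identified by (row, column) coordinates, which is an assumption of this sketch; the mapping of the node numbers in the figures to coordinates is not assumed. The function names xy_route and last_hop_counts are hypothetical. The second function counts, for one destination and one message from every other node, how many messages arrive over the horizontal link and how many over the vertical link.

```python
from collections import Counter
from itertools import product


def xy_route(src, dst):
    """Return the (row, col) nodes visited from src to dst,
    moving along the row (horizontally) first, then along the column."""
    row, col = src
    path = [(row, col)]
    while col != dst[1]:  # horizontal hops
        col += 1 if dst[1] > col else -1
        path.append((row, col))
    while row != dst[0]:  # vertical hops
        row += 1 if dst[0] > row else -1
        path.append((row, col))
    return path


def last_hop_counts(rows, cols, dst):
    """Count the direction of the final hop into dst, one message per other node."""
    counts = Counter()
    for src in product(range(rows), range(cols)):
        if src == dst:
            continue
        previous = xy_route(src, dst)[-2]
        direction = "horizontal" if previous[0] == dst[0] else "vertical"
        counts[direction] += 1
    return counts


diagonal_path = xy_route((0, 0), (1, 1))
arrivals = last_hop_counts(2, 2, (1, 1))
```

In a 2 × 2 mesh, a message from the diagonal node first moves to the horizontally adjacent node and then enters the destination over the vertical link. Under the assumption of one message from every other node, the destination therefore receives one message over the horizontal link and two over the vertical link, which is consistent with the 1:2 entry weight ratio described for the arbitration subunit connected to the decompression unit.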

Fig. 6 is a schematic diagram of the interconnection of the 6 data exchange units in 6 data nodes when the workload certification chip includes 6 first data nodes (or second data nodes), where the data exchange units are distributed in a 2 × 3 mesh topology. In this case, since each data exchange unit located in the middle row is connected to 3 adjacent data nodes, the number of groups of routing subunits and arbitration subunits in each data exchange subunit of each data exchange unit is 3 + 1. The internal structure of each data exchange subunit is shown in fig. 7 and comprises 4 groups of routing subunits and arbitration subunits interconnected in pairs.

Fig. 8 is a schematic diagram of the interconnection of the 9 data exchange units in 9 data nodes when the workload certification chip includes 9 first data nodes (or second data nodes), where the data exchange units are distributed in a 3 × 3 mesh topology. In this case, since the data exchange unit located in the middle is connected to 4 adjacent data nodes, the number of groups of routing subunits and arbitration subunits in each data exchange subunit of each data exchange unit is 4 + 1. The internal structure of each data exchange subunit is shown in fig. 9 and comprises 5 groups of routing subunits and arbitration subunits interconnected in pairs.
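The group counts given for the 2 × 2, 2 × 3 and 3 × 3 arrangements follow one rule: the largest number of mesh neighbours of any node, plus one group for the local side of the data node. A minimal Python sketch of that rule is given below; the function name groups_per_subunit is hypothetical, and the sketch assumes a plain rectangular mesh without wrap-around links.

```python
def groups_per_subunit(rows, cols):
    """Groups of routing/arbitration subunits per data exchange subunit:
    the maximum neighbour count over all mesh nodes, plus one local group."""
    max_neighbours = 0
    for r in range(rows):
        for c in range(cols):
            # Count the up, down, left and right neighbours that lie inside the mesh.
            neighbours = sum(
                0 <= r + dr < rows and 0 <= c + dc < cols
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
            )
            max_neighbours = max(max_neighbours, neighbours)
    return max_neighbours + 1


group_counts = {shape: groups_per_subunit(*shape) for shape in ((2, 2), (2, 3), (3, 3))}
```

For the three arrangements discussed in the text, this yields 3, 4 and 5 groups respectively.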

With the scheme of the embodiments of the present disclosure, a workload proving chip with (2 to 9) × 2 data nodes can be realized, where 2 to 9 is the number of first data nodes or of second data nodes. Although a mesh distribution is used as the example herein, other topologies are not excluded; for example, a star structure may be used when the number of data nodes is small, such as 3 or 5.

Take as an example a total of 120 computing units and 120 storage units to be connected. If a full crossbar alone were used, a 120 × 120 full crossbar would be required, which is difficult to implement at the current level of technology; if a mesh structure alone were used, a 16 × 8 mesh arrangement would be required, which results in very low efficiency. With the workload proving chip structure provided by this embodiment, taking a structure with 4 first data nodes and 4 second data nodes as an example, each data node includes 30 groups of inlets and outlets, that is, each data node is connected to 30 computing units and 30 storage units, and the interconnection unit of each data node may be a 30 × 30 full crossbar. The workload proving chip therefore needs only 4 × 2 full crossbars of 30 × 30 and 2 groups of 2 × 2 mesh interconnects to implement message intercommunication between any computing unit and any storage unit. Because the ports are shared out among the mesh nodes, the problem of a full crossbar with too many ports being too large to implement is avoided; the chip can be realized with fewer data nodes, and the structure is simple and efficient. Meanwhile, the workload proving chip provided by the embodiments of the present disclosure can obtain higher on-chip bandwidth: as measured, with a port width of 1024 bits and a clock frequency of 500 MHz, it achieves an on-chip bandwidth of about 6144 GB/s, far exceeding the 1004 GB/s on-chip bandwidth of the current highest-end GPU.
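The figures in this example can be checked with a short Python calculation. The sketch below assumes that the number of crosspoints (inputs × outputs) is a reasonable proxy for crossbar size, and that a port transfers one full-width word per clock cycle; neither assumption is stated in the text, and the function names are hypothetical.

```python
def crosspoints(inputs, outputs):
    """Crosspoint count of a full crossbar, used here as a rough size proxy."""
    return inputs * outputs


def port_bandwidth_gb_per_s(width_bits, clock_hz):
    """Bandwidth of one port in GB/s, assuming one full-width transfer per cycle."""
    return width_bits / 8 * clock_hz / 1e9


units = 120                    # computing units (and storage units) to connect
nodes_per_direction = 4        # 4 first data nodes and 4 second data nodes
ports_per_node = units // nodes_per_direction

single_crossbar = crosspoints(units, units)
partitioned = 2 * nodes_per_direction * crosspoints(ports_per_node, ports_per_node)

per_port = port_bandwidth_gb_per_s(1024, 500e6)
equivalent_ports = 6144 / per_port
```

Under these assumptions, a single 120 × 120 crossbar has 14400 crosspoints, while the eight 30 × 30 crossbars together have 7200. One 1024-bit port at 500 MHz carries 64 GB/s, so the quoted 6144 GB/s corresponds to 96 such ports transferring on every cycle. The text does not state how the measured figure is composed, so the port count is only a consistency check.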

In the description of the embodiments of the present disclosure, it should be noted that, unless otherwise explicitly stated or limited, the terms "mounted," "coupled," and "connected" are to be construed broadly: a connection may be, for example, a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; and it may be a direct connection, an indirect connection through an intervening medium, or an internal communication between two elements. The specific meanings of the above terms in the present disclosure can be understood by those of ordinary skill in the art according to the specific circumstances.

It will be understood by those of ordinary skill in the art that all or some of the steps of the methods disclosed above, and the functional modules/units in the systems and devices disclosed above, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, as is known to those skilled in the art, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media.

Claims

1. A workload certification chip is characterized by comprising at least one group of computing units, at least one group of storage units, at least two first data nodes and at least two second data nodes, wherein each data node comprises a plurality of groups of first node inlets and a plurality of groups of first node outlets, the computing units are connected with the first node inlets of the first data nodes in a one-to-one correspondence mode, the first node outlets of the first data nodes are connected with the storage units in a one-to-one correspondence mode, the storage units are connected with the first node inlets of the second data nodes in a one-to-one correspondence mode, and the first node outlets of the second data nodes are connected with the computing units in a one-to-one correspondence mode; wherein:

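the computing unit is used for sending a message for requesting data to the storage unit through the first data node when the workload certification computation is carried out;

the storage unit is arranged to store a data set used for the workload certification computation, and to send a message containing data to the computing unit through the second data node in response to the message from the computing unit.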

2. The workload certification chip according to claim 1,

the first data nodes are interconnected by grids, the second data nodes are interconnected by grids, each data node comprises a plurality of routing units, a plurality of arbitration units, a data exchange unit, an interconnection unit, a plurality of first node inlets, a plurality of first node outlets, one or more second node inlets and one or more second node outlets, the input end of each routing unit is connected with one first node inlet, the first output end of each routing unit is connected with the first input end of one arbitration unit in a one-to-one correspondence manner, the second output end of each routing unit is connected with the first input end of the data exchange unit, the first output end of the data exchange unit is connected with the second node outlet, and the second input end of the data exchange unit is connected with the second node inlet, the second output end of the data exchange unit is connected with the second input end of each arbitration unit, the output end of each arbitration unit is connected with the input end of the interconnection unit in a one-to-one correspondence manner, the output end of the interconnection unit is connected with the first node outlet in a one-to-one correspondence manner, and a second node inlet and a second node outlet are set to be connected with other data nodes, wherein:

the routing unit is used for receiving the message of the first node entrance and sending the message to the arbitration unit and/or the data exchange unit;

the data exchange unit is used for receiving the message of the second node inlet and sending the message to the arbitration unit, and is used for receiving the message sent by the routing unit and outputting the message through the second node outlet;

the arbitration unit is configured to receive a message sent by the routing unit and/or the data exchange unit, and send the message to the first node outlet through the interconnection unit.

3. The workload certification chip according to claim 2, wherein the data node further comprises a compression unit and a decompression unit, each of the routing units is connected to the data exchange unit through the compression unit, and the data exchange unit is connected to each of the arbitration units through the decompression unit; wherein:

the compression unit comprises m input ports and n output ports and is used for compressing m paths of buses input by the m routing units into n paths of buses and outputting the n paths of buses to the data exchange unit;

the decompression unit comprises n input ports and m output ports and is used for restoring n paths of buses input by the data exchange unit into m paths of buses which are respectively input into m arbitration units;

wherein m and n are positive integers greater than zero, and m > n.

4. The workload certification chip according to claim 2,

the data switching unit comprises a plurality of data switching subunits, the number of the data switching subunits is the same as that of the routing units, and each data switching subunit comprises a first input port used for being connected with the routing unit, a first output port used for being connected with the outlet of the second node, a second input port used for being connected with the inlet of the second node, and a second output port used for being connected with the arbitration unit.

5. The workload certification chip according to claim 3,

the data exchange unit comprises n data exchange subunits, and each data exchange subunit comprises a first input port connected with the compression unit, a first output port connected with the outlet of the second node, a second input port connected with the inlet of the second node, and a second output port connected with the decompression unit.

6. The workload certification chip according to claim 4 or 5,

each data exchange subunit comprises a plurality of groups of routing subunits and arbitration subunits, the routing subunits and the arbitration subunits are connected with each other in pairs, wherein the first input port is connected with one routing subunit, the first output port is connected with one arbitration subunit, the second input port is connected with one routing subunit, and the second output port is connected with one arbitration subunit.

7. The workload certification chip according to claim 6,

the arbitration subunit is configured to set different weights for the plurality of input ports of the arbitration subunit, where the weight of each input port represents the number of messages that can be continuously processed by the input port, and to set different priorities for the plurality of input ports of the arbitration subunit, where when the arbitration subunit processes a message, the arbitration subunit selects the input port that has the highest priority among the input ports having a message to be processed, and after the message processing of that input port is completed, readjusts the priority of each input port.

8. The workload certification chip according to claim 2,

the arbitration unit is further configured to set different weights for the plurality of input ports of the arbitration unit, where the weight value of each input port represents the number of messages that can be continuously processed by the input port, and to set different priorities for the plurality of input ports of the arbitration unit, respectively.

9. The workload certification chip according to claim 2,

the interconnection unit comprises a plurality of input ports and a plurality of output ports, each input port is connected with one arbitration unit, each output port is connected with one first node outlet, and the interconnection unit is used for sending the message output by the arbitration unit to any one first node outlet.

10. The workload certification chip according to claim 2 or 9, wherein the interconnection unit is a full crossbar.