CN117332831A - Distributed neural network accelerator system - Google Patents

Distributed neural network accelerator system

Info

Publication number
CN117332831A
CN117332831A (publication) / CN202311271771.XA (application)
Authority
CN
China
Prior art keywords
tensor
data
version number
accelerator
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311271771.XA
Other languages
Chinese (zh)
Inventor
胡杏
韩虎生
党朴成
宋新开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202311271771.XA
Publication of CN117332831A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities

Abstract

The invention provides a distributed neural network accelerator system in which a host node is configured to: remotely authenticate each accelerator node; compile the model with a neural network compiler to generate a dataflow graph and determine the dependencies of the subgraphs assigned to the accelerator nodes; and split the compiled model and distribute each subgraph to an accelerator node. The computation layers of each subgraph comprise: an interface layer, expressed as a transferable tensor comprising transfer tensor data and first auxiliary data, where the transfer tensor data is stored in off-chip memory and the first auxiliary data is stored in on-chip memory; and an internal layer, expressed as a normal tensor comprising normal tensor data and second auxiliary data, where the normal tensor data is stored in off-chip memory, the second auxiliary data comprises a second tensor version number and a second tensor MAC, the second tensor version number is stored in on-chip memory, and the second tensor MAC is stored in off-chip memory. This arrangement reduces the memory-access overhead and storage overhead of the version numbers (VNs).

Description

Distributed neural network accelerator system
Technical Field
The invention relates to the technical field of distributed systems, in particular to a distributed neural network accelerator system.
Background
As technology advances, computing systems increasingly rely on hardware accelerators to improve performance and energy efficiency. For example, modern Machine Learning (ML) models, such as Deep Neural Networks (DNNs), are computationally intensive and increasingly run on hardware accelerators. Hardware accelerators are also widely used for other computationally intensive workloads, such as video decoding, signal processing, encryption operations, genome assembly, and the like.
Meanwhile, as training data grows, neural network models have become larger, and a single accelerator can no longer support training and inference of large models, so multi-machine, multi-card distributed training has been widely adopted. Current parallel strategies include pipeline parallelism, data parallelism and tensor parallelism. Pipeline parallelism divides the model into several consecutive parts; each accelerator stores only part of the model weights and performs the corresponding computation, spreading out the storage and compute pressure. In data parallelism, each accelerator keeps a complete copy of the model and trains or infers on only part of the data, with multi-card model aggregation and redistribution after each round. Tensor parallelism splits the large tensors used during computation across multiple accelerators to reduce per-accelerator storage pressure, for example splitting a multi-head self-attention layer by head or splitting the weights of a fully connected layer by row or column.
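As an illustration of the tensor-parallelism strategy described above, the following minimal NumPy sketch (not part of the patent; shapes and values are arbitrary assumptions) splits a fully connected layer's weight by column across two notional accelerators and shows that concatenating the partial outputs reproduces the full-layer result.

```python
# Illustrative sketch (not part of the patent): column-wise tensor parallelism
# for a fully connected layer.  Two notional accelerators each hold half of
# the weight columns; concatenating their partial outputs reproduces the
# result of the un-partitioned layer.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 256))            # batch of input activations
w = rng.standard_normal((256, 512))          # full fully-connected weight

w_shards = np.split(w, 2, axis=1)            # column split across 2 accelerators
partials = [x @ shard for shard in w_shards] # each accelerator computes its shard
y = np.concatenate(partials, axis=1)         # gather the partial outputs

assert np.allclose(y, x @ w)                 # identical to the full-layer result
```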
In many applications, hardware accelerators may handle proprietary or sensitive data that requires strong security protection. For example, ML algorithms typically require collecting, storing and processing large amounts of personal and potentially private data from users to train models. Furthermore, because of their high computational demands, training and inference are typically performed on remote servers rather than on client devices such as smartphones, which means that private data and ML models may be exposed if a server is compromised or malicious.
One promising approach to providing strong confidentiality and integrity guarantees in an untrusted environment is to create a hardware-protected Trusted Execution Environment (TEE). Encryption of off-chip memory is a basic technique for implementing a hardware-protected TEE. In conventional secure processor designs, off-chip memory protection is a major source of performance overhead. For general-purpose processors, the memory protection scheme must handle arbitrary sequences of accesses to arbitrary memory locations, and memory is typically protected at cache-block granularity. In a secure processor, decryption latency is hidden using a counter encryption mode, where the counter value is typically a combination of the memory address and a Version Number (VN). The version number is stored in memory and incremented each time an encrypted block is written. To protect the integrity of off-chip memory, a Message Authentication Code (MAC) is appended to each cache block in memory. Furthermore, because the VNs cannot all be stored on chip and must be kept in external storage, their integrity also needs to be protected. To this end, traditional schemes build a Merkle tree structure that reduces the integrity guarantee of all VNs to a single root node. The root node is kept on chip; on each access the tree is reconstructed and compared with the on-chip root, and if they match, the VNs are considered untampered and integrity is ensured.
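The counter-mode protection described above can be sketched as follows. This is an illustrative model only: HMAC-SHA256 stands in for the AES-based keystream and MAC hardware, and the 64-byte block size and key names are assumptions, not details from the patent.

```python
# Illustrative model of counter-mode memory protection with a per-block
# version number (VN) and MAC.  HMAC-SHA256 stands in for the AES-based
# keystream/MAC hardware; the 64-byte block size and key names are assumed.
import hmac
import hashlib

BLOCK = 64  # cache-block granularity in bytes

def keystream(enc_key: bytes, address: int, vn: int, length: int = BLOCK) -> bytes:
    # counter value = memory address || version number (|| chunk index)
    out, i = b"", 0
    while len(out) < length:
        ctr = address.to_bytes(8, "big") + vn.to_bytes(8, "big") + i.to_bytes(4, "big")
        out += hmac.new(enc_key, ctr, hashlib.sha256).digest()
        i += 1
    return out[:length]

def write_block(enc_key, mac_key, address, vn, plaintext: bytes):
    padded = plaintext.ljust(BLOCK, b"\0")
    cipher = bytes(p ^ k for p, k in zip(padded, keystream(enc_key, address, vn)))
    mac = hmac.new(mac_key, cipher + address.to_bytes(8, "big") + vn.to_bytes(8, "big"),
                   hashlib.sha256).digest()
    return cipher, mac   # ciphertext and MAC go off chip; the VN is tracked separately

def read_block(enc_key, mac_key, address, vn, cipher, mac):
    expected = hmac.new(mac_key, cipher + address.to_bytes(8, "big") + vn.to_bytes(8, "big"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("integrity check failed: tampered data or stale VN")
    return bytes(c ^ k for c, k in zip(cipher, keystream(enc_key, address, vn)))
```

In such a scheme, every write of a block would increment its VN, so replaying an old ciphertext/MAC pair fails the check because the MAC is bound to the stale VN; this is why the VNs themselves must be protected, for example by the Merkle tree described above.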
The VN and MAC incur significant performance and storage overhead. Addressing this problem, the recent work MGX (MGX: Near-Zero Overhead Memory Protection for Data-Intensive Accelerators) exploits the fixed dataflow of neural networks to infer the version number VN directly from compile-time scheduling information and the accelerator's on-chip state during execution, so the VNs do not need to be stored, and both the construction of the integrity tree and its access cost at each memory access are avoided, greatly reducing overhead. However, the secure memory problem of distributed multi-accelerator systems remains unsolved. MMT (Efficient Distributed Secure Memory with Migratable Merkle Tree) saves the software re-encryption and re-decryption overhead between machines by building a migratable Merkle tree structure, but its encryption granularity is still a 64B memory line, so the storage overhead is too large.
Analysis of these two methods shows that neither can be applied directly in the distributed scenario of large-model inference. On the one hand, the dataflow of a neural network under a large model is no longer completely static. In pipeline parallelism, to save storage a single node retains only the tensors of the receiving layer, while the other forward-inference tensors are regenerated from the receiving-layer tensors during backpropagation, so the completely static design of MGX cannot meet this requirement. On the other hand, the MACs of the MMT approach themselves occupy considerable memory, which translates into additional machines in a memory-limited large-model training scenario: it is estimated that 10% memory overhead can require 10-20% more machines, a significant cost given how expensive accelerators such as GPUs are. Meanwhile, the MMT method relies on memory addresses for encryption and decryption, so ciphertext must be re-encrypted and re-decrypted by the TEE layer after communication.
Disclosure of Invention
To address these problems, the invention provides a distributed neural network accelerator system that designs an on-chip allocation scheme for VNs during pipeline-parallel large-model training, reducing the memory-access overhead and storage overhead of the VNs; communication is performed at the granularity of neural network layer tensors, reducing communication overhead.
To achieve the above object, an aspect of the present invention provides a distributed neural network accelerator system, including a host node, a plurality of accelerator nodes, the host node configured to:
remote authentication is performed with each of the accelerator nodes,
compiling the model by utilizing a neural network compiler to generate a data flow graph, and determining the dependency relationship of subgraphs among the accelerator nodes in the data flow graph;
splitting the compiled model;
assigning each subgraph to each of the accelerator nodes;
wherein the calculation layer of each sub-graph comprises:
an interface layer for transmitting a transferable tensor, the transferable tensor comprising transfer tensor data, and first auxiliary data;
the transfer tensor data is stored in off-chip memory,
the first auxiliary data is stored in an on-chip memory and comprises a first tensor state, a first tensor version number and a first tensor MAC;
an internal layer, whose data is expressed by a normal tensor and comprises normal tensor data and second auxiliary data;
the normal tensor data is stored in off-chip memory,
the second auxiliary data comprises a second tensor version number and a second tensor MAC, wherein the second tensor version number is stored in an on-chip memory, and the second tensor MAC is stored in an off-chip memory.
Optionally, the interface layer includes:
an input layer that relies on data from other accelerator nodes as input;
and an output layer whose output data is used as the input of other accelerator nodes.
Optionally, the remote authentication by the host node to all accelerator nodes includes:
the host node generates an encryption key by using an on-chip trusted root;
the host node requests remote authentication from each accelerator node and establishes a trusted software communication channel;
each accelerator node generates a report according to the on-chip trusted root and sends the report to the host node;
the host node verifies the credibility of the report by means of the manufacturer, and assigns a global accelerator number and the encryption key to each accelerator node to complete remote authentication.
Optionally, in the process of segmenting the compiled model,
splitting the compiled model with the optimization targets of reducing the dependency relationships of the sub-graphs among the accelerator nodes and balancing the computation delay of each sub-graph;
for a model with a simple structure, slicing is carried out according to layering of the model;
and for the model with a complex structure, the optimal allocation strategy of reinforcement learning is utilized for segmentation.
Optionally, the host node constructs the transferable tensor, and transmits the transferable tensor through the interface layer, including:
loading plaintext data into a trusted execution environment, performing AES encryption to obtain ciphertext data, and generating the first tensor MAC;
setting the first tensor version number to 0, wherein the first tensor state is a legal state, and generating the first auxiliary data;
acquiring all accelerator nodes according to the compiled dependency relationship of the subgraphs among the accelerator nodes;
generating a common cooperative key for each accelerator node, storing the common cooperative key in the on-chip memory, encrypting and transmitting the first auxiliary data by using the cooperative key, and executing decryption on the on-chip memory by a receiver after obtaining the encrypted first auxiliary data;
and transmitting the transfer tensor data.
Optionally, when the first auxiliary data transmission is started, the first tensor state of the host node is set from a legal state to an illegal state, and reading and writing of data are forbidden;
after the transfer tensor data transfer is completed, the first tensor state of the host node is set to a read-only state.
Optionally, after receiving the first auxiliary data, the receiver verifies the version number of the first tensor, and verifies that the version number of the first tensor of the transferable tensor is the largest among the version numbers of all tensors of the current node.
Optionally, the internal layer performs inference calculations, including:
the accelerator node reserves the feature version number and the weight version number on the on-chip memory, wherein the feature version number increases and decreases according to a current on-chip execution state, and the second tensor version number comprises: feature version number, weight version number;
the accelerator node determines the result of the current layer according to the compiled data flow graph, reserves the current layer and the second tensor version number when the dependency relationship is maintained, and deletes the redundant second tensor version number after the dependency relationship is released;
if the input data of the internal layer is a transferable tensor, acquiring the first auxiliary data stored in the on-chip memory, and if the input data is a normal tensor, deriving the second auxiliary data from the on-chip execution state;
and after the computation of the internal layer of the current subgraph is completed, the output normal tensor is packed into a transferable tensor and transmitted through the interface layer.
Optionally, on a single node, during forward inference, the first tensor version number of the last-layer transferable tensor is set larger, so that an unused segment of version numbers is reserved between it and the first tensor version number of the previous-layer transferable tensor;
when recalculation is performed, the first tensor version number of the last-layer transferable tensor is set to the first tensor version number of the previous-layer transferable tensor plus one.
Optionally, each of the accelerator nodes includes a memory controller configured to:
the encryption and decryption engine is responsible for encrypting local memory writes and decrypting local memory reads;
the integrity verification engine is responsible for generating the first tensor MAC and/or the second tensor MAC for local memory to carry out integrity verification;
the communication verification engine is responsible for exchanging keys with remote nodes, establishing a communication channel and transmitting the data stored in the on-chip memory; and
after receiving the data stored in the on-chip memory, verifying the validity of the remote node's version number, and after the verification is completed, receiving the transfer tensor data and/or normal tensor data and the first tensor MAC and/or the second tensor MAC.
The advantages of the invention are as follows:
the distributed neural network accelerator system provided by the invention uses the tensor of the neural network layer as granularity to carry out communication, and two data structures of the transferable tensor and the common tensor are arranged, in the subgraph to be processed by each accelerator node, the interface layer data is represented by the transferable tensor, and the data of the inner layer is represented by the common tensor, so that compared with the data which uses the memory row (64B) as the transmission granularity, the proportion of auxiliary data MAC and VN is reduced, and the communication cost is reduced.
Meanwhile, the transfer tensor data of a transferable tensor is stored in off-chip memory while its first auxiliary data is kept in on-chip memory; the normal tensor data of a normal tensor is stored in off-chip memory, the second tensor version number VN of its second auxiliary data is kept in on-chip memory and inferred from the on-chip execution state, and the second tensor MAC is stored in off-chip memory. This allocation reduces the memory-access overhead and storage overhead of the VNs.
Drawings
FIG. 1 illustrates a data structure diagram of a distributed neural network accelerator architecture provided by an embodiment of the present invention;
FIG. 2 shows a flow diagram of global authentication;
FIG. 3 illustrates a flow diagram of the transfer of a transferable tensor through an interface layer;
FIG. 4 shows the VN conflict caused by recalculation (left-hand diagram) and the monotonic increase of VNs guaranteed by reserving VN segments (right-hand diagram);
fig. 5 shows a schematic diagram of an accelerator node.
Wherein,
10-interface layer;
11-transferable tensor;
111-transfer tensor data;
112-first assistance data;
1121-a first tensor state;
1122-first tensor version number;
1123-a first tensor MAC;
20-inner layer;
21-normal tensor;
211-ordinary tensor data;
212-second assistance data;
2121-second tensor version number;
2122-second tensor MAC;
30-a memory controller;
31-an encryption and decryption engine;
32-an integrity verification engine;
33-a communication verification engine;
40-on-chip memory;
50-off-chip memory;
S11 to S14, S31 to S34: steps.
Detailed Description
In order to make the above features and effects of the present invention more clearly understood, the following specific examples are given with reference to the accompanying drawings.
A distributed neural network accelerator system comprising a host node, a plurality of accelerator nodes, wherein the host node is configured to implement the process of:
remote authentication is carried out on each accelerator node, so that global authentication is realized;
compiling the model by utilizing a neural network compiler to generate a data flow graph, and determining the dependency relationship of subgraphs among all accelerator nodes in the data flow graph;
splitting the compiled model;
each sub-graph is assigned to each accelerator node.
In this embodiment, as shown in fig. 1, to support a trusted distributed neural network accelerator architecture, two data structures, namely a transferable tensor and a normal tensor, are designed for communication at the granularity of neural network layer tensors. Each sub-graph to be processed by an accelerator node includes an interface layer and an internal layer. Specifically:
In this embodiment, the interface layer 10 is mainly used for transmitting the transferable tensor 11, and interface-layer data is represented by transferable tensors. The transferable tensor 11 specifically includes the transfer tensor data 111 and the first auxiliary data 112, so that it can be transmitted between different nodes. The transfer tensor data 111 is stored in the off-chip memory 50; the first auxiliary data 112 is stored in the on-chip memory 40 and comprises a first tensor state 1121, a first tensor version number 1122 and a first tensor MAC 1123. In a specific implementation, the first tensor state mainly comprises an illegal state, a legal state and a read-only state, guaranteeing that at most one accelerator node in the whole system holds update and write rights and preventing replay attacks. The first tensor version number VN is also used to prevent replay attacks: its value increases globally on every update, and no two tensors ever share the same version number, which ensures the uniqueness of version numbers, avoids the use of physical addresses, and thus avoids re-encryption and re-decryption caused by address changes after a transfer tensor is transmitted. The first tensor MAC is used to guarantee the integrity of the transfer tensor.
Furthermore, the interface layer specifically includes an input layer and an output layer: the input layer depends on data from other accelerator nodes as its input, and the output data of the output layer serves as the input of other accelerator nodes.
In the present embodiment, the internal layer 20 performs the internal inference computation, and its data is represented by a normal tensor 21 comprising normal tensor data 211 and second auxiliary data 212. The normal tensor data 211 is stored in the off-chip memory 50. The second auxiliary data 212 includes a second tensor version number 2121 and a second tensor MAC 2122. The second tensor version number VN 2121 mainly includes a feature version number, a weight version number, a gradient version number and the like, which are kept in the on-chip memory 40 and obtained by the control processor through inference over the on-chip execution state; the second tensor MAC 2122 is stored in the off-chip memory 50 to save on-chip memory space.
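The two tensor kinds and their on-chip/off-chip placement described above might be modeled as follows; class and field names are illustrative and not taken from the patent.

```python
# Illustrative sketch of the two tensor kinds and their memory placement;
# class and field names are not taken from the patent.
from dataclasses import dataclass
from enum import Enum

class TensorState(Enum):
    ILLEGAL = 0     # no node may read or write the tensor
    LEGAL = 1       # exactly one node holds update/write rights
    READ_ONLY = 2   # sender's copy after a transfer completes

@dataclass
class TransferableTensor:      # interface-layer data
    ciphertext: bytes          # transfer tensor data 111 -> off-chip memory 50
    state: TensorState         # first tensor state 1121  -> on-chip memory 40
    version: int               # first tensor VN 1122     -> on-chip memory 40
    mac: bytes                 # first tensor MAC 1123    -> on-chip memory 40

@dataclass
class NormalTensor:            # internal-layer data
    ciphertext: bytes          # normal tensor data 211   -> off-chip memory 50
    mac: bytes                 # second tensor MAC 2122   -> off-chip memory 50
    # the second tensor VN 2121 is not stored per tensor: it is kept on chip
    # and inferred from the execution state (feature VN, weight VN, ...)
```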
Therefore, in this embodiment, communication is performed at the granularity of neural network layer tensors with two data structures, the transferable tensor and the normal tensor: in the sub-graph processed by each accelerator node, interface-layer data is represented by transferable tensors and internal-layer data by normal tensors, which reduces the proportion of auxiliary data (MAC and VN) and the communication overhead compared with using a 64B memory line as the transmission granularity.
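A rough back-of-envelope comparison illustrates the claimed saving. The 8-byte VN/MAC sizes and the 4096x4096 fp16 tensor below are assumptions chosen for illustration, not figures from the patent.

```python
# Back-of-envelope comparison: metadata per 64B memory line (a VN and a MAC
# for every line) versus metadata per layer tensor (one VN and one MAC for
# the whole tensor).  The 8-byte VN/MAC sizes and the 4096x4096 fp16 tensor
# are illustrative assumptions.
tensor_bytes = 4096 * 4096 * 2            # one fp16 layer tensor
line_bytes, vn_bytes, mac_bytes = 64, 8, 8

per_line_meta = (tensor_bytes // line_bytes) * (vn_bytes + mac_bytes)
per_tensor_meta = vn_bytes + mac_bytes

print(f"per-line metadata  : {per_line_meta / tensor_bytes:.1%} of the tensor size")
print(f"per-tensor metadata: {per_tensor_meta / tensor_bytes:.2e} of the tensor size")
# roughly 25% versus a negligible fraction: transmitting at tensor granularity
# makes the VN/MAC share of the communicated data essentially disappear.
```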
Meanwhile, the transfer tensor data of a transferable tensor is stored in off-chip memory while its first auxiliary data is kept in on-chip memory; the normal tensor data of a normal tensor is stored in off-chip memory, the second tensor version number VN of its second auxiliary data is kept in on-chip memory and inferred from the on-chip execution state, and the second tensor MAC is stored in off-chip memory. This allocation reduces the memory-access overhead and storage overhead of the VNs.
The distributed neural network accelerator system of the present invention will be described in detail below:
In this embodiment, the host node remotely authenticates all accelerator nodes to implement global authentication, as shown in fig. 2, which is a flow chart of global authentication; the process specifically includes:
in the distributed case, it is necessary to ensure that the remote accelerator node hardware is trusted. Therefore, first, the host node generates an encryption key using the on-chip trusted root (S11), which is common to all nodes in the distributed neural network accelerator system, and includes the AES encryption key and the MAC generation key, so as to avoid re-encryption and decryption during data transmission between nodes. Then, the host node requests remote authentication from each accelerator node and establishes a trusted software communication channel (S12); each accelerator node generates a report from the on-chip root of trust and sends it to the host node (S13). Finally, the host node verifies the trustworthiness of the report by means of the manufacturer, assigns a global accelerator number and encryption key to each accelerator node, and completes the remote authentication (S14).
Further, in this embodiment, during model compilation and splitting, the host node first compiles the model with a neural network compiler, generates a dataflow graph, and determines the dependencies of the subgraphs assigned to the accelerator nodes in the dataflow graph. In some embodiments a neural network compiler such as TVM may be used. After compilation, the model is split. In this embodiment the compiled model is split with two optimization targets: reducing the sub-graph dependencies between accelerator nodes and balancing the computation delay of each sub-graph. The former keeps the data dependencies between subgraphs as small as possible and thus reduces the traffic between accelerator nodes; the latter load-balances the pipeline stages to achieve optimal overall throughput. In a specific implementation, a model with a simple structure, for example the ResNet series, can be split according to the layering of the model; for models with complex structures, such as networks found with NAS techniques, and for more complex and diverse accelerator clusters, reinforcement learning can be used to learn the optimal allocation strategy.
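As a toy stand-in for the partitioning strategies mentioned above (it is not the patent's algorithm), the following sketch splits a layered model into contiguous pipeline stages by balancing cumulative per-layer latency.

```python
# Toy stand-in for the partitioning step (not the patent's algorithm): split a
# layered model into contiguous pipeline stages so that per-stage compute
# latency is roughly balanced; cross-stage edges become inter-node dependencies.
def partition_layers(latencies, num_stages):
    """Greedy contiguous split of a layered model by cumulative latency."""
    target = sum(latencies) / num_stages
    stages, current, acc = [], [], 0.0
    for layer, cost in enumerate(latencies):
        current.append(layer)
        acc += cost
        if acc >= target and len(stages) < num_stages - 1:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

layer_latency = [1.0, 2.5, 2.5, 1.0, 3.0, 2.0, 1.5, 2.5]   # per-layer cost (ms), assumed
print(partition_layers(layer_latency, num_stages=3))
# -> [[0, 1, 2], [3, 4, 5], [6, 7]]
```

In terms of the data structures above, tensors that cross a stage boundary would become interface-layer transferable tensors, while tensors inside a stage remain normal tensors.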
In this embodiment, after the above model compilation and splitting are completed, each sub-graph is distributed to an accelerator node; transferable tensor data is transmitted through the interface layer of the current sub-graph, and inference computation is performed in its internal layer. Specifically:
In this embodiment, after global authentication and model compilation/splitting are completed, the host node holds the plaintext input data and transmits the constructed transferable tensor through the interface layer to the subsequent accelerator nodes for computation, as shown in fig. 3, which is a schematic flow diagram of transferring a transferable tensor through the interface layer.
First, a transferable tensor is constructed by generating the first auxiliary data: the plaintext data is loaded into the trusted execution environment, AES-encrypted to obtain the ciphertext data, and the first tensor MAC is generated; meanwhile, the first tensor version number VN is set to 0 and the first tensor state to the legal state, which yields the first auxiliary data (S31) and completes the construction of the transferable tensor.
Further, once the transferable tensor has been constructed, efficient trusted transmission begins. First the first auxiliary data is transmitted: all receiving accelerator nodes are obtained from the compiled dependencies of the subgraphs among the accelerator nodes (S32); for each accelerator node a common cooperative key is generated (for example with the Diffie-Hellman protocol) and stored in on-chip memory, the first auxiliary data is encrypted with the cooperative key and transmitted, and the receiver decrypts it in on-chip memory (S33). Finally, after the first auxiliary data has been transmitted, the transfer tensor data is transmitted (S34); because the transfer tensor data is already ciphertext it does not need to be re-encrypted, so the communication overhead is small. This completes the transmission of the transferable tensor through the interface layer.
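Steps S31 to S34 might look roughly like the sketch below. The crypto and transport are reduced to placeholder callables (aes_encrypt, session_key, send) standing in for the TEE encryption, the cooperative key and the communication channel; all names are assumptions.

```python
# Illustrative sketch of steps S31-S34.  aes_encrypt, session_key and send
# are placeholder callables standing in for the TEE crypto, the cooperative
# (e.g. Diffie-Hellman) key and the transport; all names are assumptions.
import hmac
import hashlib
import json

def make_mac(mac_key, ciphertext, vn):
    # address-free MAC: bound only to the ciphertext and the global VN
    return hmac.new(mac_key, ciphertext + vn.to_bytes(8, "big"), hashlib.sha256).digest()

def build_transferable(plaintext, aes_encrypt, mac_key):
    ciphertext = aes_encrypt(plaintext)                 # S31: AES encryption in the TEE
    aux = {"state": "LEGAL", "vn": 0,                   # S31: VN = 0, legal state
           "mac": make_mac(mac_key, ciphertext, 0).hex()}
    return ciphertext, aux

def send_transferable(ciphertext, aux, receivers, session_key, send):
    aux["state"] = "ILLEGAL"                            # sender gives up read/write rights
    for node in receivers:                              # S32: receivers from compiled deps
        sealed = session_key(node).encrypt(json.dumps(aux).encode())
        send(node, "aux", sealed)                       # S33: aux data under co-op key
    for node in receivers:
        send(node, "data", ciphertext)                  # S34: ciphertext shipped as-is,
    aux["state"] = "READ_ONLY"                          #      no re-encryption needed
```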
In addition, in this embodiment, when transmission of the first auxiliary data begins, the first tensor state of the host node is set from the legal state to the illegal state, and reading and writing of the data are forbidden; after the transfer tensor data has been transmitted, the first tensor state of the host node is set to the read-only state. Meanwhile, after receiving the first auxiliary data, the receiver checks the first tensor version number and verifies that it is the largest among the version numbers of all tensors on the current node, so as to prevent replay attacks. Since the first tensor version number VN increases globally, no two transferable tensors can have the same version number, and no address participates in encryption or in generating the first tensor MAC, so no re-encryption or re-decryption is needed because of an address change when the tensor is transmitted to other nodes. In this embodiment, the global version number thus replaces the address in encryption and MAC generation, avoiding the TEE-layer re-encryption and re-decryption problem after communication.
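The receive-side check described above, sketched under the same assumptions (an address-free MAC bound to the tensor VN; names are illustrative):

```python
# Sketch of the receive-side check: the incoming VN must exceed every VN
# already known to this node (anti-replay), and the address-free MAC must
# verify.  Names are illustrative.
import hmac
import hashlib

def accept_transferable(known_vns, incoming_vn, incoming_mac, ciphertext, mac_key):
    if known_vns and incoming_vn <= max(known_vns):
        raise ValueError("replayed or stale transferable tensor: VN is not maximal")
    expected = hmac.new(mac_key, ciphertext + incoming_vn.to_bytes(8, "big"),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, incoming_mac):
        raise ValueError("first tensor MAC mismatch: integrity check failed")
    known_vns.add(incoming_vn)      # VN and MAC are kept in on-chip memory
    return True
```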
Further, in this embodiment, during the inference computation in the internal layer of the current sub-graph, the internal-layer data is represented by normal tensors comprising normal tensor data and second auxiliary data. During inference, the accelerator node keeps the feature version number and the weight version number of the second auxiliary data in on-chip memory, and the feature version number increases and decreases with the current on-chip execution state. Meanwhile, the accelerator node determines the result of the current layer from the compiled dataflow graph, keeps the current layer and its second tensor version number VN while a dependency on them is maintained, and deletes the redundant second tensor version number once the dependency is released. In addition, the weights are read-only during inference, so all weights share one weight version number. When inference is executed, if the input data of the internal layer is a transferable tensor, the first auxiliary data stored in on-chip memory is used; if the input data is a normal tensor, the second auxiliary data is derived from the current on-chip execution state in the control processor of each accelerator node, with the second version number incremented by one on every update of a tensor. After the internal-layer computation of the current sub-graph is finished, the output normal tensor data is packed into transferable tensor data, which is then transmitted through the interface layer.
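On-chip VN inference for normal tensors might be organized as below; the class is an illustrative sketch of the described bookkeeping, not a structure defined by the patent.

```python
# Illustrative bookkeeping for on-chip VN inference of normal tensors: feature
# tensors get a VN derived from the execution state, all read-only weights
# share one weight VN, and entries are dropped once the compiled dataflow
# graph shows their dependencies are released.  Not a structure from the patent.
class OnChipVNState:
    def __init__(self, weight_vn=1):
        self.weight_vn = weight_vn      # shared by all read-only weights
        self.feature_vn = {}            # tensor id -> VN, kept only while depended on

    def on_update(self, tensor_id):
        # every update of a feature tensor bumps its inferred VN by one
        self.feature_vn[tensor_id] = self.feature_vn.get(tensor_id, 0) + 1
        return self.feature_vn[tensor_id]

    def vn_for_input(self, tensor_id, is_weight=False):
        return self.weight_vn if is_weight else self.feature_vn[tensor_id]

    def release(self, tensor_id):
        # dependency released according to the compiled dataflow graph
        self.feature_vn.pop(tensor_id, None)
```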
Furthermore, in a specific implementation, the recalculation problem on a single accelerator node is handled by reserving VN segments. As shown in fig. 4, the left diagram shows the VN conflict caused by recalculation and the right diagram shows how reserving VN segments keeps the VNs monotonically increasing. To save device memory on a single node, activation tensors other than those of the receiving layer are discarded during forward inference and regenerated during gradient backpropagation. Because interface-layer transmission requires inter-node VNs to increase monotonically to prevent replay attacks, recalculation during backpropagation would produce regenerated tensor version numbers larger than the tensor VNs later received over the interface layer, causing conflicts, as shown in the left diagram of fig. 4. In this embodiment, during forward inference the first tensor version number of the last-layer transferable tensor is set larger, reserving an unused segment between it and the first tensor version number of the previous-layer transferable tensor; for example, in the forward pass of the right diagram of fig. 4, layer 3 has VN=3 and the current layer 4 has VN=7, with an unused VN segment reserved between them. During recalculation, the first tensor version number of the last-layer transferable tensor is set to the first tensor version number of the previous-layer transferable tensor plus one; in the recalculation pass of the right diagram of fig. 4, layer 3 has VN=5 and the current layer 4 has VN=6. The length of the reserved segment is determined by the number of tensors regenerated at recalculation time, which is known at compile time, ensuring that the maximum VN used during recalculation is still smaller than the VN later sent back to the current node by the subsequent node.
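The reserved-VN-segment scheme can be illustrated numerically with the FIG. 4 values; the helper function itself is an assumption for illustration.

```python
# Numeric illustration of the reserved-VN-segment scheme using the FIG. 4
# values; the helper itself is an assumption for illustration.
def forward_interface_vn(prev_vn, recompute_tensors):
    # reserve `recompute_tensors` unused VNs between the two interface layers
    return prev_vn + recompute_tensors + 1

prev_layer_vn = 3                 # layer 3 transferable tensor in the forward pass
gap = 3                           # tensors regenerated at recalculation time (known at compile time)
layer4_vn = forward_interface_vn(prev_layer_vn, gap)           # -> 7, as in FIG. 4

# recalculation: regenerated activations take VNs from the reserved segment
recomputed_vns = [prev_layer_vn + i + 1 for i in range(gap)]   # [4, 5, 6]
assert max(recomputed_vns) < layer4_vn    # no conflict with VNs received back later
```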
In addition, in this embodiment, supporting the distributed security architecture only requires extending the memory controller of the accelerator node. As shown in fig. 5, which shows the architecture of an accelerator node, each accelerator node includes a memory controller 30 configured with an encryption and decryption engine 31, an integrity verification engine 32 and a communication verification engine 33. The encryption and decryption engine is responsible for encrypting local memory writes and decrypting local memory reads; the integrity verification engine is responsible for generating the first tensor MAC and/or the second tensor MAC for local memory and performing integrity verification; the communication verification engine is responsible for exchanging keys with remote nodes, establishing a communication channel to transmit the data stored in on-chip memory, verifying the validity of the remote node's version number (i.e. that it is larger than the previous version number) after receiving the data stored in on-chip memory, and, after verification, receiving the transfer tensor data and/or normal tensor data and the first tensor MAC and/or second tensor MAC.
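The extended memory controller might be modeled as three engines behind one controller object, as sketched below; all interfaces are illustrative software stand-ins for what would be hardware blocks.

```python
# Illustrative model of the extended memory controller 30: three engines
# behind one controller object.  All interfaces are software stand-ins for
# what would be hardware blocks.
class MemoryController:
    def __init__(self, crypto_engine, integrity_engine, comm_engine):
        self.crypto = crypto_engine         # 31: encrypts local writes, decrypts local reads
        self.integrity = integrity_engine   # 32: generates/checks tensor MACs
        self.comm = comm_engine             # 33: key exchange + transfer of on-chip data

    def write(self, address, plaintext, vn):
        cipher = self.crypto.encrypt(plaintext, vn)
        mac = self.integrity.make_mac(cipher, vn)
        return cipher, mac                  # ciphertext and MAC go off chip

    def receive_remote(self, peer, aux, payload):
        self.comm.verify_vn(peer, aux["vn"])            # VN must exceed the last one seen
        self.integrity.check_mac(payload, aux["mac"], aux["vn"])
        return self.crypto.decrypt(payload, aux["vn"])  # accept tensor data after checks
```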
In summary, in this embodiment, communication is performed at the granularity of neural network layer tensors with two data structures, the transferable tensor and the normal tensor; the sub-graph processed by each accelerator node contains two kinds of computation layers, an interface layer and an internal layer, with interface-layer data represented by transferable tensors and internal-layer data by normal tensors, which reduces the proportion of auxiliary data (MAC and VN) and the communication overhead compared with using a 64B memory line as the transmission granularity. Meanwhile, the global version number is used instead of the address during encryption and MAC generation, avoiding TEE-layer re-encryption and re-decryption after communication. In addition, the transfer tensor data of a transferable tensor is stored in off-chip memory and its first auxiliary data in on-chip memory; the normal tensor data of a normal tensor is stored in off-chip memory, the second tensor version number VN of its second auxiliary data is kept in on-chip memory and inferred from the on-chip execution state, and the second tensor MAC is saved to off-chip memory. This allocation reduces the memory-access overhead and storage overhead of the VNs. Furthermore, reserving VN segments for the recalculation problem avoids communication conflicts.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in the reverse order depending on the functions involved. For example, the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A distributed neural network accelerator system comprising a host node, a plurality of accelerator nodes, wherein the host node is configured to:
remote authentication is performed with each of the accelerator nodes,
compiling the model by utilizing a neural network compiler to generate a data flow graph, and determining the dependency relationship of subgraphs among the accelerator nodes in the data flow graph;
splitting the compiled model;
assigning each subgraph to each of the accelerator nodes;
wherein the calculation layer of each sub-graph comprises:
an interface layer for transmitting a transferable tensor, the transferable tensor comprising transfer tensor data, and first auxiliary data;
the transfer tensor data is stored in off-chip memory,
the first auxiliary data is stored in an on-chip memory and comprises a first tensor state, a first tensor version number and a first tensor MAC;
an internal layer, whose data is expressed by a normal tensor and comprises normal tensor data and second auxiliary data;
the normal tensor data is stored in off-chip memory,
the second auxiliary data comprises a second tensor version number and a second tensor MAC, wherein the second tensor version number is stored in an on-chip memory, and the second tensor MAC is stored in an off-chip memory.
2. The system of claim 1, wherein
the interface layer comprises:
an input layer that relies on data from other accelerator nodes as input;
and an output layer whose output data is used as the input of other accelerator nodes.
3. The system of claim 1, wherein
the host node remotely authenticates to all accelerator nodes, including:
the host node generates an encryption key by using an on-chip trusted root;
the host node requests remote authentication from each accelerator node and establishes a trusted software communication channel;
each accelerator node generates a report according to the on-chip trusted root and sends the report to the host node;
the host node verifies the credibility of the report by means of the manufacturer, and assigns a global accelerator number and the encryption key to each accelerator node to complete remote authentication.
4. The system of claim 1, wherein
in the process of slicing the compiled model,
splitting the compiled model with the optimization targets of reducing the dependency relationships of the sub-graphs among the accelerator nodes and balancing the computation delay of each sub-graph;
for a model with a simple structure, slicing is carried out according to layering of the model;
and for the model with a complex structure, the optimal allocation strategy of reinforcement learning is utilized for segmentation.
5. The system of claim 1, wherein
the host node constructs the transferable tensor, for transmission through the interface layer, comprising:
loading plaintext data into a trusted execution environment, performing AES encryption to obtain ciphertext data, and generating the first tensor MAC;
setting the first tensor version number to 0, wherein the first tensor state is a legal state, and generating the first auxiliary data;
acquiring all accelerator nodes according to the compiled dependency relationship of the subgraphs among the accelerator nodes;
generating a common cooperative key for each accelerator node, storing the common cooperative key in the on-chip memory, encrypting and transmitting the first auxiliary data by using the cooperative key, and executing decryption on the on-chip memory by a receiver after obtaining the encrypted first auxiliary data;
and transmitting the transfer tensor data.
6. The system of claim 5, wherein
when the first auxiliary data transmission is started, the first tensor state of the host node is set from a legal state to an illegal state, and reading and writing of data are forbidden;
after the transfer tensor data transfer is completed, the first tensor state of the host node is set to a read-only state.
7. The system of claim 5, wherein
after receiving the first auxiliary data, the receiver verifies the first tensor version number, and verifies that the first tensor version number of the transferable tensor is the largest among the version numbers of all tensors of the current node.
8. The system of claim 1, wherein
the internal layer performs inference calculations, including:
the accelerator node reserves a feature version number and a weight version number on the on-chip memory, wherein the feature version number increases and decreases according to a current on-chip execution state, and the second tensor version number comprises: feature version number, weight version number;
the accelerator node determines the result of the current layer according to the compiled data flow graph, reserves the current layer and the second tensor version number when the dependency relationship is maintained, and deletes the redundant second tensor version number after the dependency relationship is released;
if the input data of the internal layer is a transferable tensor, acquiring the first auxiliary data stored in the on-chip memory, and if the input data is a normal tensor, deriving the second auxiliary data from the on-chip execution state;
and after the computation of the internal layer of the current subgraph is completed, the output normal tensor is packed into a transferable tensor and transmitted through the interface layer.
9. The system of claim 8, wherein
on a single node, during forward inference, the first tensor version number of the last-layer transferable tensor is set larger, so that an unused segment of version numbers is reserved between it and the first tensor version number of the previous-layer transferable tensor;
when recalculation is performed, the first tensor version number of the last-layer transferable tensor is set to the first tensor version number of the previous-layer transferable tensor plus one.
10. The system of claim 1, wherein
each of the accelerator nodes includes a memory controller configured to:
the encryption and decryption engine is responsible for encrypting local memory writes and decrypting local memory reads;
the integrity verification engine is responsible for generating the first tensor MAC and/or the second tensor MAC for local memory to carry out integrity verification;
the communication verification engine is responsible for exchanging keys with remote nodes, establishing a communication channel and transmitting the data stored in the on-chip memory; and
after receiving the data stored in the on-chip memory, verifying the validity of the remote node's version number, and after the verification is completed, receiving the transfer tensor data and/or normal tensor data and the first tensor MAC and/or the second tensor MAC.
CN202311271771.XA | priority 2023-09-28 | filed 2023-09-28 | Distributed neural network accelerator system | Pending | CN117332831A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311271771.XA (published as CN117332831A (en)) | 2023-09-28 | 2023-09-28 | Distributed neural network accelerator system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311271771.XA (published as CN117332831A (en)) | 2023-09-28 | 2023-09-28 | Distributed neural network accelerator system

Publications (1)

Publication Number | Publication Date
CN117332831A (en) | 2024-01-02

Family

ID=89292525

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311271771.XA (CN117332831A (en), Pending) | Distributed neural network accelerator system | 2023-09-28 | 2023-09-28

Country Status (1)

Country | Link
CN | CN117332831A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination