WO2021138842A1 - Methods and apparatuses for processing neural network - Google Patents

Methods and apparatuses for processing neural network

Info

Publication number
WO2021138842A1
Authority
WO
WIPO (PCT)
Prior art keywords
core
input
computation
cores
weight matrix
Prior art date
Application number
PCT/CN2020/070943
Other languages
French (fr)
Inventor
Yang Jiao
Yongquan ZHOU
Jun He
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2020/070943
Publication of WO2021138842A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • In machine learning (ML) or deep learning (DL), a neural network (NN) is a mechanism that basically mimics how a human brain learns.
  • A deep neural network (DNN) is a category of neural networks. Over the years, neural networks (e.g., DNNs) have demonstrated successes in various domains such as computer vision, natural language processing, and the like.
  • Many neural networks have a large weight matrix, which requires significant computational and storage resources for neural network training or deployment.
  • Some techniques have been developed to process neural networks with a large weight matrix on multi-core processing units. For example, one solution is to utilize a level-2 shared memory (i.e., sharing memory across multiple processors) to expand the storage space. But this solution is complicated, difficult to manage, and would significantly increase communication delay (e.g., read, write, or transmission delay).
  • an exemplary method for processing a neural network comprising: receiving a plurality of inputs at a processing unit, the processing unit including a plurality of cores, and a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
  • an exemplary heterogeneous acceleration processing unit can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores.
  • a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores.
  • a core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core.
  • the communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
  • an exemplary terminal can include a host unit and a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit.
  • the HAPU can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores.
  • a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores.
  • a core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core.
  • the communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
  • an exemplary non-transitory computer readable storage media stores a set of instructions.
  • the instructions are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising: receiving a plurality of inputs; receiving a first input of the plurality of inputs at a core of the plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
  • FIG. 1 is a schematic diagram of an exemplary neural network, according to some embodiments of the present disclosure.
  • FIG. 2 is a block diagram of an exemplary heterogeneous acceleration processing unit (HAPU) , according to some embodiments of the present disclosure.
  • FIG. 3A is a block diagram of an exemplary machine learning system, according to some embodiments of the present disclosure.
  • FIG. 3B is a schematic diagram of an exemplary cloud system, according to some embodiments of the present disclosure.
  • FIG. 4 is a flowchart of an exemplary method for processing a neural network, according to some embodiments of the present disclosure.
  • FIG. 5 is a flowchart of another exemplary method for processing a neural network, according to some embodiments of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure.
  • FIG. 1 illustrates an exemplary neural network (NN) 100 in which embodiments of the present disclosure can be implemented.
  • neural network 100 can include an input layer 120 that accepts inputs, e.g., inputs 110-1, ..., 110-m. Inputs can include an image, text, or any other structured or unstructured data for processing by neural network 100.
  • neural network 100 can accept a plurality of inputs simultaneously. For example, in FIG. 1, neural network 100 can accept up to m inputs simultaneously.
  • input layer 120 can accept up to m inputs in rapid succession, e.g., such that input 110-1 is accepted by input layer 120 in one cycle, a second input is accepted by input layer 120 in a second cycle in which input layer 120 pushes data from input 110-1 to a first hidden layer, and so on.
  • the present disclosure does not intend to limit the number of inputs, or the way of inputting, such as simultaneous input, rapid succession input, or the like.
  • Input layer 120 can comprise one or more nodes, e.g., nodes 120-1, 120-2, ..., 120-a. Each node can execute an activation function based on corresponding input (e.g., one or more of inputs 110-1, ..., 110-m) and scale the output from the activation function by a particular weight associated with the node.
  • An activation function can comprise a Heaviside step function, a Gaussian function, a multiquadratic function, an inverse multiquadratic function, a sigmoidal function, a ReLU function, a Leaky ReLU function, a Tanh function, or the like.
  • a weight can comprise a positive value between 0.0 and 1.0 or any other numerical value configured to allow some nodes in the layer to have corresponding output scaled more or less than output corresponding to other nodes in the layer.
  • a plurality of weights can form a weight matrix.
  • neural network 100 can include one or more hidden layers, e.g., hidden layers 130-1, ..., 130-n.
  • Each hidden layer can comprise one or more nodes.
  • hidden layer 130-1 comprises nodes 130-1-1, 130-1-2, 130-1-3, ..., 130-1-b
  • hidden layer 130-n comprises nodes 130-n-1, 130-n-2, 130-n-3, ..., 130-n-c.
  • nodes of the hidden layers can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
  • neural network 100 can include an output layer 140 that finalizes outputs, e.g., outputs 150-1, 150-2, ..., 150-d.
  • Output layer 140 can comprise one or more nodes, e.g., nodes 140-1, 140-2, ..., 140-d. Similar to nodes of input layer 120 and of the hidden layers, nodes of output layer 140 can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
  • the layers of neural network 100 can use any connection scheme.
  • one or more layers (e.g., input layer 120, hidden layer 130-1, ..., hidden layer 130-n, output layer 140, or the like) can be connected using a convolutional scheme, a sparsely connected scheme, or the like. Such embodiments can use fewer connections between one layer and a previous layer than depicted in FIG. 1.
  • neural network 100 can additionally or alternatively use backpropagation, e.g., by using long short-term memory (LSTM) nodes or the like.
  • although neural network 100 is depicted similar to a convolutional neural network (CNN), neural network 100 can comprise a recurrent neural network (RNN) or any other type of neural network.
  • FIG. 2 illustrates an exemplary heterogeneous acceleration processing unit (HAPU) 200, according to some embodiments of the present disclosure.
  • HAPU 200 can include a plurality of cores 202 (e.g., cores 202a, 202b, 202c, and 202d) , an interface 204, a command parser (CP) 206, and a communication unit (CU) 208.
  • HAPU 200 can also include other components, such as a global memory (not shown) and the like.
  • HAPU 200 can be implemented as a neural network processing unit (NPU) .
  • Interface 204 can provide communication between HAPU 200 and external devices.
  • interface 204 can include a peripheral component interconnect express (PCI-E) interface to provide connection with a host unit (not shown in FIG. 2) .
  • Interface 204 can also include a universal serial bus (USB) , a joint test action group (JTAG) interface, a TUN/TAP interface, and/or the like.
  • CP 206 can receive commands or instructions from external devices (e.g., via interface 204) and distribute the commands to corresponding components, such as one or more cores 202 or communication unit 208.
  • CP 206 can interact with the host unit (e.g., under the supervision of a kernel mode driver (KMD)) and receive commands from the host unit.
  • the commands can include a memory access command or a computation command.
  • CP 206 can distribute memory access commands to CU 208 and computation commands to one or more cores 202.
  • CU 208 can be communicatively coupled with components of HAPU 200, and assist with transferring data between these components. For example, CU 208 can assist with transferring data between multiple cores 202 (e.g., cores 202a-202d) or within each core 202a-202d. CU 208 can also allow off-chip devices to access both on-chip and off-chip memory without causing an interrupt. For example, CU 208 can load data or instructions into local memory of cores 202. Thus, CU 208 can also generate memory addresses and initiate memory read or write cycles.
  • CU 208 can also contain several hardware registers that can be written and read by the one or more cores 202, including a memory address register, a byte-count register, one or more control registers, and/or other types of registers. These registers can specify the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, and/or the number of bytes to transfer in one burst. It is appreciated that each core (e.g., core 202a) can include a sub-CU (e.g., transmission engine 2026 as shown in FIG. 2) , which can be used to transfer data within the core and across cores.
  • CU 208 can include a direct memory access (DMA) unit (not shown) and a bus (not shown) .
  • the bus can provide high speed cross-core communication.
  • the bus also connects cores 202 with other units, such as the off-chip memory or peripherals.
  • CU 208 can also move block data among cores 202 via a bus. While a single core 202 is capable of handling a typical training or inference task, a plurality of cores 202 can work together via the bus to take on large and complex tasks (e.g., processing a neural network with a large weight matrix) .
  • Cores 202a-202d can include one or more computation engines configured to perform one or more operations based on commands, e.g., commands received from CP 206.
  • the operations can include multiplication, addition, multiply-accumulate, convolution, element-wise operation, and the like.
  • one or more computation engines of core 202a can include a convolution unit, a pooling unit, a matrix multiplication unit, an element-wise operation (EWOP) unit, and/or the like.
  • Cores 202a-202d can also include one or more local memories (LMs) 2022 and a transmission engine 2026.
  • Local memory 2022 can provide storage space with fast read/write speed.
  • storage space of local memory 2022 can be 250 megabytes (MB) and above, which can reduce interaction with a global memory.
  • Transmission engine 2026 can be included in CU 208 or in each core 202a-202d as an independent communication unit. Transmission engine 2026 can be communicatively coupled with components of core 202, e.g., local memory 2022 and computation engine 2024, and assist with transferring data or commands (or instructions) between these components. Transmission engine 2026 can also assist with communicating data or commands across cores. For example, transmission engine 2026 can transmit data from local memory 2022 or computation engine 2024 to components outside the core, e.g., CU 208, or receive data from components outside the core to local memory 2022.
  • cores 202a-202d can also include a sequencer (not shown) configured to retrieve commands and distribute the commands to other components of the core.
  • the sequencer can distribute a computation command to computation engine 2024 to perform a computation, or distribute a transmission command to transmission engine 2026 to perform a transmission operation.
  • FIG. 3A illustrates an exemplary machine learning system 300, according to some embodiments of the present disclosure.
  • machine learning system 300 can be implemented in a computing device or a terminal.
  • machine learning system 300 can include a host unit 302 (e.g., a central processing unit (CPU) ) , a disk 304, a host memory 306, and a HAPU 308.
  • host memory 306 can be an integral memory or an external memory associated with host unit 302.
  • Host memory 306 can be a local or a global memory.
  • disk 304 may comprise an external memory configured to provide additional memory for host unit 302.
  • Host unit 302 (e.g., an X86 or ARM central processing unit) can be coupled with host memory 306 and disk 304, and configured to process general instructions.
  • HAPU 308 can be coupled to host unit 302 through a peripheral interface (e.g., interface 204) .
  • HAPU 308 can be configured to be used as a co-processor of host unit 302.
  • a compiler can be included in a host unit (e.g., host unit 302 of FIG. 3A) , host memory (e.g., host memory 306 of FIG. 3A) or HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) .
  • the compiler can be configured to push one or more commands or instructions to HAPU.
  • the compiler can be implemented as a program or computer software that transforms computer codes written in one programming language into instructions for HAPU to create an executable program.
  • a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, code optimization, and code generation, or combinations thereof.
  • the compiler can compile a neural network to generate static or semi-static parameters, e.g., connections among nodes (or neurons) and weights of the nodes.
  • the commands pushed into HAPU can be further distributed to corresponding components (e.g., one or more cores 202 or CU 208 of FIG. 2) of HAPU by CP (e.g., CP 206 of FIG. 2).
  • FIG. 3B illustrates a schematic diagram of an exemplary cloud system 310, according to some embodiments of the disclosure.
  • the cloud system 310 can include a plurality of computing servers (e.g., computing servers 312 and 314) .
  • computing server 312 can, for example, include the machine learning system 300, which includes HAPU 308.
  • the cloud system 310 may be connected to user devices via a network. With the assistance of HAPU 308, cloud system 310 can provide extended AI capabilities of image recognition, facial recognition, translations, 3D modeling, and the like.
  • HAPU can be implemented in computing devices or terminals in various ways.
  • HAPU can be integrated in a computing device or terminal, such as a smart phone, a tablet, wearable device, or the like.
  • FIG. 4 is a flowchart of an exemplary method 400 for processing a neural network, according to some embodiments of the present disclosure.
  • Method 400 can be implemented by a processing unit, such as HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A- 3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B.
  • method 400 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • a plurality of inputs can be transmitted to each of a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) in sequence.
  • CU 208 can transmit a plurality of inputs to a plurality of cores 202 (e.g., cores 202a-202d) of HAPU 200 respectively.
  • CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 of HAPU 200.
  • CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories of the plurality of cores 202.
  • CU 208 can load input_a, input_b, input_c and input_d to core 202a, core 202b, core 202c, and core 202d, respectively.
  • CU 208 can communicate (e.g., transfer or copy) an input from one core to another core.
  • CU 208 can transfer input_a from core 202a to core 202d, transfer input_b from core 202b to 202a, transfer input_c from core 202c to core 202b, and transfer input_d from core 202d to core 202c.
  • CU 208 can copy input_a from core 202a and save a copy of input_a in core 202d, copy input_b from core 202b and save a copy of input_b in 202a, copy input_c from core 202c and save a copy of input_c in core 202b, and copy input_d from core 202d and save a copy of input_d in core 202c.
  • the HAPU may perform a plurality of rounds of communications until every input is received at each of the cores.
  • the HAPU may perform an initial round of loading of the inputs to respective cores of the HAPU and (N-1) rounds of communications of the current inputs in the cores to other cores of the HAPU in sequence.
  • transmission engine 2026 can assist this communication by, e.g., reading the input from local memory and transmitting it to CU 208.
  • the plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network.
  • the plurality of inputs can include a plurality of activations.
  • the number of the inputs can be equal to or less than the number of the cores in the HAPU. In the case that the number of inputs is less than the number of available cores, some of the cores may not have an input.
  • a computation is repeatedly performed using the part of a weight matrix corresponding to the core and the input received at the core.
  • each of the plurality of cores can perform a computation using the part of the weight matrix corresponding to the core and an input received (e.g., loaded from an external memory or communicated from another core) at the core.
  • Each core can perform a plurality of rounds of computations, each round with a different input. The number of the rounds of computations performed on each core can be equal to the number of inputs.
  • the weight matrix relates to the neural network being processed.
  • the weight matrix can be divided into a plurality of parts.
  • the plurality of cores each has a corresponding part of the weight matrix.
  • the number of parts of the weight matrix can be equal to the number of cores.
  • Each core can store a corresponding part of the weight matrix in its local memory.
  • CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) .
  • Because each part of the weight matrix has a smaller size than the entire weight matrix, requirements for computation and storage resources can be reduced. Then, when the plurality of parts of the weight matrix are distributed to multiple cores, each core would have sufficient computation and storage resources to perform a computation with a corresponding part of the weight matrix.
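  • As a purely illustrative sketch (not part of the original disclosure; NumPy and all names below are assumptions), the division of a weight matrix into per-core parts described above can be pictured as follows:

```python
import numpy as np

# Hypothetical sizes; the disclosure does not specify them.
num_cores = 4
out_features, in_features = 16, 8
weight_matrix = np.random.rand(out_features, in_features)

# Divide the weight matrix into one part per core, here along the output
# dimension, so each part is smaller than the entire matrix and can fit
# in the local memory of a single core.
weight_parts = np.array_split(weight_matrix, num_cores, axis=0)

for core_id, part in enumerate(weight_parts):
    print(f"core_{core_id} holds a part of shape {part.shape}")
```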
  • communication of an input to another core can be performed in parallel with current computation using this input.
  • communication of input_a from core 202a to core 202d can be performed in parallel with computation on core 202a using input_a and corresponding part_a of the weight matrix
  • communication of input_b from core 202b to 202a can be performed in parallel with computation on core 202b using input_b and corresponding part_b of the weight matrix, and so on.
  • results of computations using an input received from another core can be communicated to the core which the input is initially loaded to.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • Results of computations using input_a and a part of the weight matrix stored at core 202d can be communicated by CU 208 to core 202a
  • results of computations using input_b and a part of the weight matrix stored at core 202a can be communicated by CU 208 to core 202b, and so on.
  • transmission engine 2026 can perform the communication by, e.g., reading the result from local memory and transmitting it to CU 208.
  • step 405 may be omitted from method 400.
  • step 405 can be performed in parallel with current round of computations. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on.
  • each of the plurality of cores performs rounds of computation using each of the inputs and the part of the weight matrix corresponding to the core. For example, referring to FIG. 2, each of the input_a, input_b, input_c and input_d is computed with each part of the weight matrix corresponding to the cores 202a-202d. After each of the plurality of inputs is used by each of the plurality of cores for computation, the method 400 may proceed to step 407.
  • results of the computations can be output.
  • the results can include computation results using all inputs and all parts of the weight matrix.
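  • The overlap of cross-core communication with the current round of computation described above can be sketched roughly as below (a software simulation only, assuming Python threads; on the HAPU such a transfer would be handled by CU 208 or transmission engine 2026):

```python
import threading

import numpy as np

def compute(weight_part, current_input):
    # Current round of computation on this core: use the locally stored
    # part of the weight matrix and the input currently held by the core.
    return weight_part @ current_input

def communicate(current_input, next_core_buffer):
    # Stand-in for communicating the current input to another core while
    # the computation above is still running.
    next_core_buffer.append(current_input.copy())

weight_part = np.random.rand(4, 8)   # this core's part of the weight matrix
current_input = np.random.rand(8)    # input currently held by this core
next_core_buffer = []                # stand-in for the next core's local memory

transfer = threading.Thread(target=communicate,
                            args=(current_input, next_core_buffer))
transfer.start()                                       # start the transfer ...
partial_result = compute(weight_part, current_input)   # ... compute in parallel
transfer.join()                                        # transfer done before the next round
```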
  • FIG. 5 illustrates a flowchart of another exemplary method 500 for processing a neural network, according to some embodiments of the present disclosure.
  • Method 500 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B.
  • method 500 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • a plurality of inputs can be loaded onto a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) .
  • CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 (e.g., cores 202a-d) of HAPU 200.
  • CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories 2022 of the plurality of cores 202.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • the plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network.
  • the plurality of inputs can include a plurality of activations.
  • a number of the inputs can be equal to or less than a number of the cores. In the case that the number of inputs is less than the number of cores, some of the plurality of cores do not have an input.
  • a computation can be performed using corresponding part of a weight matrix and an input loaded onto the core.
  • each of the plurality of cores can perform a computation using the corresponding part of the weight matrix and an input loaded to the core.
  • the weight matrix relates to the neural network under processing.
  • the weight matrix can be divided into a plurality of parts.
  • the plurality of cores can each have a corresponding part of the weight matrix.
  • the number of parts of the weight matrix can be equal to the number of cores.
  • Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d).
  • an input on one core can be communicated to another core.
  • the input is sequentially communicated to another core.
  • CU 208 can sequentially communicate input_a from core 202a to core 202d, input_b from core 202b to 202a, input_c from core 202c to core 202b, and input_d from core 202d to core 202c.
  • transmission engine 2026 can assist this communication.
  • transmission engine 2026 on core 202a can read input_a from local memory 2022 and transmit it to CU 208.
  • a computation can be performed using corresponding part of the weight matrix and an input communicated to the core.
  • core 202a can perform a computation using input_b and part_a of the weight matrix
  • core 202b can perform a computation using input_c and part_b of the weight matrix, and so on.
  • a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to in step 501.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • Results of computations using input_a and part_b, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202a
  • results of computations using input_b and part_a, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202b, and so on.
  • the communication of a result of computation can be performed in parallel with next round of computation.
  • step 507 may be omitted from method 500.
  • At step 511, whether every input has been circulated through each of the plurality of cores can be determined. If not (e.g., indicated by NO in FIG. 5), method 500 proceeds back to step 505 and performs another round of computations and communications.
  • an input on one core can be communicated to another core. The communication of the input can be performed in parallel with the computation using the input.
  • each core can perform another computation using corresponding part of the weight matrix and an input communicated to the core.
  • a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to. The communication of the result of the computation can be performed in parallel with next round of computation. For example, with reference to FIG.
  • CU 208 can communicate input_b from core 202a to core 202d, input_c from core 202b to core 202a, input_d from core 202c to core 202b, and input_a from core 202d to core 202c.
  • core 202a can perform a computation using input_c and part_a of the weight matrix
  • core 202b can perform a computation using input_d and part_b of the weight matrix, and so on.
  • the result of computation on core 202a using input_c and part_a of the weight matrix can be communicated to core 202c
  • the result of computation on core 202b using input_d and part_b of the weight matrix can be communicated to core 202d, and so on.
  • Method 500 can include a plurality of rounds of communications and computations (e.g., steps 505 and 507) until every input goes through each of the cores.
  • communication of an input can be performed in parallel with current computations using this input.
  • communication of input_b from core 202a to core 202d can be performed in parallel with computation on core 202a using input_b and part_a of the weight matrix
  • communication of input_c from core 202b to 202a can be performed in parallel with computation on core 202b using input_c and part_b of the weight matrix, and so on.
  • At step 513, results of the computations can be output.
  • the results can include computation results using each of the inputs and each part of the weight matrix corresponding to the plurality of cores.
  • FIG. 6 is a schematic diagram illustrating an exemplary process 600 of processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure. It is appreciated that process 600 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, process 600 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • the HAPU can include four cores, core_0, core_1, core_2 and core_3. Each core can be associated with (e.g., store) a corresponding part of the weight matrix.
  • a weight matrix can be divided into four parts, w0, w1, w2 and w3, which are distributed to core_0, core_1, core_2 and core_3, respectively.
  • a core can store its corresponding part of the weight matrix in local memory.
  • the HAPU can include more or fewer cores and the weight matrix can include more or fewer parts.
  • a number of parts of the weight matrix can be equal to a number of cores on the HAPU.
  • the number of parts of the weight matrix can be less than the number of cores on the HAPU. In this case, some of cores on the HAPU have no corresponding parts of the weight matrix.
  • a plurality of inputs e.g., b0, b1, b2 and b3 as shown in FIG. 6, are loaded onto the plurality of cores on the HAPU, e.g., core_0, core_1, core_2 and core_3, respectively.
  • the number of inputs can be equal to the number of cores on the HAPU each having a part of weight matrix. In some other embodiments, the number of inputs can be less than the number of cores on the HAPU each having a part of weight matrix.
  • each core can perform a first round of computation using an input on the core and a part of the weight matrix corresponding to the core.
  • core_0 can perform a first round of computation using an input b0 on the core_0 and w0 of the weight matrix
  • core_1 can perform a first round of computation using an input b1 on the core_1 and w1 of the weight matrix
  • Each core can store the result of this round of computation (shown as b0/w0, b1/w1, b2/w2 or b3/w3 in FIG. 6) in its local memory.
  • each core can also store the result of this round of computation (e.g., b0/w0, b1/w1, b2/w2 and b3/w3) at a corresponding address in an output (e.g., output_0, output_1, output_2 or output_3).
  • an input on one core can be communicated to another core, for example, in a sequential order.
  • CU 208 can perform the communication with assistance of transmission engine 2026.
  • transmission engine 2026 can transmit or read an input from the local memory 2022 to CU 208, which communicates it to another core.
  • input b0 can be communicated from core_0 to core_3
  • input b1 can be communicated from core_1 to core_0
  • input b2 can be communicated from core_2 to core_1
  • input b3 can be communicated from core_3 to core_2.
  • the communication of an input can be performed in parallel with the computation on the core using this input.
  • each core can perform a second round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a second round of computation using an input b1 on the core_0 and w0 of the weight matrix
  • core_1 can perform a second round of computation using an input b2 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the second round of computation (shown as b1/w0, b2/w1, b3/w2 and b0/w3 in FIG. 6) in its local memory.
  • a second round of sequential communication of an input on one core to another core can be performed.
  • input b1 on core_0 can be communicated to core_3
  • input b2 on core_1 can be communicated to core_0
  • input b3 on core_2 can be communicated to core_1
  • input b0 on core_3 can be communicated to core_2.
  • the second round of communication of an input can also be performed in parallel with the second round of computation on the core using this input.
  • each core can perform a third round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a third round of computation using an input b2 on the core_0 and w0 of the weight matrix
  • core_1 can perform a third round of computation using an input b3 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the third round of computation (shown as b2/w0, b3/w1, b0/w2 and b1/w3 in FIG. 6) in its local memory.
  • a third round of communication of an input on one core to another core can be performed.
  • input b2 on core_0 can be communicated to core_3
  • input b3 on core_1 can be communicated to core_0
  • input b0 on core_2 can be communicated to core_1
  • input b1 on core_3 can be communicated to core_2.
  • the third round of communication of an input can also be performed in parallel with the third round of computation on the core using the input.
  • a result of previous round (e.g., second round) of computation can be communicated to the core which the input is initially loaded to.
  • CU 208 can perform the communication of the result with assistance of transmission engine 2026.
  • transmission engine 2026 can transmit the result from the local memory 2022 to CU 208 which communicates it to the corresponding core.
  • result b1/w0 on core_0 can be communicated to core_1
  • result b2/w1 on core_1 can be communicated to core_2
  • result b3/w2 on core_2 can be communicated to core_3
  • result b0/w3 on core_3 can be communicated to core_0.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • the communication of the result of previous round (e.g., second round) of computation can be performed in parallel with current round (e.g., third round) of computation.
  • each core can perform a fourth round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a fourth round of computation using an input b3 on the core_0 and w0 of the weight matrix
  • core_1 can perform a fourth round of computation using an input b0 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the fourth round of computation (shown as b3/w0, b0/w1, b1/w2, and b2/w3 in FIG. 6) in its local memory.
  • a result of previous round (e.g., third round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to.
  • result b2/w0 on core_0 can be communicated to core_2
  • result b3/w1 on core_1 can be communicated to core_3
  • result b0/w2 on core_2 can be communicated to core_0
  • result b1/w3 on core_3 can be communicated to core_1.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • the communication of the result of previous round (e.g., third round) of computation can be performed in parallel with current round (e.g., fourth round) of computation.
  • a result of the final round (e.g., fourth round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to.
  • result b3/w0 on core_0 can be communicated to core_3
  • result b0/w1 on core_1 can be communicated to core_0
  • result b1/w2 on core_2 can be communicated to core_1
  • result b2/w3 on core_3 can be communicated to core_2.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • outputs (e.g., output_0, output_1, output_2, and output_3) can be provided to other components of the HAPU or neural network.
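  • A minimal software sketch of the FIG. 6 walkthrough above is given below (illustrative only; NumPy, the rotation loop, and all variable names are assumptions, not the disclosed hardware implementation). It rotates four inputs b0-b3 through four cores holding weight parts w0-w3 and gathers each input's partial results into its own output:

```python
import numpy as np

num_cores = 4
rng = np.random.default_rng(0)

inputs = [rng.random(8) for _ in range(num_cores)]       # b0, b1, b2, b3
parts = [rng.random((4, 8)) for _ in range(num_cores)]   # w0, w1, w2, w3 (row blocks)

# outputs[i] collects every partial result belonging to input b_i,
# mirroring output_0 .. output_3 in FIG. 6.
outputs = [dict() for _ in range(num_cores)]

# held[c] is the index of the input currently on core_c; initially b_c is
# loaded onto core_c.
held = list(range(num_cores))

for round_idx in range(num_cores):
    # Each core computes b_i / w_c with its own weight part and the input
    # it currently holds; the result is stored for the core where b_i was
    # initially loaded (result routing is simulated by indexing outputs[i]).
    for core in range(num_cores):
        i = held[core]
        outputs[i][core] = parts[core] @ inputs[i]
    # Rotate inputs as in the walkthrough: core_0 -> core_3, core_1 -> core_0,
    # core_2 -> core_1, core_3 -> core_2.
    held = [held[(core + 1) % num_cores] for core in range(num_cores)]

# After num_cores rounds, outputs[i] holds b_i combined with w0..w3, i.e.,
# the product of input b_i with the full (stacked) weight matrix.
result_for_b0 = np.concatenate([outputs[0][c] for c in range(num_cores)])
```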
  • Embodiments of the disclosure can bring many technical advantages.
  • a plurality of cores can each have a part of, rather than the entire, weight matrix, and can perform parallel computations using parts of the weight matrix and multiple inputs.
  • Some embodiments of the disclosure can provide fast communication of data (e.g., inputs or results of computations) across cores, and perform the communication in parallel with computation, which can significantly reduce time for processing a neural network.
  • Embodiments of the disclosure can be applied to many products, environments, and scenarios.
  • some embodiments of the disclosure can be applied to Ali-NPU (e.g., Hanguang NPU) , Ali-Cloud, Ali-DAU (Database Acceleration Unit) , Ali-AI platform, GPU, TPU, or the like.
  • a computer readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc.
  • program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
  • a method for processing a neural network comprising:
  • the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
  • communicating the first input is performed in parallel with the first computation.
  • receiving the first input at the core comprises loading the first input from an external memory to the core.
  • the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
  • a heterogeneous acceleration processing unit comprising:
  • a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
  • a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  • the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
  • the heterogeneous acceleration processing unit according to any of clauses 11-17, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
  • a local memory for storing the first part of the weight matrix and a result of the first computation
  • At least one computation engine communicatively coupled with the local memory and configured to perform the first computation
  • a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
  • a non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
  • communicating the first input is performed in parallel with the first computation.
  • non-transitory computer readable storage media according to any of clauses 21-29, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
  • a terminal comprising:
  • a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
  • a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Methods and apparatuses for processing a neural network are provided. The methods include: receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and communicating the first input from the core to another core of the plurality of cores.

Description

METHODS AND APPARATUSES FOR PROCESSING NEURAL NETWORK
BACKGROUND
In machine learning (ML) or deep learning (DL) , a neural network (NN) is a mechanism that basically mimics how a human brain learns. A deep neural network (DNN) is a category of neural networks. Over the years, neural networks (e.g., DNNs) have demonstrated successes in various domains such as computer vision, natural language processing and the like.
Many neural networks have a large weight matrix, which requires significant computational and storage resources for neural network training or deployment. Some techniques have been developed to process neural networks with a large weight matrix on multi-core processing units. For example, one solution is to utilize a level-2 shared memory (i.e., sharing memory across multiple processors) to expand the storage space. But this solution is complicated, difficult to manage, and would significantly increase communication delay (e.g., read, write, or transmission delay).
SUMMARY
In some embodiments, an exemplary method for processing a neural network comprises: receiving a plurality of inputs at a processing unit, the processing unit including a plurality of cores, and a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary heterogeneous acceleration processing unit (HAPU) can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores. A weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores. A core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core. The communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary terminal can include a host unit and a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit. The HAPU can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores. A weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores. A core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core. The communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary non-transitory computer readable storage media stores a set of instructions. The instructions are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising: receiving a plurality of inputs; receiving a first input of the plurality of inputs at a core of the plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight  matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
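The per-core behavior summarized above can be sketched, purely for illustration, as the following Python snippet (all names, and the callback standing in for the communication unit, are assumptions rather than the claimed implementation):

```python
import numpy as np

def core_step(first_input, first_part, send_to_another_core):
    # Perform the first computation with the part of the weight matrix
    # assigned to this core, then pass the received input on to another core.
    partial_result = first_part @ first_input
    send_to_another_core(first_input)
    return partial_result

# Toy usage: a list stands in for the other core's local memory.
received_by_other_core = []
part = np.random.rand(4, 8)      # first part of the weight matrix, assigned to this core
first_input = np.random.rand(8)  # first input received at this core
result = core_step(first_input, part, received_by_other_core.append)
```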
Additional objects and advantages of the present disclosure will be set forth in part in the following detailed description, and in part will be obvious from the description, or may be learned by practice of the present disclosure. The objects and advantages of the present disclosure will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosed embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which comprise a part of this specification, illustrate several embodiments and, together with the description, serve to explain the principles and features of the disclosed embodiments. In the drawings:
FIG. 1 is a schematic diagram of an exemplary neural network, according to some embodiments of the present disclosure.
FIG. 2 is a block diagram of an exemplary heterogeneous acceleration processing unit (HAPU) , according to some embodiments of the present disclosure.
FIG. 3A is a block diagram of an exemplary machine learning system, according to some embodiments of the present disclosure.
FIG. 3B is a schematic diagram of an exemplary cloud system, according to some embodiments of the present disclosure.
FIG. 4 is a flowchart of an exemplary method for processing a neural network, according to some embodiments of the present disclosure.
FIG. 5 is a flowchart of another exemplary method for processing a neural network, according to some embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses, systems and methods consistent with aspects related to the invention as recited in the appended claims.
The methods and apparatuses disclosed herein can be used for configuring neural network processing units (NPUs) in various neural network-based architectures, such as deep neural networks (DNNs) , convolutional neural networks (CNNs) , recurrent neural networks (RNNs) , or the like.
FIG. 1 illustrates an exemplary neural network (NN) 100 in which embodiments of the present disclosure can be implemented. As depicted in FIG. 1, neural network 100 can include an input layer 120 that accepts inputs, e.g., inputs 110-1, ..., 110-m. Inputs can include an image, text, or any other structured or unstructured data for processing by neural network 100. In some embodiments, neural network 100 can accept a plurality of inputs simultaneously. For example, in FIG. 1, neural network 100 can accept up to m inputs simultaneously. Additionally or alternatively, input layer 120 can accept up to m inputs in rapid succession, e.g., such that input 110-1 is accepted by input layer 120 in one cycle, a second input is accepted by input layer 120 in a second cycle in which input layer 120 pushes data from input 110-1 to a first hidden layer, and so on. The present disclosure does not intend to limit the number of inputs, or the way of inputting, such as simultaneous input, rapid succession input, or the like.
Input layer 120 can comprise one or more nodes, e.g., nodes 120-1, 120-2, ..., 120-a. Each node can execute an activation function based on corresponding input (e.g., one or more of inputs 110-1, ..., 110-m) and scale the output from the activation function by a particular weight associated with the node. An activation function can comprise a Heaviside step function, a Gaussian function, a multiquadratic function, an inverse multiquadratic function, a sigmoidal function, a ReLU function, a Leaky ReLU function, a Tanh function, or the like. A weight can comprise a positive value between 0.0 and 1.0 or any other numerical value configured to allow some nodes in the layer to have corresponding output scaled more or less than output corresponding to other nodes in the layer. A plurality of weights can form a weight matrix.
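As a rough illustration of the node computation described above (the sizes, the sigmoid choice, and the matrix-vector form are assumptions made for the sketch, not limitations of FIG. 1):

```python
import numpy as np

def sigmoid(x):
    # One possible activation function from the list above.
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.random.rand(4)            # inputs accepted by input layer 120 (m = 4 here)
weight_matrix = np.random.rand(6, 4)  # the nodes' weights, gathered into a weight matrix

# Each node's activation output is scaled by its associated weights; stacking
# the weights of all nodes gives the weight matrix applied below.
layer_output = weight_matrix @ sigmoid(inputs)
```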
As further depicted in FIG. 1, neural network 100 can include one or more hidden layers, e.g., hidden layers 130-1, ..., 130-n. Each hidden layer can comprise one or more nodes. For example, in FIG. 1, hidden layer 130-1 comprises nodes 130-1-1, 130-1-2, 130-1-3, ..., 130-1-b, and hidden layer 130-n comprises nodes 130-n-1, 130-n-2, 130-n-3, ..., 130-n-c. Similar to nodes of input layer 120, nodes of the hidden layers can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
As further depicted in FIG. 1, neural network 100 can include an output layer 140 that finalizes outputs, e.g., outputs 150-1, 150-2, ..., 150-d. Output layer 140 can comprise one or more nodes, e.g., nodes 140-1, 140-2, ..., 140-d. Similar to nodes of input layer 120 and of the hidden layers, nodes of output layer 140 can execute activation functions  based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
Although depicted as fully connected in FIG. 1, the layers of neural network 100 can use any connection scheme. For example, one or more layers (e.g., input layer 120, hidden layer 130-1, ..., hidden layer 130-n, output layer 140, or the like) can be connected using a convolutional scheme, a sparsely connected scheme, or the like. Such embodiments can use fewer connections between one layer and a previous layer than depicted in FIG. 1.
Moreover, although depicted as a feedforward network in FIG. 1, neural network 100 can additionally or alternatively use backpropagation, e.g., by using long short-term memory (LSTM) nodes or the like. Accordingly, although neural network 100 is depicted similar to a convolutional neural network (CNN) , neural network 100 can comprise a recurrent neural network (RNN) or any other types of neural network.
FIG. 2 illustrates an exemplary heterogeneous acceleration processing unit (HAPU) 200, according to some embodiments of the present disclosure. As shown in FIG. 2, HAPU 200 can include a plurality of cores 202 (e.g.,  cores  202a, 202b, 202c, and 202d) , an interface 204, a command parser (CP) 206, and a communication unit (CU) 208. It is appreciated that HAPU 200 can also include other components, such as a global memory (not shown) and the like. In some embodiments, HAPU 200 can be implemented as a neural network processing unit (NPU) .
Interface 204 can provide communication between HAPU 200 and external devices. For example, interface 204 can include a peripheral component interconnect express (PCI-E) interface to provide connection with a host unit (not shown in FIG. 2) . Interface 204 can also include a universal serial bus (USB) , a joint test action group (JTAG) interface, a TUN/TAP interface, and/or the like.
CP 206 can receive commands or instructions from external devices (e.g., via interface 204), and distribute the commands to a corresponding component, such as one or more cores 202 or communication unit 208. For example, CP 206 can interact with the host unit (e.g., under the supervision of a kernel mode driver (KMD)), and receive commands from the host unit. The commands can include a memory access command or a computation command. CP 206 can distribute a memory access command to CU 208, and a computation command to one or more cores 202.
CU 208 can be communicatively coupled with components of HAPU 200, and assist with transferring data between these components. For example, CU 208 can assist with transferring data between multiple cores 202 (e.g., cores 202a-202d) or within each core 202a-202d. CU 208 can also allow off-chip devices to access both on-chip and off-chip memory without causing an interrupt. For example, CU 208 can load data or instructions into local memory of cores 202. Thus, CU 208 can also generate memory addresses and initiate memory read or write cycles. CU 208 can also contain several hardware registers that can be written and read by the one or more cores 202, including a memory address register, a byte-count register, one or more control registers, and/or other types of registers. These registers can specify the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, and/or the number of bytes to transfer in one burst. It is appreciated that each core (e.g., core 202a) can include a sub-CU (e.g., transmission engine 2026 as shown in FIG. 2) , which can be used to transfer data within the core and across cores.
In some embodiments, CU 208 can include a direct memory access (DMA) unit (not shown) and a bus (not shown) . The bus can provide high speed cross-core communication. The bus also connects cores 202 with other units, such as the off-chip memory or peripherals.
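For illustration only, the hardware registers described above can be pictured as a simple transfer descriptor. The field names and default values below are assumptions made for this sketch, not the actual register layout of CU 208:

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    READ_FROM_IO = 0     # reading from the I/O device
    WRITE_TO_IO = 1      # writing to the I/O device

@dataclass
class TransferDescriptor:
    # Mirrors the kinds of registers listed above for one block transfer.
    source_address: int        # memory address register (source)
    destination_address: int   # memory address register (destination)
    byte_count: int            # byte-count register
    direction: Direction       # control register: direction of the transfer
    unit_size: int = 4         # size of the transfer unit, in bytes
    burst_bytes: int = 64      # number of bytes to transfer in one burst

# A core could program one descriptor per block transfer, e.g.:
desc = TransferDescriptor(source_address=0x1000, destination_address=0x8000,
                          byte_count=4096, direction=Direction.WRITE_TO_IO)
```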
CU 208 can also move block data among cores 202 via a bus. While a single core 202 is capable of handling a typical training or inference task, a plurality of cores 202 can work together via the bus to take on large and complex tasks (e.g., processing a neural network with a large weight matrix) .
Cores 202a-202d (e.g., core 202a as shown on the right side of FIG. 2) can each include one or more computation engines configured to perform one or more operations based on commands, e.g., commands received from CP 206. The operations can include multiplication, addition, multiply-accumulate, convolution, element-wise operations, and the like. For example, the one or more computation engines of core 202a can include a convolution unit, a pooling unit, a matrix multiplication unit, an element-wise operation (EWOP) unit, and/or the like.
As shown in FIG. 2, each of cores 202a-202d (e.g., core 202a) can also include one or more local memories (LMs) 2022 and a transmission engine 2026. Local memory 2022 can provide storage space with fast read/write speed. In some embodiments, the storage space of local memory 2022 can be 250 megabytes (MB) or more, which can reduce interaction with a global memory. With local memory 2022, part or all of data access can be performed within each core 202a-202d, reducing the latency caused by data access.
Transmission engine 2026 can be included in CU 208 or in each core 202a-202d as an independent communication unit. Transmission engine 2026 can be communicatively coupled with components of core 202, e.g., local memory 2022 and computation engine 2024, and assist with transferring data or commands (or instructions) between these components. Transmission engine 2026 can also assist with communicating data or commands across cores. For example, transmission engine 2026 can transmit data from local memory 2022 or computation engine 2024 to components outside the core, e.g., CU 208, or receive data from components outside the core to local memory 2022.
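The components described above can be pictured with a minimal software model of a core and its communication path. The class and method names below are illustrative assumptions for this sketch, not the actual hardware interfaces of HAPU 200:

```python
import numpy as np

class Core:
    """Toy model of one core: a local memory plus a compute routine."""
    def __init__(self, name):
        self.name = name
        self.local_memory = {}          # stands in for local memory 2022

    def compute(self, input_key, weight_key):
        # Stands in for computation engine 2024: multiply the stored
        # weight part with the stored input.
        return self.local_memory[weight_key] @ self.local_memory[input_key]

class CommunicationUnit:
    """Toy model of CU 208 / transmission engine 2026: copies data across cores."""
    def transfer(self, src_core, dst_core, key):
        dst_core.local_memory[key] = src_core.local_memory[key]

# Usage: load a weight part and an input into core_a, then copy the input to core_b.
core_a, core_b = Core("core_a"), Core("core_b")
cu = CommunicationUnit()
core_a.local_memory["w_part"] = np.eye(2)
core_a.local_memory["input"] = np.array([1.0, 2.0])
cu.transfer(core_a, core_b, "input")
print(core_a.compute("input", "w_part"))   # [1. 2.]
```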
In some embodiments, each of cores 202a-202d can also include a sequencer (not shown) configured to retrieve commands and distribute the commands to other components of the core. For example, the sequencer can distribute a computation command to computation engine 2024 to perform a computation, or distribute a transmission command to transmission engine 2026 to perform a transmission operation.
FIG. 3A illustrates an exemplary machine learning system 300, according to some embodiments of the present disclosure. In some embodiments, machine learning system 300 can be implemented in a computing device or a terminal. As shown in FIG. 3A, machine learning system 300 can include a host unit 302 (e.g., a central processing unit (CPU) ) , a disk 304, a host memory 306, and a HAPU 308. In some embodiments, host memory 306 can be an integral memory or an external memory associated with host unit 302. Host memory 306 can be a local or a global memory. In some embodiments, disk 304 may comprise an external memory configured to provide additional memory for host unit 302.
Host unit 302 (e.g., an X86 or ARM central processing unit) can be coupled with host memory 306 and disk 304, and configured to process general instructions. For example, an operating system (OS) , a software, an application or a program can run on host unit 302. HAPU 308 can be coupled to host unit 302 through a peripheral interface (e.g., interface 204) . As referred to herein, a HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) can be a computing device for accelerating neural network processing tasks, e.g., neural network training or inference. In some embodiments, HAPU 308 can be configured to be used as a co-processor of host unit 302.
In some embodiments, a compiler can be included in a host unit (e.g., host unit 302 of FIG. 3A) , host memory (e.g., host memory 306 of FIG. 3A) or HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) . The compiler can be configured to push one or more commands or instructions to HAPU. In some embodiments, the compiler can be  implemented as a program or computer software that transforms computer codes written in one programming language into instructions for HAPU to create an executable program. In machine learning applications, a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, code optimization, and code generation, or combinations thereof. For example, the compiler can compile a neural network to generate static or semi-static parameters, e.g., connections among nodes (or neurons) and weights of the nodes.
As discussed above, the commands pushed into HAPU can be further distributed to corresponding components (e.g., one or more core 202 or CU 208 of FIG. 2) of HAPU by CP (e.g., CP 206 of FIG. 2) .
FIG. 3B illustrates a schematic diagram of an exemplary cloud system 310, according to some embodiments of the disclosure. The cloud system 310 can include a plurality of computing servers (e.g., computing servers 312 and 314) . In some embodiments, computing server 312 can, for example, include the machine learning system 300, which includes HAPU 308. The cloud system 310 may be connected to user devices via a network. With the assistance of HAPU 308, cloud system 310 can provide extended AI capabilities of image recognition, facial recognition, translations, 3D modeling, and the like.
It is appreciated, however, that a HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A-3B) can be implemented in computing devices or terminals in various ways. For example, the HAPU can be integrated in a computing device or terminal, such as a smart phone, a tablet, a wearable device, or the like.
FIG. 4 is a flowchart of an exemplary method 400 for processing a neural network, according to some embodiments of the present disclosure. Method 400 can be implemented by a processing unit, such as HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A- 3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, method 400 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
As shown in FIG. 4, at step 401, a plurality of inputs can be transmitted to each of a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) in sequence. For example, with reference to FIG. 2, CU 208 can transmit a plurality of inputs to a plurality of cores 202 (e.g., cores 202a-202d) of HAPU 200 respectively. CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 of HAPU 200. In some embodiments, CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories of the plurality of cores 202. For example, CU 208 can load input_a, input_b, input_c and input_d to core 202a, core 202b, core 202c, and core 202d, respectively. In some embodiments, CU 208 can communicate (e.g., transfer or copy) an input from one core to another core. For example, for input_a, input_b, input_c and input_d, CU 208 can transfer input_a from core 202a to core 202d, transfer input_b from core 202b to 202a, transfer input_c from core 202c to core 202b, and transfer input_d from core 202d to core 202c. As another example, CU 208 can copy input_a from core 202a and save a copy of input_a in core 202d, copy input_b from core 202b and save a copy of input_b in 202a, copy input_c from core 202c and save a copy of input_c in core 202b, and copy input_d from core 202d and save a copy of input_d in core 202c. The HAPU may perform a plurality of rounds of communications until every input is received at each of the cores. In the case that the number of the cores is N, the HAPU may perform an initial round of loading of the inputs to respective cores of the HAPU and (N-1) rounds of communications of the current inputs in the cores to other cores of the HAPU in sequence. In  some embodiments, transmission engine 2026 can assist this communication by, e.g., reading the input from local memory and transmitting it to CU 208.
The plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network. In some embodiments, the plurality of inputs can include a plurality of activations. The number of the inputs can be equal to or less than the number of the cores in the HAPU. In the case that the number of inputs is less than the number of available cores, some of the cores may not have an input.
At step 403, at each of the plurality of cores, a computation is repeatedly performed using the part of a weight matrix corresponding to the core and the input received at the core. For example, during the initial loading of the inputs or each round of communication of the inputs from other cores, each of the plurality of cores can perform a computation using the part of the weight matrix corresponding to the core and an input received (e.g., loaded from an external memory or communicated from another core) at the core. With reference to FIG. 2, each core 202 (e.g., core 202a, core 202b, core 202c or core 202d) can perform a computation using the part of the weight matrix corresponding to the core and each input loaded or communicated to the core by CU 208. Each core can perform a plurality of rounds of computations, each round with a different input. The number of the rounds of computations performed on each core can be equal to the number of inputs.
The weight matrix relates to the neural network being processed. The weight matrix can be divided into a plurality of parts. The plurality of cores each has a corresponding part of the weight matrix. The number of parts of the weight matrix can be equal to the number of cores. Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) .
Since each part of the weight matrix has a smaller size than the entire weight matrix, requirements for computation and storage resources can be reduced. When the plurality of parts of the weight matrix are distributed to multiple cores, each core can have sufficient computation and storage resources to perform a computation with its corresponding part of the weight matrix.
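For example, splitting the weight matrix by rows gives each core a smaller block whose partial result is simply a slice of the full product. The following is a minimal sketch under the assumption of a row-wise split with illustrative sizes; a column-wise split works analogously, with a final sum instead of a concatenation:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))          # full weight matrix
x = rng.standard_normal(6)               # one input (activation vector)

num_cores = 4
parts = np.split(W, num_cores, axis=0)   # part_a..part_d, one 2x6 block per core

# Each core only needs storage for its 2x6 block instead of the full 8x6 matrix.
partial_results = [p @ x for p in parts]

# Concatenating the per-core partial results reproduces the full computation.
assert np.allclose(np.concatenate(partial_results), W @ x)
```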
In some embodiments, communication of an input to another core can be performed in parallel with current computation using this input. For example, with reference to FIG. 2, communication of input_a from core 202a to core 202d can be performed in parallel with computation on core 202a using input_a and corresponding part_a of the weight matrix, communication of input_b from core 202b to 202a can be performed in parallel with computation on core 202b using input_b and corresponding part_b of the weight matrix, and so on.
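One way to picture this overlap is double buffering: while the computation engine works on the current input, the transfer of that input to the next core runs concurrently. The sketch below uses a Python thread purely as a stand-in for the transmission engine; the concurrency primitive and function names are assumptions for illustration, not the hardware mechanism:

```python
import threading
import numpy as np

def compute(core_weight_part, current_input, results):
    # Stand-in for the computation engine working on the current input.
    results.append(core_weight_part @ current_input)

def transfer(current_input, next_core_buffer):
    # Stand-in for CU 208 moving the same input to the neighboring core.
    next_core_buffer.append(current_input.copy())

w_part = np.eye(3)
x = np.arange(3.0)
results, next_core_buffer = [], []

# Launch the transfer, do the computation, then wait for the transfer to finish.
t = threading.Thread(target=transfer, args=(x, next_core_buffer))
t.start()
compute(w_part, x, results)
t.join()
```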
At step 405, results of computations using an input received from another core can be communicated to the core which the input is initially loaded to. For example, with reference to FIG. 2, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d. Results of computations using input_a and a part of the weight matrix stored at core 202d can be communicated by CU 208 to core 202a, results of computations using input_b and a part of the weight matrix stored at core 202a can be communicated by CU 208 to core 202b, and so on. In some embodiments, transmission engine 2026 can perform the communication by, e.g., reading the result from local memory and transmitting it to CU 208.
In some embodiments, step 405 may be omitted from method 400.
In some embodiments, step 405 can be performed in parallel with current round of computations. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can  be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on.
By performing steps 401, 403, and 405, each of the plurality of cores performs rounds of computation using each of the inputs and the part of the weight matrix corresponding to the core. For example, referring to FIG. 2, each of input_a, input_b, input_c and input_d is computed with each part of the weight matrix corresponding to cores 202a-202d. After each of the plurality of inputs has been used by each of the plurality of cores for computation, method 400 may proceed to step 407.
At step 407, results of the computations can be output. The results can include computation results using all inputs and all parts of the weight matrix.
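Putting steps 401-407 together, the method can be pictured as a ring rotation of the inputs across the cores, with each core repeatedly multiplying whatever input it currently holds by its fixed weight part. The following is a sketch with hypothetical sizes; real hardware would overlap the rotation with the computation as noted above:

```python
import numpy as np

rng = np.random.default_rng(2)
num_cores = 4
parts  = [rng.standard_normal((2, 6)) for _ in range(num_cores)]   # weight part per core
inputs = [rng.standard_normal(6) for _ in range(num_cores)]        # one input per core

# results[c] collects, for core c, its partial result for every input index.
results = [dict() for _ in range(num_cores)]
current = list(range(num_cores))       # index of the input currently held by each core

for _ in range(num_cores):             # one round per input
    # Step 403: every core multiplies its weight part by its current input.
    for c in range(num_cores):
        results[c][current[c]] = parts[c] @ inputs[current[c]]
    # Ring rotation (steps 401/405 style): core c passes its input to core c-1,
    # i.e., core c now holds the input previously held by core (c + 1) % num_cores.
    current = [current[(c + 1) % num_cores] for c in range(num_cores)]

# After num_cores rounds every input has met every weight part exactly once.
for i in range(num_cores):
    full = np.concatenate([results[c][i] for c in range(num_cores)])
    assert np.allclose(full, np.vstack(parts) @ inputs[i])
```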
FIG. 5 illustrates a flowchart of another exemplary method 500 for processing a neural network, according to some embodiments of the present disclosure. Method 500 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, method 500 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
At step 501, as shown in FIG. 5, a plurality of inputs can be loaded onto a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B). For example, as discussed above with reference to FIG. 2, CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 (e.g., cores 202a-d) of HAPU 200. In some embodiments, CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories 2022 of the plurality of cores 202. For example, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
The plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network. In some embodiments, the plurality of inputs can include a plurality of activations. A number of the inputs can be equal to or less than a number of the cores. In the case that the number of inputs is less than the number of cores, some of the plurality of cores do not have an input.
At step 503, at each core of the plurality of cores, a computation can be performed using corresponding part of a weight matrix and an input loaded onto the core. For example, each of the plurality of cores can perform a computation using the corresponding part of the weight matrix and an input loaded to the core. The weight matrix relates to the neural network under processing. The weight matrix can be divided into a plurality of parts. The plurality of cores can each have a corresponding part of the weight matrix. The number of parts of the weight matrix can be equal to the number of cores. Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) . Each core 202 (e.g., core 202a, core 202b, core 202c or core 202d) can perform a computation using the corresponding part (e.g., part_a, part_b, part_c or part_d) of the weight matrix and each input loaded to the core.
At step 505, an input on one core can be communicated to another core. In some embodiments, the input is sequentially communicated to another core. For example, with reference to FIG. 2, CU 208 can sequentially communicate input_a from core 202a to core 202d, input_b from core 202b to core 202a, input_c from core 202c to core 202b, and input_d from core 202d to core 202c. In some embodiments, transmission engine 2026 can assist this communication. For example, transmission engine 2026 on core 202a can read input_a from local memory 2022 and transmit it to CU 208.
At step 507, at each core of the plurality of cores, a computation can be performed using corresponding part of the weight matrix and an input communicated to the core. For example, with reference to FIG. 2, core 202a can perform a computation using input_b and part_a of the weight matrix, core 202b can perform a computation using input_c and part_b of the weight matrix, and so on.
At step 509, a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to in step 501. For example, with reference to FIG. 2, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d. Results of computations using input_a and part_b, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202a, results of computations using input_b and part_a, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202b, and so on. The communication of a result of computation can be performed in parallel with the next round of computation. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on. In some embodiments, step 509 may be omitted from method 500.
At step 511, whether every input has been circulated through each of the plurality of cores can be determined. If not (e.g., indicated by NO in FIG. 5) , method 500 proceeds back to step 505, and performs another round of computations and communications. At step 505, an input on one core can be communicated to another core. The communication of the input can be performed in parallel with the computation using the input. At step 507, each core can perform another computation using corresponding part of the weight matrix  and an input communicated to the core. In some embodiments, at step 509, a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to. The communication of the result of the computation can be performed in parallel with next round of computation. For example, with reference to FIG. 2, at step 505, CU 208 can communicate input_b from core 202a to core 202d, input_c from core 202b to core 202a, input_d from core 202c to core 202b, and input_a from core 202d to core 202c. At step 507, core 202a can perform a computation using input_c and part_a of the weight matrix, core 202b can perform a computation using input_d and part_b of the weight matrix, and so on. At step 509, the result of computation on core 202a using input_c and part_a of the weight matrix can be communicated to core 202c, the result of computation on core 202b using input_d and part_b of the weight matrix can be communicated to core 202d, and so on.
Method 500 can include a plurality of rounds of communications and computations (e.g., steps 505 and 507) until every input goes through each of the cores. In some embodiments, communication of an input can be performed in parallel with current computations using this input. For example, with reference to FIG. 2, communication of input_b from core 202a to core 202d can be performed in parallel with computation on core 202a using input_b and part_a of the weight matrix, communication of input_c from core 202b to 202a can be performed in parallel with computation on core 202b using input_c and part_b of the weight matrix, and so on.
If every input has been circulated through each of the plurality of cores (e.g., indicated by YES in FIG. 5) , method 500 proceeds to step 513. At step 513, results of the computations can be output. The results can include computation results using each of the inputs and each part of the weight matrix corresponding to the plurality of cores.
FIG. 6 is a schematic diagram illustrating an exemplary process 600 of processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure. It is appreciated that process 600 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, process 600 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
As shown in FIG. 6, the HAPU can include four cores, core_0, core_1, core_2 and core_3. Each core can be associated with (e.g., store) a corresponding part of the weight matrix. A weight matrix can be divided into four parts, w0, w1, w2 and w3, which are distributed to core_0, core_1, core_2 and core_3, respectively. For example, a core can store its corresponding part of the weight matrix in local memory. It is appreciated that while four cores and four parts of the weight matrix are shown, the HAPU can include more or fewer cores and the weight matrix can include more or fewer parts. In some embodiments, a number of parts of the weight matrix can be equal to a number of cores on the HAPU. In some other embodiments, the number of parts of the weight matrix can be less than the number of cores on the HAPU. In this case, some of the cores on the HAPU have no corresponding part of the weight matrix.
A plurality of inputs, e.g., b0, b1, b2 and b3 as shown in FIG. 6, are loaded onto the plurality of cores on the HAPU, e.g., core_0, core_1, core_2 and core_3, respectively. In some embodiments, the number of inputs can be equal to the number of cores on the HAPU each having a part of weight matrix. In some other embodiments, the number of inputs can be less than the number of cores on the HAPU each having a part of weight matrix.
At time t0, each core can perform a first round of computation using an input on the core and a part of the weight matrix corresponding to the core. For example, core_0 can perform a first round of computation using an input b0 on core_0 and w0 of the weight matrix, core_1 can perform a first round of computation using an input b1 on core_1 and w1 of the weight matrix, and so on. Each core can store the result of this round of computation (shown as b0/w0, b1/w1, b2/w2 or b3/w3 in FIG. 6) in its local memory. In some embodiments, each core can also store the result of this round of computation (e.g., b0/w0, b1/w1, b2/w2 and b3/w3) at a corresponding address in an output (e.g., output_0, output_1, output_2 or output_3).
In addition, an input on one core can be communicated to another core, for example, in a sequential order. With reference to FIG. 2, CU 208 can perform the communication with the assistance of transmission engine 2026. For example, transmission engine 2026 can read an input from local memory 2022 and transmit it to CU 208, which communicates it to another core. As shown in FIG. 6, input b0 can be communicated from core_0 to core_3, input b1 can be communicated from core_1 to core_0, input b2 can be communicated from core_2 to core_1, and input b3 can be communicated from core_3 to core_2. The communication of an input can be performed in parallel with the computation on the core using this input.
At time t1, each core can perform a second round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a second round of computation using an input b1 on the core_0 and w0 of the weight matrix, core_1 can perform a second round of computation using an input b2 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the second round of computation (shown as b1/w0, b2/w1, b3/w2 and b0/w3 in FIG. 6) in its local memory.
In addition, a second round of sequential communication of an input on one core to another core can be performed. As shown in FIG. 6, input b1 on core_0 can be communicated to core_3, input b2 on core_1 can be communicated to core_0, input b3 on core_2 can be communicated to core_1, and input b0 on core_3 can be communicated to core_2. The second round of communication of an input can also be performed in parallel with the second round of computation on the core using this input.
At time t2, each core can perform a third round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a third round of computation using an input b2 on the core_0 and w0 of the weight matrix, core_1 can perform a third round of computation using an input b3 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the third round of computation (shown as b2/w0, b3/w1, b0/w2 and b1/w3 in FIG. 6) in its local memory.
In addition, a third round of communication of an input on one core to another core can be performed. As shown in FIG. 6, input b2 on core_0 can be communicated to core_3, input b3 on core_1 can be communicated to core_0, input b0 on core_2 can be communicated to core_1, and input b1 on core_3 can be communicated to core_2. The third round of communication of an input can also be performed in parallel with the third round of computation on the core using the input.
In some embodiments, a result of previous round (e.g., second round) of computation can be communicated to the core which the input is initially loaded to. With reference to FIG. 2, CU 208 can perform the communication of the result with assistance of transmission engine 2026. For example, transmission engine 2026 can transmit the result from the local memory 2022 to CU 208 which communicates it to the corresponding core. As shown in the shaded blocks of FIG. 6, result b1/w0 on core_0 can be communicated to  core_1, result b2/w1 on core_1 can be communicated to core_2, result b3/w2 on core_2 can be communicated to core_3, and result b0/w3 on core_3 can be communicated to core_0. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) . In some embodiments, the communication of the result of previous round (e.g., second round) of computation can be performed in parallel with current round (e.g., third round) of computation.
At time t3, each core can perform a fourth round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a fourth round of computation using an input b3 on the core_0 and w0 of the weight matrix, core_1 can perform a fourth round of computation using an input b0 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the fourth round of computation (shown as b3/w0, b0/w1, b1/w2, and b2/w3 in FIG. 6) in its local memory.
In some embodiments, a result of previous round (e.g., third round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to. As shown in FIG. 6, result b2/w0 on core_0 can be communicated to core_2, result b3/w1 on core_1 can be communicated to core_3, result b0/w2 on core_2 can be communicated to core_0, and result b1/w3 on core_3 can be communicated to core_1. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) . In some embodiments, the communication of the result of previous round (e.g., third round) of computation can be performed in parallel with current round (e.g., fourth round) of computation.
In some embodiments, a result of the final round (e.g., fourth round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to. As shown in FIG. 6, result b3/w0 on core_0 can be  communicated to core_3, result b0/w1 on core_1 can be communicated to core_0, result b1/w2 on core_2 can be communicated to core_1, and result b2/w3 on core_3 can be communicated to core_2. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
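After the final round in FIG. 6, each output_i gathered on core_i contains the partial results of input b_i against every weight part w0-w3, so stacking the parts in order reproduces what a single core holding the whole weight matrix would have computed. A small check of that property, with hypothetical data standing in for b0-b3 and w0-w3:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 5))                  # full weight matrix
w_parts = np.split(W, 4, axis=0)                 # w0, w1, w2, w3 (one per core)
b = [rng.standard_normal(5) for _ in range(4)]   # inputs b0..b3

# output_i as assembled on core_i from the partial results communicated back to it
# (shaded blocks of FIG. 6): b_i against w0, w1, w2, w3, stored in order.
outputs = [np.concatenate([w_parts[k] @ b[i] for k in range(4)]) for i in range(4)]

for i in range(4):
    assert np.allclose(outputs[i], W @ b[i])     # matches the monolithic computation
```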
In some embodiments, outputs (e.g., output_0, output_1, output_2, and output_3) can be provided to other components of the HAPU or neural network.
Embodiments of the disclosure can bring many technical advantages. For example, in some embodiments of the disclosure, a plurality of cores can each have a part of, rather than the entire, weight matrix, and can perform parallel computations using parts of the weight matrix and multiple inputs. Some embodiments of the disclosure can provide fast communication of data (e.g., inputs or results of computations) across cores, and perform the communication in parallel with computation, which can significantly reduce time for processing a neural network.
Embodiments of the disclosure can be applied to many products, environments, and scenarios. For example, some embodiments of the disclosure can be applied to Ali-NPU (e.g., Hanguang NPU) , Ali-Cloud, Ali-DAU (Database Acceleration Unit) , Ali-AI platform, GPU, TPU, or the like.
The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
The embodiments may further be described using the following clauses:
1. A method for processing a neural network, comprising:
receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
receiving a first input of the plurality of inputs at a core of the plurality of cores;
performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
communicating the first input from the core to another core of the plurality of cores.
2. The method according to clause 1, further comprising:
receiving, at the core, a second input from yet another core of the plurality of cores;
performing, at the core, a second computation using the first part of the weight matrix and the second input; and
communicating a result of the second computation from the core to the yet another core.
3. The method according to clause 1 or clause 2, wherein:
communicating the first input is performed in parallel with the first computation.
4. The method according to any of clauses 1-3, further comprising: communicating the second input from the core to the another core.
5. The method according to clause 4, wherein communicating the second input is performed in parallel with the second computation.
6. The method according to any of clauses 1-4, further comprising:
performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
7. The method according to any of clauses 1-6, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
8. The method according to any of clauses 1-7, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
9. The method according to any of clauses 1-8, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
10. The method according to any of clauses 1-9, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
11. A heterogeneous acceleration processing unit (HAPU) , comprising:
a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
receive a first input of a plurality of inputs; and
perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
12. The heterogeneous acceleration processing unit according to clause 11, wherein the core is configured to:
receive a second input from yet another core of the plurality of cores; and
perform a second computation using the first part of the weight matrix and the second input, and
wherein the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
13. The heterogeneous acceleration processing unit according to clauses 11 or 12, wherein the communication of the first input from the core to the another core by the communication unit is performed in parallel with the first computation by the core.
14. The heterogeneous acceleration processing unit according to any of clauses 11-13, wherein the communication unit is configured to communicate the second input from the core to the another core.
15. The heterogeneous acceleration processing unit according to clause 14, wherein the communication of the second input from the core to the another core by the communication unit is performed in parallel with the second computation by the core.
16. The heterogeneous acceleration processing unit according to any of clauses 11-15, wherein the another core is configured to perform a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
17. The heterogeneous acceleration processing unit according to any of clauses 11-16, wherein the communication unit is configured to load the first input from an external memory to the core.
18. The heterogeneous acceleration processing unit according to any of clauses 11-17, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
19. The heterogeneous acceleration processing unit according to any of clauses 11-18, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
20. The heterogeneous acceleration processing unit according to any of clauses 11-19, wherein the core comprises:
a local memory for storing the first part of the weight matrix and a result of the first computation;
at least one computation engine communicatively coupled with the local memory and configured to perform the first computation; and
a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
21. A non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
receiving a plurality of inputs;
receiving a first input of the plurality of inputs at a core of the plurality of cores, wherein a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
communicating the first input from the core to another core of the plurality of cores.
22. The non-transitory computer readable storage media according to clause 21, wherein the method further comprises:
receiving, at the core, a second input from yet another core of the plurality of cores;
performing, at the core, a second computation using the first part of the weight matrix and the second input; and
communicating a result of the second computation from the core to the yet another core.
23. The non-transitory computer readable storage media according to clause 21 or clause 22, wherein:
communicating the first input is performed in parallel with the first computation.
24. The non-transitory computer readable storage media according to any of clauses 21-23, wherein the method further comprises:
communicating the second input from the core to the another core.
25. The non-transitory computer readable storage media according to clause 24, wherein communicating the second input is performed in parallel with the second computation.
26. The non-transitory computer readable storage media according to any of clauses 21-25, wherein the method further comprises:
performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
27. The non-transitory computer readable storage media according to any of clauses 21-26, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
28. The non-transitory computer readable storage media according to any of clauses 21-27, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
29. The non-transitory computer readable storage media according to any of clauses 21-28, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
30. The non-transitory computer readable storage media according to any of clauses 21-29, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
31. A terminal, comprising:
a host unit; and
a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit, comprising:
a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
receive a first input of a plurality of inputs; and
perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to precise forms or embodiments disclosed. Modifications and adaptations of the embodiments will be apparent from consideration of the specification and practice of the disclosed embodiments. For example, the described implementations include hardware, but systems and methods consistent with the present disclosure can be implemented with hardware and software. In addition, while certain components have been described as being coupled to one another, such components may be integrated with one another or distributed in any suitable fashion.
Moreover, while illustrative embodiments have been described herein, the scope includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments) , adaptations or alterations based on the present disclosure. The elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as nonexclusive. Further, the steps of the disclosed methods can be modified in any manner, including reordering steps and/or inserting or deleting steps.
The features and advantages of the disclosure are apparent from the detailed specification, and thus, it is intended that the appended claims cover all systems and methods falling within the true spirit and scope of the disclosure. As used herein, the indefinite articles  “a” and “an” mean “one or more. ” Similarly, the use of a plural term does not necessarily denote a plurality unless it is unambiguous in the given context. Further, since numerous modifications and variations will readily occur from studying the present disclosure, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the disclosure.
As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
Other embodiments will be apparent from consideration of the specification and practice of the embodiments disclosed herein. It is intended that the specification and examples be considered as example only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims.

Claims (20)

  1. A method for processing a neural network, comprising:
    receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
    receiving a first input of the plurality of inputs at a core of the plurality of cores;
    performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    communicating the first input from the core to another core of the plurality of cores.
  2. The method according to claim 1, further comprising:
    receiving, at the core, a second input from yet another core of the plurality of cores;
    performing, at the core, a second computation using the first part of the weight matrix and the second input; and
    communicating a result of the second computation from the core to the yet another core.
  3. The method according to claim 1, wherein:
    communicating the first input is performed in parallel with the first computation.
  4. The method according to claim 1, further comprising:
    communicating the second input from the core to the another core.
  5. The method according to claim 4, wherein communicating the second input is performed in parallel with the second computation.
  6. The method according to claim 1, further comprising:
    performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
  7. The method according to claim 1, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
  8. The method according to claim 1, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
  9. A heterogeneous acceleration processing unit (HAPU) , comprising:
    a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
    receive a first input of a plurality of inputs; and
    perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  10. The heterogeneous acceleration processing unit according to claim 9, wherein the core is configured to:
    receive a second input from yet another core of the plurality of cores; and
    perform a second computation using the first part of the weight matrix and the second input, and
    wherein the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
  11. The heterogeneous acceleration processing unit according to claim 9, wherein the communication of the first input from the core to the another core by the communication unit is performed in parallel with the first computation by the core.
  12. The heterogeneous acceleration processing unit according to claim 10, wherein the communication unit is configured to communicate the second input from the core to the another core.
  13. The heterogeneous acceleration processing unit according to claim 12, wherein the communication of the second input from the core to the another core by the communication unit is performed in parallel with the second computation by the core.
  14. The heterogeneous acceleration processing unit according to claim 9, wherein the another core is configured to perform a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
  15. The heterogeneous acceleration processing unit according to claim 9, wherein the communication unit is configured to load the first input from an external memory to the core.
  16. The heterogeneous acceleration processing unit according to claim 9, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
  17. The heterogeneous acceleration processing unit according to claim 9, wherein the core comprises:
    a local memory for storing the first part of the weight matrix and a result of the first computation;
    at least one computation engine communicatively coupled with the local memory and configured to perform the first computation; and
    a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
  18. A non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
    receiving a plurality of inputs;
    receiving a first input of the plurality of inputs at a core of the plurality of cores, wherein a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
    performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    communicating the first input from the core to another core of the plurality of cores.
  19. The non-transitory computer readable storage media according to claim 18, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
  20. A terminal, comprising:
    a host unit; and
    a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit, comprising:
    a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
    receive a first input of a plurality of inputs; and
    perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
PCT/CN2020/070943 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network WO2021138842A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/070943 WO2021138842A1 (en) 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network

Publications (1)

Publication Number Publication Date
WO2021138842A1 true WO2021138842A1 (en) 2021-07-15

Family

ID=76787679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070943 WO2021138842A1 (en) 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network

Country Status (1)

Country Link
WO (1) WO2021138842A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
US20170193361A1 (en) * 2015-12-31 2017-07-06 Microsoft Technology Licensing, Llc Neural network training performance optimization framework
US20180046897A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Hardware accelerator for compressed rnn on fpga
US20190362223A1 (en) * 2017-10-20 2019-11-28 Google Llc Parallel processing for signal generation neural networks
US20190303750A1 (en) * 2019-06-17 2019-10-03 Intel Corporation Reconfigurable memory compression techniques for deep neural networks

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220284658A1 (en) * 2021-03-03 2022-09-08 Nvidia Corporation Fully-fused neural network execution
US11610360B2 (en) 2021-03-03 2023-03-21 Nvidia Corporation Real-time neural network radiance caching for path tracing
US11631210B2 (en) * 2021-03-03 2023-04-18 Nvidia Corporation Fully-fused neural network execution
US11935179B2 (en) 2021-03-03 2024-03-19 Nvidia Corporation Fully-fused neural network execution


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912434

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912434

Country of ref document: EP

Kind code of ref document: A1