WO2021138842A1 - Methods and apparatuses for processing neural network - Google Patents

Methods and apparatuses for processing neural network

Info

Publication number
WO2021138842A1
Authority
WO
WIPO (PCT)
Prior art keywords
core
input
computation
cores
weight matrix
Prior art date
Application number
PCT/CN2020/070943
Other languages
French (fr)
Inventor
Yang Jiao
Yongquan ZHOU
Jun He
Original Assignee
Alibaba Group Holding Limited
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Priority to PCT/CN2020/070943
Publication of WO2021138842A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • In machine learning (ML) or deep learning (DL), a neural network (NN) is a mechanism that basically mimics how a human brain learns.
  • A deep neural network (DNN) is a category of neural networks. Over the years, neural networks (e.g., DNNs) have demonstrated successes in various domains such as computer vision, natural language processing, and the like.
  • Many neural networks have a large weight matrix, which requires significant computational and storage resources for neural network training or deployment.
  • Some techniques have been developed to process neural networks with a large weight matrix on multi-core processing units. For example, one solution is to utilize a level-2 shared memory (i.e., sharing memory across multiple processors) to expand the storage space. But this solution is complicated, difficult to manage, and would significantly increase communication delay (e.g., read, write, or transmission delay).
  • an exemplary method for processing a neural network comprising: receiving a plurality of inputs at a processing unit, the processing unit including a plurality of cores, and a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
  • an exemplary heterogeneous acceleration processing unit can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores.
  • a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores.
  • a core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core.
  • the communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
  • an exemplary terminal can include a host unit and a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit.
  • the HAPU can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores.
  • a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores.
  • a core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core.
  • the communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
  • an exemplary non-transitory computer readable storage media stores a set of instructions.
  • the instructions are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising: receiving a plurality of inputs; receiving a first input of the plurality of inputs at a core of the plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
  • FIG. 1 is a schematic diagram of an exemplary neural network, according to some embodiments of the present disclosure.
  • FIG. 2 is a block diagram of an exemplary heterogeneous acceleration processing unit (HAPU) , according to some embodiments of the present disclosure.
  • FIG. 3A is a block diagram of an exemplary machine learning system, according to some embodiments of the present disclosure.
  • FIG. 3B is a schematic diagram of an exemplary cloud system, according to some embodiments of the present disclosure.
  • FIG. 4 is a flowchart of an exemplary method for processing a neural network, according to some embodiments of the present disclosure.
  • FIG. 5 is a flowchart of another exemplary method for processing a neural network, according to some embodiments of the present disclosure.
  • FIG. 6 is a schematic diagram illustrating processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure.
  • FIG. 1 illustrates an exemplary neural network (NN) 100 in which embodiments of the present disclosure can be implemented.
  • neural network 100 can include an input layer 120 that accepts inputs, e.g., inputs 110-1, ..., 110-m. Inputs can include an image, text, or any other structured or unstructured data for processing by neural network 100.
  • neural network 100 can accept a plurality of inputs simultaneously. For example, in FIG. 1, neural network 100 can accept up to m inputs simultaneously.
  • input layer 120 can accept up to m inputs in rapid succession, e.g., such that input 110-1 is accepted by input layer 120 in one cycle, a second input is accepted by input layer 120 in a second cycle in which input layer 120 pushes data from input 110-1 to a first hidden layer, and so on.
  • the present disclosure does not intend to limit the number of inputs, or the way of inputting, such as simultaneous input, rapid succession input, or the like.
  • Input layer 120 can comprise one or more nodes, e.g., nodes 120-1, 120-2, ..., 120-a. Each node can execute an activation function based on corresponding input (e.g., one or more of inputs 110-1, ..., 110-m) and scale the output from the activation function by a particular weight associated with the node.
  • An activation function can comprise a Heaviside step function, a Gaussian function, a multiquadratic function, an inverse multiquadratic function, a sigmoidal function, a ReLU function, a Leaky ReLU function, a Tanh function, or the like.
  • a weight can comprise a positive value between 0.0 and 1.0 or any other numerical value configured to allow some nodes in the layer to have corresponding output scaled more or less than output corresponding to other nodes in the layer.
  • a plurality of weights can form a weight matrix.
  • neural network 100 can include one or more hidden layers, e.g., hidden layers 130-1, ..., 130-n.
  • Each hidden layer can comprise one or more nodes.
  • hidden layer 130-1 comprises nodes 130-1-1, 130-1-2, 130-1-3, ..., 130-1-b
  • hidden layer 130-n comprises nodes 130-n-1, 130-n-2, 130-n-3, ..., 130-n-c.
  • nodes of the hidden layers can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
  • neural network 100 can include an output layer 140 that finalizes outputs, e.g., outputs 150-1, 150-2, ..., 150-d.
  • Output layer 140 can comprise one or more nodes, e.g., nodes 140-1, 140-2, ..., 140-d. Similar to nodes of input layer 120 and of the hidden layers, nodes of output layer 140 can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
  • the layers of neural network 100 can use any connection scheme.
  • one or more layers (e.g., input layer 120, hidden layer 130-1, ..., hidden layer 130-n, output layer 140, or the like) can be connected using a convolutional scheme, a sparsely connected scheme, or the like. Such embodiments can use fewer connections between one layer and a previous layer than depicted in FIG. 1.
  • neural network 100 can additionally or alternatively use backpropagation, e.g., by using long short-term memory (LSTM) nodes or the like.
  • although neural network 100 is depicted similar to a convolutional neural network (CNN), neural network 100 can comprise a recurrent neural network (RNN) or any other type of neural network.
  • FIG. 2 illustrates an exemplary heterogeneous acceleration processing unit (HAPU) 200, according to some embodiments of the present disclosure.
  • HAPU 200 can include a plurality of cores 202 (e.g., cores 202a, 202b, 202c, and 202d) , an interface 204, a command parser (CP) 206, and a communication unit (CU) 208.
  • HAPU 200 can also include other components, such as a global memory (not shown) and the like.
  • HAPU 200 can be implemented as a neural network processing unit (NPU) .
  • Interface 204 can provide communication between HAPU 200 and external devices.
  • interface 204 can include a peripheral component interconnect express (PCI-E) interface to provide connection with a host unit (not shown in FIG. 2) .
  • Interface 204 can also include a universal serial bus (USB) , a joint test action group (JTAG) interface, a TUN/TAP interface, and/or the like.
  • CP 206 can receive commands or instructions from external devices (e.g., via interface 204) and distribute the commands to corresponding components, such as one or more cores 202 or communication unit 208.
  • CP 206 can interact with the host unit (e.g., under the supervision of a kernel mode driver (KMD)) and receive commands from the host unit.
  • the commands can include a memory access command or a computation command.
  • CP 206 can distribute memory access commands to CU 208 and computation commands to one or more cores 202.
  • CU 208 can be communicatively coupled with components of HAPU 200, and assist with transferring data between these components. For example, CU 208 can assist with transferring data between multiple cores 202 (e.g., cores 202a-202d) or within each core 202a-202d. CU 208 can also allow off-chip devices to access both on-chip and off-chip memory without causing an interrupt. For example, CU 208 can load data or instructions into local memory of cores 202. Thus, CU 208 can also generate memory addresses and initiate memory read or write cycles.
  • CU 208 can also contain several hardware registers that can be written and read by the one or more cores 202, including a memory address register, a byte-count register, one or more control registers, and/or other types of registers. These registers can specify the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, and/or the number of bytes to transfer in one burst. It is appreciated that each core (e.g., core 202a) can include a sub-CU (e.g., transmission engine 2026 as shown in FIG. 2) , which can be used to transfer data within the core and across cores.
  • CU 208 can include a direct memory access (DMA) unit (not shown) and a bus (not shown) .
  • the bus can provide high speed cross-core communication.
  • the bus also connects cores 202 with other units, such as the off-chip memory or peripherals.
  • CU 208 can also move block data among cores 202 via a bus. While a single core 202 is capable of handling a typical training or inference task, a plurality of cores 202 can work together via the bus to take on large and complex tasks (e.g., processing a neural network with a large weight matrix) .
  • Cores 202a-202d can include one or more computation engines configured to perform one or more operations based on commands, e.g., commands received from CP 206.
  • the operations can include multiplication, addition, multiply-accumulate, convolution, element-wise operation, and the like.
  • one or more computation engines of core 202a can include a convolution unit, a pooling unit, a matrix multiplication unit, an element-wise operation (EWOP) unit, and/or the like.
  • Cores 202a-202d can also include one or more local memories (LMs) 2022 and a transmission engine 2026.
  • Local memory 2022 can provide storage space with fast read/write speed.
  • storage space of local memory 2022 can be 250 megabytes (MB) and above, which can reduce interaction with a global memory.
  • Transmission engine 2026 can be included in CU 208 or in each core 202a-202d as an independent communication unit. Transmission engine 2026 can be communicatively coupled with components of core 202, e.g., local memory 2022 and computation engine 2024, and assist with transferring data or commands (or instructions) between these components. Transmission engine 2026 can also assist with communicating data or commands across cores. For example, transmission engine 2026 can transmit data from local memory 2022 or computation engine 2024 to components outside the core, e.g., CU 208, or receive data from components outside the core to local memory 2022.
  • cores 202a-202d can also include a sequencer (not shown) configured to retrieve commands and distribute the commands to other components of the core.
  • the sequencer can distribute a computation command to computation engine 2024 to perform a computation, or distribute a transmission command to transmission engine 2026 to perform a transmission operation.
  • FIG. 3A illustrates an exemplary machine learning system 300, according to some embodiments of the present disclosure.
  • machine learning system 300 can be implemented in a computing device or a terminal.
  • machine learning system 300 can include a host unit 302 (e.g., a central processing unit (CPU) ) , a disk 304, a host memory 306, and a HAPU 308.
  • host memory 306 can be an integral memory or an external memory associated with host unit 302.
  • Host memory 306 can be a local or a global memory.
  • disk 304 may comprise an external memory configured to provide additional memory for host unit 302.
  • Host unit 302 (e.g., an X86 or ARM central processing unit) can be coupled with host memory 306 and disk 304, and configured to process general instructions.
  • HAPU 308 can be coupled to host unit 302 through a peripheral interface (e.g., interface 204) .
  • HAPU 308 can be configured to be used as a co-processor of host unit 302.
  • a compiler can be included in a host unit (e.g., host unit 302 of FIG. 3A) , host memory (e.g., host memory 306 of FIG. 3A) or HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) .
  • the compiler can be configured to push one or more commands or instructions to HAPU.
  • the compiler can be implemented as a program or computer software that transforms computer codes written in one programming language into instructions for HAPU to create an executable program.
  • a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, code optimization, and code generation, or combinations thereof.
  • the compiler can compile a neural network to generate static or semi-static parameters, e.g., connections among nodes (or neurons) and weights of the nodes.
  • the commands pushed into HAPU can be further distributed to corresponding components (e.g., one or more cores 202 or CU 208 of FIG. 2) of HAPU by CP (e.g., CP 206 of FIG. 2).
  • FIG. 3B illustrates a schematic diagram of an exemplary cloud system 310, according to some embodiments of the disclosure.
  • the cloud system 310 can include a plurality of computing servers (e.g., computing servers 312 and 314) .
  • computing server 312 can, for example, include the machine learning system 300, which includes HAPU 308.
  • the cloud system 310 may be connected to user devices via a network. With the assistance of HAPU 308, cloud system 310 can provide extended AI capabilities of image recognition, facial recognition, translations, 3D modeling, and the like.
  • HAPU can be implemented in computing devices or terminals in various ways.
  • HAPU can be integrated in a computing device or terminal, such as a smart phone, a tablet, wearable device, or the like.
  • FIG. 4 is a flowchart of an exemplary method 400 for processing a neural network, according to some embodiments of the present disclosure.
  • Method 400 can be implemented by a processing unit, such as HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A- 3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B.
  • method 400 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • a plurality of inputs can be transmitted to each of a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) in sequence.
  • CU 208 can transmit a plurality of inputs to a plurality of cores 202 (e.g., cores 202a-202d) of HAPU 200 respectively.
  • CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 of HAPU 200.
  • CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories of the plurality of cores 202.
  • CU 208 can load input_a, input_b, input_c and input_d to core 202a, core 202b, core 202c, and core 202d, respectively.
  • CU 208 can communicate (e.g., transfer or copy) an input from one core to another core.
  • CU 208 can transfer input_a from core 202a to core 202d, transfer input_b from core 202b to 202a, transfer input_c from core 202c to core 202b, and transfer input_d from core 202d to core 202c.
  • CU 208 can copy input_a from core 202a and save a copy of input_a in core 202d, copy input_b from core 202b and save a copy of input_b in 202a, copy input_c from core 202c and save a copy of input_c in core 202b, and copy input_d from core 202d and save a copy of input_d in core 202c.
  • the HAPU may perform a plurality of rounds of communications until every input is received at each of the cores.
  • the HAPU may perform an initial round of loading of the inputs to respective cores of the HAPU and (N-1) rounds of communications of the current inputs in the cores to other cores of the HAPU in sequence.
  • transmission engine 2026 can assist this communication by, e.g., reading the input from local memory and transmitting it to CU 208.
  • the plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network.
  • the plurality of inputs can include a plurality of activations.
  • the number of the inputs can be equal to or less than the number of the cores in the HAPU. In the case that the number of inputs is less than the number of available cores, some of the cores may not have an input.
  • a computation is repeatedly performed using the part of a weight matrix corresponding to the core and the input received at the core.
  • each of the plurality of cores can perform a computation using the part of the weight matrix corresponding to the core and an input received (e.g., loaded from an external memory or communicated from another core) at the core.
  • Each core can perform a plurality of rounds of computations, each round with a different input. The number of the rounds of computations performed on each core can be equal to the number of inputs.
  • the weight matrix relates to the neural network being processed.
  • the weight matrix can be divided into a plurality of parts.
  • the plurality of cores each has a corresponding part of the weight matrix.
  • the number of parts of the weight matrix can be equal to the number of cores.
  • Each core can store a corresponding part of the weight matrix in its local memory.
  • CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) .
  • Because each part of the weight matrix has a smaller size than the entire weight matrix, requirements for computation and storage resources can be reduced. Then, when the plurality of parts of the weight matrix are distributed to multiple cores, each core would have sufficient computation and storage resources to perform a computation with a corresponding part of the weight matrix.
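  • As a purely illustrative sketch (not part of the original disclosure; NumPy and all names below are assumptions), the division of a weight matrix into per-core parts described above can be pictured as follows:

```python
import numpy as np

# Hypothetical sizes; the disclosure does not specify them.
num_cores = 4
out_features, in_features = 16, 8
weight_matrix = np.random.rand(out_features, in_features)

# Divide the weight matrix into one part per core, here along the output
# dimension, so each part is smaller than the entire matrix and can fit
# in the local memory of a single core.
weight_parts = np.array_split(weight_matrix, num_cores, axis=0)

for core_id, part in enumerate(weight_parts):
    print(f"core_{core_id} holds a part of shape {part.shape}")
```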
  • communication of an input to another core can be performed in parallel with current computation using this input.
  • communication of input_a from core 202a to core 202d can be performed in parallel with computation on core 202a using input_a and corresponding part_a of the weight matrix
  • communication of input_b from core 202b to 202a can be performed in parallel with computation on core 202b using input_b and corresponding part_b of the weight matrix, and so on.
  • results of computations using an input received from another core can be communicated to the core which the input is initially loaded to.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • Results of computations using input_a and a part of the weight matrix stored at core 202d can be communicated by CU 208 to core 202a
  • results of computations using input_b and a part of the weight matrix stored at core 202a can be communicated by CU 208 to core 202b, and so on.
  • transmission engine 2026 can perform the communication by, e.g., reading the result from local memory and transmitting it to CU 208.
  • step 405 may be omitted from method 400.
  • step 405 can be performed in parallel with current round of computations. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on.
  • each of the plurality of cores performs rounds of computation using each of the inputs and the part of the weight matrix corresponding to the core. For example, referring to FIG. 2, each of the input_a, input_b, input_c and input_d is computed with each part of the weight matrix corresponding to the cores 202a-202d. After each of the plurality of inputs is used by each of the plurality of cores for computation, the method 400 may proceed to step 407.
  • results of the computations can be output.
  • the results can include computation results using all inputs and all parts of the weight matrix.
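  • The overlap of cross-core communication with the current round of computation described above can be sketched roughly as below (a software simulation only, assuming Python threads; on the HAPU such a transfer would be handled by CU 208 or transmission engine 2026):

```python
import threading

import numpy as np

def compute(weight_part, current_input):
    # Current round of computation on this core: use the locally stored
    # part of the weight matrix and the input currently held by the core.
    return weight_part @ current_input

def communicate(current_input, next_core_buffer):
    # Stand-in for communicating the current input to another core while
    # the computation above is still running.
    next_core_buffer.append(current_input.copy())

weight_part = np.random.rand(4, 8)   # this core's part of the weight matrix
current_input = np.random.rand(8)    # input currently held by this core
next_core_buffer = []                # stand-in for the next core's local memory

transfer = threading.Thread(target=communicate,
                            args=(current_input, next_core_buffer))
transfer.start()                                       # start the transfer ...
partial_result = compute(weight_part, current_input)   # ... compute in parallel
transfer.join()                                        # transfer done before the next round
```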
  • FIG. 5 illustrates a flowchart of another exemplary method 500 for processing a neural network, according to some embodiments of the present disclosure.
  • Method 500 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B.
  • method 500 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • a plurality of inputs can be loaded onto a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) .
  • CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 (e.g., cores 202a-d) of HAPU 200.
  • CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories 2022 of the plurality of cores 202.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • the plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network.
  • the plurality of inputs can include a plurality of activations.
  • a number of the inputs can be equal to or less than a number of the cores. In the case that the number of inputs is less than the number of cores, some of the plurality of cores do not have an input.
  • a computation can be performed using corresponding part of a weight matrix and an input loaded onto the core.
  • each of the plurality of cores can perform a computation using the corresponding part of the weight matrix and an input loaded to the core.
  • the weight matrix relates to the neural network under processing.
  • the weight matrix can be divided into a plurality of parts.
  • the plurality of cores can each have a corresponding part of the weight matrix.
  • the number of parts of the weight matrix can be equal to the number of cores.
  • Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d).
  • an input on one core can be communicated to another core.
  • the input is sequentially communicated to another core.
  • CU 208 can sequentially communicate input_a from core 202a to core 202d, input_b from core 202b to 202a, input_c from core 202c to core 202b, and input_d from core 202d to core 202c.
  • transmission engine 2026 can assist this communication.
  • transmission engine 2026 on core 202a can read input_a from local memory 2022 and transmit it to CU 208.
  • a computation can be performed using corresponding part of the weight matrix and an input communicated to the core.
  • core 202a can perform a computation using input_b and part_a of the weight matrix
  • core 202b can perform a computation using input_c and part_b of the weight matrix, and so on.
  • a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to in step 501.
  • CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
  • Results of computations using input_a and part_b, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202a
  • results of computations using input_b and part_a, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202b, and so on.
  • the communication of a result of computation can be performed in parallel with next round of computation.
  • step 507 may be omitted from method 500.
  • At step 511, whether every input has been circulated through each of the plurality of cores can be determined. If not (e.g., indicated by NO in FIG. 5), method 500 proceeds back to step 505 and performs another round of computations and communications.
  • an input on one core can be communicated to another core. The communication of the input can be performed in parallel with the computation using the input.
  • each core can perform another computation using corresponding part of the weight matrix and an input communicated to the core.
  • a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to. The communication of the result of the computation can be performed in parallel with next round of computation. For example, with reference to FIG.
  • CU 208 can communicate input_b from core 202a to core 202d, input_c from core 202b to core 202a, input_d from core 202c to core 202b, and input_a from core 202d to core 202c.
  • core 202a can perform a computation using input_c and part_a of the weight matrix
  • core 202b can perform a computation using input_d and part_b of the weight matrix, and so on.
  • the result of computation on core 202a using input_c and part_a of the weight matrix can be communicated to core 202c
  • the result of computation on core 202b using input_d and part_b of the weight matrix can be communicated to core 202d, and so on.
  • Method 500 can include a plurality of rounds of communications and computations (e.g., steps 505 and 507) until every input goes through each of the cores.
  • communication of an input can be performed in parallel with current computations using this input.
  • communication of input_b from core 202a to core 202d can be performed in parallel with computation on core 202a using input_b and part_a of the weight matrix
  • communication of input_c from core 202b to 202a can be performed in parallel with computation on core 202b using input_c and part_b of the weight matrix, and so on.
  • At step 513, results of the computations can be output.
  • the results can include computation results using each of the inputs and each part of the weight matrix corresponding to the plurality of cores.
  • FIG. 6 is a schematic diagram illustrating an exemplary process 600 of processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure. It is appreciated that process 600 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, process 600 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
  • the HAPU can include four cores, core_0, core_1, core_2 and core_3. Each core can be associated with (e.g., store) a corresponding part of the weight matrix.
  • a weight matrix can be divided into four parts, w0, w1, w2 and w3, which are distributed to core_0, core_1, core_2 and core_3, respectively.
  • a core can store its corresponding part of the weight matrix in local memory.
  • the HAPU can include more or fewer cores and the weight matrix can include more or fewer parts.
  • a number of parts of the weight matrix can be equal to a number of cores on the HAPU.
  • the number of parts of the weight matrix can be less than the number of cores on the HAPU. In this case, some of cores on the HAPU have no corresponding parts of the weight matrix.
  • a plurality of inputs e.g., b0, b1, b2 and b3 as shown in FIG. 6, are loaded onto the plurality of cores on the HAPU, e.g., core_0, core_1, core_2 and core_3, respectively.
  • the number of inputs can be equal to the number of cores on the HAPU each having a part of weight matrix. In some other embodiments, the number of inputs can be less than the number of cores on the HAPU each having a part of weight matrix.
  • each core can perform a first round of computation using an input on the core and a part of the weight matrix corresponding to the core.
  • core_0 can perform a first round of computation using an input b0 on the core_0 and w0 of the weight matrix
  • core_1 can perform a first round of computation using an input b1 on the core_1 and w1 of the weight matrix
  • Each core can store the result of this round of computation (shown as b0/w0, b1/w1, b2/w2 or b3/w3 in FIG. 6) in its local memory.
  • each core can also store the result of this round of computation (e.g., b0/w0, b1/w1, b2/w2 and b3/w3) at a corresponding address in an output (e.g., output_0, output_1, output_2 or output_3).
  • an input on one core can be communicated to another core, for example, in a sequential order.
  • CU 208 can perform the communication with assistance of transmission engine 2026.
  • transmission engine 2026 can transmit or read an input from the local memory 2022 to CU 208, which communicates it to another core.
  • input b0 can be communicated from core_0 to core_3
  • input b1 can be communicated from core_1 to core_0
  • input b2 can be communicated from core_2 to core_1
  • input b3 can be communicated from core_3 to core_2.
  • the communication of an input can be performed in parallel with the computation on the core using this input.
  • each core can perform a second round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a second round of computation using an input b1 on the core_0 and w0 of the weight matrix
  • core_1 can perform a second round of computation using an input b2 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the second round of computation (shown as b1/w0, b2/w1, b3/w2 and b0/w3 in FIG. 6) in its local memory.
  • a second round of sequential communication of an input on one core to another core can be performed.
  • input b1 on core_0 can be communicated to core_3
  • input b2 on core_1 can be communicated to core_0
  • input b3 on core_2 can be communicated to core_1
  • input b0 on core_3 can be communicated to core_2.
  • the second round of communication of an input can also be performed in parallel with the second round of computation on the core using this input.
  • each core can perform a third round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a third round of computation using an input b2 on the core_0 and w0 of the weight matrix
  • core_1 can perform a third round of computation using an input b3 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the third round of computation (shown as b2/w0, b3/w1, b0/w2 and b1/w3 in FIG. 6) in its local memory.
  • a third round of communication of an input on one core to another core can be performed.
  • input b2 on core_0 can be communicated to core_3
  • input b3 on core_1 can be communicated to core_0
  • input b0 on core_2 can be communicated to core_1
  • input b1 on core_3 can be communicated to core_2.
  • the third round of communication of an input can also be performed in parallel with the third round of computation on the core using the input.
  • a result of previous round (e.g., second round) of computation can be communicated to the core which the input is initially loaded to.
  • CU 208 can perform the communication of the result with assistance of transmission engine 2026.
  • transmission engine 2026 can transmit the result from the local memory 2022 to CU 208 which communicates it to the corresponding core.
  • result b1/w0 on core_0 can be communicated to core_1
  • result b2/w1 on core_1 can be communicated to core_2
  • result b3/w2 on core_2 can be communicated to core_3
  • result b0/w3 on core_3 can be communicated to core_0.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • the communication of the result of previous round (e.g., second round) of computation can be performed in parallel with current round (e.g., third round) of computation.
  • each core can perform a fourth round of computation using an input on the core and the part of the weight matrix corresponding to the core.
  • core_0 can perform a fourth round of computation using an input b3 on the core_0 and w0 of the weight matrix
  • core_1 can perform a fourth round of computation using an input b0 on the core_1 and w1 of the weight matrix
  • Each core can store the result of the fourth round of computation (shown as b3/w0, b0/w1, b1/w2, and b2/w3 in FIG. 6) in its local memory.
  • a result of previous round (e.g., third round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to.
  • result b2/w0 on core_0 can be communicated to core_2
  • result b3/w1 on core_1 can be communicated to core_3
  • result b0/w2 on core_2 can be communicated to core_0
  • result b1/w3 on core_3 can be communicated to core_1.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • the communication of the result of previous round (e.g., third round) of computation can be performed in parallel with current round (e.g., fourth round) of computation.
  • a result of the final round (e.g., fourth round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to.
  • result b3/w0 on core_0 can be communicated to core_3
  • result b0/w1 on core_1 can be communicated to core_0
  • result b1/w2 on core_2 can be communicated to core_1
  • result b2/w3 on core_3 can be communicated to core_2.
  • the communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
  • outputs (e.g., output_0, output_1, output_2, and output_3) can be provided to other components of the HAPU or neural network.
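  • A minimal software sketch of the FIG. 6 walkthrough above is given below (illustrative only; NumPy, the rotation loop, and all variable names are assumptions, not the disclosed hardware implementation). It rotates four inputs b0-b3 through four cores holding weight parts w0-w3 and gathers each input's partial results into its own output:

```python
import numpy as np

num_cores = 4
rng = np.random.default_rng(0)

inputs = [rng.random(8) for _ in range(num_cores)]       # b0, b1, b2, b3
parts = [rng.random((4, 8)) for _ in range(num_cores)]   # w0, w1, w2, w3 (row blocks)

# outputs[i] collects every partial result belonging to input b_i,
# mirroring output_0 .. output_3 in FIG. 6.
outputs = [dict() for _ in range(num_cores)]

# held[c] is the index of the input currently on core_c; initially b_c is
# loaded onto core_c.
held = list(range(num_cores))

for round_idx in range(num_cores):
    # Each core computes b_i / w_c with its own weight part and the input
    # it currently holds; the result is stored for the core where b_i was
    # initially loaded (result routing is simulated by indexing outputs[i]).
    for core in range(num_cores):
        i = held[core]
        outputs[i][core] = parts[core] @ inputs[i]
    # Rotate inputs as in the walkthrough: core_0 -> core_3, core_1 -> core_0,
    # core_2 -> core_1, core_3 -> core_2.
    held = [held[(core + 1) % num_cores] for core in range(num_cores)]

# After num_cores rounds, outputs[i] holds b_i combined with w0..w3, i.e.,
# the product of input b_i with the full (stacked) weight matrix.
result_for_b0 = np.concatenate([outputs[0][c] for c in range(num_cores)])
```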
  • Embodiments of the disclosure can bring many technical advantages.
  • a plurality of cores can each have a part of, rather than the entire, weight matrix, and can perform parallel computations using parts of the weight matrix and multiple inputs.
  • Some embodiments of the disclosure can provide fast communication of data (e.g., inputs or results of computations) across cores, and perform the communication in parallel with computation, which can significantly reduce time for processing a neural network.
  • Embodiments of the disclosure can be applied to many products, environments, and scenarios.
  • some embodiments of the disclosure can be applied to Ali-NPU (e.g., Hanguang NPU) , Ali-Cloud, Ali-DAU (Database Acceleration Unit) , Ali-AI platform, GPU, TPU, or the like.
  • a computer readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc.
  • program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
  • a method for processing a neural network comprising:
  • the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
  • communicating the first input is performed in parallel with the first computation.
  • receiving the first input at the core comprises loading the first input from an external memory to the core.
  • the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
  • a heterogeneous acceleration processing unit comprising:
  • a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
  • a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  • the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
  • the heterogeneous acceleration processing unit according to any of clauses 11-17, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
  • a local memory for storing the first part of the weight matrix and a result of the first computation
  • At least one computation engine communicatively coupled with the local memory and configured to perform the first computation
  • a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
  • a non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
  • communicating the first input is performed in parallel with the first computation.
  • non-transitory computer readable storage media according to any of clauses 21-29, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
  • a terminal comprising:
  • a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
  • a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  • the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Methods and apparatuses for processing a neural network are provided. The methods include: receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and communicating the first input from the core to another core of the plurality of cores.

Description

METHODS AND APPARATUSES FOR PROCESSING NEURAL NETWORK
BACKGROUND
In machine learning (ML) or deep learning (DL) , a neural network (NN) is a mechanism that basically mimics how a human brain learns. A deep neural network (DNN) is a category of neural networks. Over the years, neural networks (e.g., DNNs) have demonstrated successes in various domains such as computer vision, natural language processing and the like.
Many neural networks have a large weight matrix, which requires significant computational and storage resources for neural network training or deployment. Some techniques have been developed to process neural networks with a large weight matrix on multi-core processing units. For example, one solution is to utilize a level-2 shared memory (i.e., sharing memory across multiple processors) to expand the storage space. But this solution is complicated, difficult to manage, and would significantly increase communication delay (e.g., read, write, or transmission delay).
SUMMARY
In some embodiments, an exemplary method for processing a neural network comprises: receiving a plurality of inputs at a processing unit, the processing unit including a plurality of cores, and a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; receiving a first input of the plurality of inputs at a core of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary heterogeneous acceleration processing unit (HAPU) can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores. A weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores. A core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core. The communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary terminal can include a host unit and a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit. The HAPU can include a plurality of cores and a communication unit communicatively coupled with the plurality of cores. A weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores. A core of the plurality of cores can be configured to: receive a first input of a plurality of inputs; and perform a first computation using the first input and a first part of the weight matrix, the first part of the weight matrix being associated with the core. The communication unit can be configured to communicate the first input from the core to another core of the plurality of cores.
In some embodiments, an exemplary non-transitory computer readable storage media stores a set of instructions. The instructions are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising: receiving a plurality of inputs; receiving a first input of the plurality of inputs at a core of the plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores; performing, at the core, a first computation using the first input and a first part of the weight  matrix, the first part of the weight matrix being associated with the core; and communicating the first input from the core to another core of the plurality of cores.
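The per-core behavior summarized above can be sketched, purely for illustration, as the following Python snippet (all names, and the callback standing in for the communication unit, are assumptions rather than the claimed implementation):

```python
import numpy as np

def core_step(first_input, first_part, send_to_another_core):
    # Perform the first computation with the part of the weight matrix
    # assigned to this core, then pass the received input on to another core.
    partial_result = first_part @ first_input
    send_to_another_core(first_input)
    return partial_result

# Toy usage: a list stands in for the other core's local memory.
received_by_other_core = []
part = np.random.rand(4, 8)      # first part of the weight matrix, assigned to this core
first_input = np.random.rand(8)  # first input received at this core
result = core_step(first_input, part, received_by_other_core.append)
```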
Additional objects and advantages of the present disclosure will be set forth in part in the following detailed description, and in part will be obvious from the description, or may be learned by practice of the present disclosure. The objects and advantages of the present disclosure will be realized and attained by means of the elements and combinations particularly pointed out in the appended claims.
It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and are not restrictive of the disclosed embodiments.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which comprise a part of this specification, illustrate several embodiments and, together with the description, serve to explain the principles and features of the disclosed embodiments. In the drawings:
FIG. 1 is a schematic diagram of an exemplary neural network, according to some embodiments of the present disclosure.
FIG. 2 is a block diagram of an exemplary heterogeneous acceleration processing unit (HAPU) , according to some embodiments of the present disclosure.
FIG. 3A is a block diagram of an exemplary machine learning system, according to some embodiments of the present disclosure.
FIG. 3B is a schematic diagram of an exemplary cloud system, according to some embodiments of the present disclosure.
FIG. 4 is a flowchart of an exemplary method for processing a neural network, according to some embodiments of the present disclosure.
FIG. 5 is a flowchart of another exemplary method for processing a neural network, according to some embodiments of the present disclosure.
FIG. 6 is a schematic diagram illustrating processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the invention. Instead, they are merely examples of apparatuses, systems and methods consistent with aspects related to the invention as recited in the appended claims.
The methods and apparatuses disclosed herein can be used for configuring neural network processing units (NPUs) in various neural network-based architectures, such as deep neural networks (DNNs) , convolutional neural networks (CNNs) , recurrent neural networks (RNNs) , or the like.
FIG. 1 illustrates an exemplary neural network (NN) 100 in which embodiments of the present disclosure can be implemented. As depicted in FIG. 1, neural network 100 can include an input layer 120 that accepts inputs, e.g., inputs 110-1, ..., 110-m. Inputs can include an image, text, or any other structured or unstructured data for processing by neural network 100. In some embodiments, neural network 100 can accept a plurality of inputs simultaneously. For example, in FIG. 1, neural network 100 can accept up to m inputs simultaneously. Additionally or alternatively, input layer 120 can accept up to m inputs in rapid succession, e.g., such that input 110-1 is accepted by input layer 120 in one cycle, a second input is accepted by input layer 120 in a second cycle in which input layer 120 pushes data from input 110-1 to a first hidden layer, and so on. The present disclosure does not intend to limit the number of inputs, or the way of inputting, such as simultaneous input, rapid succession input, or the like.
Input layer 120 can comprise one or more nodes, e.g., nodes 120-1, 120-2, ..., 120-a. Each node can execute an activation function based on corresponding input (e.g., one or more of inputs 110-1, ..., 110-m) and scale the output from the activation function by a particular weight associated with the node. An activation function can comprise a Heaviside step function, a Gaussian function, a multiquadratic function, an inverse multiquadratic function, a sigmoidal function, a ReLU function, a Leaky ReLU function, a Tanh function, or the like. A weight can comprise a positive value between 0.0 and 1.0 or any other numerical value configured to allow some nodes in the layer to have corresponding output scaled more or less than output corresponding to other nodes in the layer. A plurality of weights can form a weight matrix.
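As a rough illustration of the node computation described above (the sizes, the sigmoid choice, and the matrix-vector form are assumptions made for the sketch, not limitations of FIG. 1):

```python
import numpy as np

def sigmoid(x):
    # One possible activation function from the list above.
    return 1.0 / (1.0 + np.exp(-x))

inputs = np.random.rand(4)            # inputs accepted by input layer 120 (m = 4 here)
weight_matrix = np.random.rand(6, 4)  # the nodes' weights, gathered into a weight matrix

# Each node's activation output is scaled by its associated weights; stacking
# the weights of all nodes gives the weight matrix applied below.
layer_output = weight_matrix @ sigmoid(inputs)
```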
As further depicted in FIG. 1, neural network 100 can include one or more hidden layers, e.g., hidden layers 130-1, ..., 130-n. Each hidden layer can comprise one or more nodes. For example, in FIG. 1, hidden layer 130-1 comprises nodes 130-1-1, 130-1-2, 130-1-3, ..., 130-1-b, and hidden layer 130-n comprises nodes 130-n-1, 130-n-2, 130-n-3, ..., 130-n-c. Similar to nodes of input layer 120, nodes of the hidden layers can execute activation functions based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
As further depicted in FIG. 1, neural network 100 can include an output layer 140 that finalizes outputs, e.g., outputs 150-1, 150-2, ..., 150-d. Output layer 140 can comprise one or more nodes, e.g., nodes 140-1, 140-2, ..., 140-d. Similar to nodes of input layer 120 and of the hidden layers, nodes of output layer 140 can execute activation functions  based on output from connected nodes of the previous layer and scale the output from the activation functions by particular weights associated with the nodes.
Although depicted as fully connected in FIG. 1, the layers of neural network 100 can use any connection scheme. For example, one or more layers (e.g., input layer 120, hidden layer 130-1, ..., hidden layer 130-n, output layer 140, or the like) can be connected using a convolutional scheme, a sparsely connected scheme, or the like. Such embodiments can use fewer connections between one layer and a previous layer than depicted in FIG. 1.
Moreover, although depicted as a feedforward network in FIG. 1, neural network 100 can additionally or alternatively use backpropagation, e.g., by using long short-term memory (LSTM) nodes or the like. Accordingly, although neural network 100 is depicted similar to a convolutional neural network (CNN) , neural network 100 can comprise a recurrent neural network (RNN) or any other types of neural network.
FIG. 2 illustrates an exemplary heterogeneous acceleration processing unit (HAPU) 200, according to some embodiments of the present disclosure. As shown in FIG. 2, HAPU 200 can include a plurality of cores 202 (e.g.,  cores  202a, 202b, 202c, and 202d) , an interface 204, a command parser (CP) 206, and a communication unit (CU) 208. It is appreciated that HAPU 200 can also include other components, such as a global memory (not shown) and the like. In some embodiments, HAPU 200 can be implemented as a neural network processing unit (NPU) .
Interface 204 can provide communication between HAPU 200 and external devices. For example, interface 204 can include a peripheral component interconnect express (PCI-E) interface to provide connection with a host unit (not shown in FIG. 2) . Interface 204 can also include a universal serial bus (USB) , a joint test action group (JTAG) interface, a TUN/TAP interface, and/or the like.
CP 206 can receive commands or instructions from external devices (e.g., via interface 204), and distribute the commands to a corresponding component, such as one or more cores 202 or communication unit 208. For example, CP 206 can interact with the host unit (e.g., under the supervision of a kernel mode driver (KMD)), and receive commands from the host unit. The commands can include a memory access command or a computation command. CP 206 can distribute a memory access command to CU 208, and a computation command to one or more cores 202.
CU 208 can be communicatively coupled with components of HAPU 200, and assist with transferring data between these components. For example, CU 208 can assist with transferring data between multiple cores 202 (e.g., cores 202a-202d) or within each core 202a-202d. CU 208 can also allow off-chip devices to access both on-chip and off-chip memory without causing an interrupt. For example, CU 208 can load data or instructions into local memory of cores 202. Thus, CU 208 can also generate memory addresses and initiate memory read or write cycles. CU 208 can also contain several hardware registers that can be written and read by the one or more cores 202, including a memory address register, a byte-count register, one or more control registers, and/or other types of registers. These registers can specify the source, the destination, the direction of the transfer (reading from the input/output (I/O) device or writing to the I/O device) , the size of the transfer unit, and/or the number of bytes to transfer in one burst. It is appreciated that each core (e.g., core 202a) can include a sub-CU (e.g., transmission engine 2026 as shown in FIG. 2) , which can be used to transfer data within the core and across cores.
In some embodiments, CU 208 can include a direct memory access (DMA) unit (not shown) and a bus (not shown) . The bus can provide high speed cross-core communication. The bus also connects cores 202 with other units, such as the off-chip memory or peripherals.
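For illustration only, the hardware registers described above can be pictured as a simple transfer descriptor. The field names and default values below are assumptions made for this sketch, not the actual register layout of CU 208:

```python
from dataclasses import dataclass
from enum import Enum

class Direction(Enum):
    READ_FROM_IO = 0     # reading from the I/O device
    WRITE_TO_IO = 1      # writing to the I/O device

@dataclass
class TransferDescriptor:
    # Mirrors the kinds of registers listed above for one block transfer.
    source_address: int        # memory address register (source)
    destination_address: int   # memory address register (destination)
    byte_count: int            # byte-count register
    direction: Direction       # control register: direction of the transfer
    unit_size: int = 4         # size of the transfer unit, in bytes
    burst_bytes: int = 64      # number of bytes to transfer in one burst

# A core could program one descriptor per block transfer, e.g.:
desc = TransferDescriptor(source_address=0x1000, destination_address=0x8000,
                          byte_count=4096, direction=Direction.WRITE_TO_IO)
```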
CU 208 can also move block data among cores 202 via a bus. While a single core 202 is capable of handling a typical training or inference task, a plurality of cores 202 can work together via the bus to take on large and complex tasks (e.g., processing a neural network with a large weight matrix) .
Cores 202a-202d (e.g., core 202a as shown on the right side of FIG. 2) can each include one or more computation engines configured to perform one or more operations based on commands, e.g., commands received from CP 206. The operations can include multiplication, addition, multiply-accumulate, convolution, element-wise operations, and the like. For example, the one or more computation engines of core 202a can include a convolution unit, a pooling unit, a matrix multiplication unit, an element-wise operation (EWOP) unit, and/or the like.
As shown in FIG. 2, each of cores 202a-202d (e.g., core 202a) can also include one or more local memories (LMs) 2022 and a transmission engine 2026. Local memory 2022 can provide storage space with fast read/write speed. In some embodiments, the storage space of local memory 2022 can be 250 megabytes (MB) or more, which can reduce interaction with a global memory. With local memory 2022, part or all of data access can be performed within each core 202a-202d, reducing the latency caused by data access.
Transmission engine 2026 can be included in CU 208 or in each core 202a-202d as an independent communication unit. Transmission engine 2026 can be communicatively coupled with components of core 202, e.g., local memory 2022 and computation engine 2024, and assist with transferring data or commands (or instructions) between these components. Transmission engine 2026 can also assist with communicating data or commands across cores. For example, transmission engine 2026 can transmit data from local memory 2022 or computation engine 2024 to components outside the core, e.g., CU 208, or receive data from components outside the core to local memory 2022.
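The components described above can be pictured with a minimal software model of a core and its communication path. The class and method names below are illustrative assumptions for this sketch, not the actual hardware interfaces of HAPU 200:

```python
import numpy as np

class Core:
    """Toy model of one core: a local memory plus a compute routine."""
    def __init__(self, name):
        self.name = name
        self.local_memory = {}          # stands in for local memory 2022

    def compute(self, input_key, weight_key):
        # Stands in for computation engine 2024: multiply the stored
        # weight part with the stored input.
        return self.local_memory[weight_key] @ self.local_memory[input_key]

class CommunicationUnit:
    """Toy model of CU 208 / transmission engine 2026: copies data across cores."""
    def transfer(self, src_core, dst_core, key):
        dst_core.local_memory[key] = src_core.local_memory[key]

# Usage: load a weight part and an input into core_a, then copy the input to core_b.
core_a, core_b = Core("core_a"), Core("core_b")
cu = CommunicationUnit()
core_a.local_memory["w_part"] = np.eye(2)
core_a.local_memory["input"] = np.array([1.0, 2.0])
cu.transfer(core_a, core_b, "input")
print(core_a.compute("input", "w_part"))   # [1. 2.]
```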
In some embodiments, each of cores 202a-202d can also include a sequencer (not shown) configured to retrieve commands and distribute the commands to other components of the core. For example, the sequencer can distribute a computation command to computation engine 2024 to perform a computation, or distribute a transmission command to transmission engine 2026 to perform a transmission operation.
FIG. 3A illustrates an exemplary machine learning system 300, according to some embodiments of the present disclosure. In some embodiments, machine learning system 300 can be implemented in a computing device or a terminal. As shown in FIG. 3A, machine learning system 300 can include a host unit 302 (e.g., a central processing unit (CPU) ) , a disk 304, a host memory 306, and a HAPU 308. In some embodiments, host memory 306 can be an integral memory or an external memory associated with host unit 302. Host memory 306 can be a local or a global memory. In some embodiments, disk 304 may comprise an external memory configured to provide additional memory for host unit 302.
Host unit 302 (e.g., an X86 or ARM central processing unit) can be coupled with host memory 306 and disk 304, and configured to process general instructions. For example, an operating system (OS) , a software, an application or a program can run on host unit 302. HAPU 308 can be coupled to host unit 302 through a peripheral interface (e.g., interface 204) . As referred to herein, a HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) can be a computing device for accelerating neural network processing tasks, e.g., neural network training or inference. In some embodiments, HAPU 308 can be configured to be used as a co-processor of host unit 302.
In some embodiments, a compiler can be included in a host unit (e.g., host unit 302 of FIG. 3A) , host memory (e.g., host memory 306 of FIG. 3A) or HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIG. 3A-3B) . The compiler can be configured to push one or more commands or instructions to HAPU. In some embodiments, the compiler can be  implemented as a program or computer software that transforms computer codes written in one programming language into instructions for HAPU to create an executable program. In machine learning applications, a compiler can perform a variety of operations, for example, pre-processing, lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, code optimization, and code generation, or combinations thereof. For example, the compiler can compile a neural network to generate static or semi-static parameters, e.g., connections among nodes (or neurons) and weights of the nodes.
As discussed above, the commands pushed into HAPU can be further distributed to corresponding components (e.g., one or more core 202 or CU 208 of FIG. 2) of HAPU by CP (e.g., CP 206 of FIG. 2) .
FIG. 3B illustrates a schematic diagram of an exemplary cloud system 310, according to some embodiments of the disclosure. The cloud system 310 can include a plurality of computing servers (e.g., computing servers 312 and 314) . In some embodiments, computing server 312 can, for example, include the machine learning system 300, which includes HAPU 308. The cloud system 310 may be connected to user devices via a network. With the assistance of HAPU 308, cloud system 310 can provide extended AI capabilities of image recognition, facial recognition, translations, 3D modeling, and the like.
It is appreciated, however, that a HAPU (e.g., HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A-3B) can be implemented in computing devices or terminals in various ways. For example, the HAPU can be integrated in a computing device or terminal, such as a smart phone, a tablet, a wearable device, or the like.
FIG. 4 is a flowchart of an exemplary method 400 for processing a neural network, according to some embodiments of the present disclosure. Method 400 can be implemented by a processing unit, such as HAPU 200 of FIG. 2 or HAPU 308 of FIGs. 3A- 3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, method 400 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
As shown in FIG. 4, at step 401, a plurality of inputs can be transmitted to each of a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B) in sequence. For example, with reference to FIG. 2, CU 208 can transmit a plurality of inputs to a plurality of cores 202 (e.g., cores 202a-202d) of HAPU 200 respectively. CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 of HAPU 200. In some embodiments, CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories of the plurality of cores 202. For example, CU 208 can load input_a, input_b, input_c and input_d to core 202a, core 202b, core 202c, and core 202d, respectively. In some embodiments, CU 208 can communicate (e.g., transfer or copy) an input from one core to another core. For example, for input_a, input_b, input_c and input_d, CU 208 can transfer input_a from core 202a to core 202d, transfer input_b from core 202b to 202a, transfer input_c from core 202c to core 202b, and transfer input_d from core 202d to core 202c. As another example, CU 208 can copy input_a from core 202a and save a copy of input_a in core 202d, copy input_b from core 202b and save a copy of input_b in 202a, copy input_c from core 202c and save a copy of input_c in core 202b, and copy input_d from core 202d and save a copy of input_d in core 202c. The HAPU may perform a plurality of rounds of communications until every input is received at each of the cores. In the case that the number of the cores is N, the HAPU may perform an initial round of loading of the inputs to respective cores of the HAPU and (N-1) rounds of communications of the current inputs in the cores to other cores of the HAPU in sequence. In  some embodiments, transmission engine 2026 can assist this communication by, e.g., reading the input from local memory and transmitting it to CU 208.
The plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network. In some embodiments, the plurality of inputs can include a plurality of activations. The number of the inputs can be equal to or less than the number of the cores in the HAPU. In the case that the number of inputs is less than the number of available cores, some of the cores may not have an input.
At step 403, at each of the plurality of cores, a computation is repeatedly performed using the part of a weight matrix corresponding to the core and the input received at the core. For example, during the initial loading of the inputs or each round of communication of the inputs from other cores, each of the plurality of cores can perform a computation using the part of the weight matrix corresponding to the core and an input received (e.g., loaded from an external memory or communicated from another core) at the core. With reference to FIG. 2, each core 202 (e.g., core 202a, core 202b, core 202c or core 202d) can perform a computation using the part of the weight matrix corresponding to the core and each input loaded or communicated to the core by CU 208. Each core can perform a plurality of rounds of computations, each round with a different input. The number of the rounds of computations performed on each core can be equal to the number of inputs.
The weight matrix relates to the neural network being processed. The weight matrix can be divided into a plurality of parts. The plurality of cores each has a corresponding part of the weight matrix. The number of parts of the weight matrix can be equal to the number of cores. Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) .
Since each part of the weight matrix has a smaller size than the entire weight matrix, requirements for computation and storage resources can be reduced. When the plurality of parts of the weight matrix are distributed to multiple cores, each core can have sufficient computation and storage resources to perform a computation with its corresponding part of the weight matrix.
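For example, splitting the weight matrix by rows gives each core a smaller block whose partial result is simply a slice of the full product. The following is a minimal sketch under the assumption of a row-wise split with illustrative sizes; a column-wise split works analogously, with a final sum instead of a concatenation:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 6))          # full weight matrix
x = rng.standard_normal(6)               # one input (activation vector)

num_cores = 4
parts = np.split(W, num_cores, axis=0)   # part_a..part_d, one 2x6 block per core

# Each core only needs storage for its 2x6 block instead of the full 8x6 matrix.
partial_results = [p @ x for p in parts]

# Concatenating the per-core partial results reproduces the full computation.
assert np.allclose(np.concatenate(partial_results), W @ x)
```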
In some embodiments, communication of an input to another core can be performed in parallel with current computation using this input. For example, with reference to FIG. 2, communication of input_a from core 202a to core 202d can be performed in parallel with computation on core 202a using input_a and corresponding part_a of the weight matrix, communication of input_b from core 202b to 202a can be performed in parallel with computation on core 202b using input_b and corresponding part_b of the weight matrix, and so on.
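One way to picture this overlap is double buffering: while the computation engine works on the current input, the transfer of that input to the next core runs concurrently. The sketch below uses a Python thread purely as a stand-in for the transmission engine; the concurrency primitive and function names are assumptions for illustration, not the hardware mechanism:

```python
import threading
import numpy as np

def compute(core_weight_part, current_input, results):
    # Stand-in for the computation engine working on the current input.
    results.append(core_weight_part @ current_input)

def transfer(current_input, next_core_buffer):
    # Stand-in for CU 208 moving the same input to the neighboring core.
    next_core_buffer.append(current_input.copy())

w_part = np.eye(3)
x = np.arange(3.0)
results, next_core_buffer = [], []

# Launch the transfer, do the computation, then wait for the transfer to finish.
t = threading.Thread(target=transfer, args=(x, next_core_buffer))
t.start()
compute(w_part, x, results)
t.join()
```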
At step 405, results of computations using an input received from another core can be communicated to the core which the input is initially loaded to. For example, with reference to FIG. 2, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d. Results of computations using input_a and a part of the weight matrix stored at core 202d can be communicated by CU 208 to core 202a, results of computations using input_b and a part of the weight matrix stored at core 202a can be communicated by CU 208 to core 202b, and so on. In some embodiments, transmission engine 2026 can perform the communication by, e.g., reading the result from local memory and transmitting it to CU 208.
In some embodiments, step 405 may be omitted from method 400.
In some embodiments, step 405 can be performed in parallel with current round of computations. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can  be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on.
By performing steps 401, 403, and 405, each of the plurality of cores performs rounds of computation using each of the inputs and the part of the weight matrix corresponding to the core. For example, referring to FIG. 2, each of input_a, input_b, input_c and input_d is computed with each part of the weight matrix corresponding to cores 202a-202d. After each of the plurality of inputs has been used by each of the plurality of cores for computation, method 400 may proceed to step 407.
At step 407, results of the computations can be output. The results can include computation results using all inputs and all parts of the weight matrix.
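Putting steps 401-407 together, the method can be pictured as a ring rotation of the inputs across the cores, with each core repeatedly multiplying whatever input it currently holds by its fixed weight part. The following is a sketch with hypothetical sizes; real hardware would overlap the rotation with the computation as noted above:

```python
import numpy as np

rng = np.random.default_rng(2)
num_cores = 4
parts  = [rng.standard_normal((2, 6)) for _ in range(num_cores)]   # weight part per core
inputs = [rng.standard_normal(6) for _ in range(num_cores)]        # one input per core

# results[c] collects, for core c, its partial result for every input index.
results = [dict() for _ in range(num_cores)]
current = list(range(num_cores))       # index of the input currently held by each core

for _ in range(num_cores):             # one round per input
    # Step 403: every core multiplies its weight part by its current input.
    for c in range(num_cores):
        results[c][current[c]] = parts[c] @ inputs[current[c]]
    # Ring rotation (steps 401/405 style): core c passes its input to core c-1,
    # i.e., core c now holds the input previously held by core (c + 1) % num_cores.
    current = [current[(c + 1) % num_cores] for c in range(num_cores)]

# After num_cores rounds every input has met every weight part exactly once.
for i in range(num_cores):
    full = np.concatenate([results[c][i] for c in range(num_cores)])
    assert np.allclose(full, np.vstack(parts) @ inputs[i])
```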
FIG. 5 illustrates a flowchart of another exemplary method 500 for processing a neural network, according to some embodiments of the present disclosure. Method 500 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, method 500 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
At step 501, as shown in FIG. 5, a plurality of inputs can be loaded onto a plurality of cores of a HAPU (e.g., HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B). For example, as discussed above with reference to FIG. 2, CU 208 can interact with external components (e.g., external memory) and load the plurality of inputs onto cores 202 (e.g., cores 202a-d) of HAPU 200. In some embodiments, CU 208 can receive a command (e.g., a memory access command) from CP 206, and in accordance with the command, load the plurality of inputs from external memory to local memories 2022 of the plurality of cores 202. For example, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d.
The plurality of inputs can include an image, text, or any other structured or unstructured data for processing by the neural network. In some embodiments, the plurality of inputs can include a plurality of activations. A number of the inputs can be equal to or less than a number of the cores. In the case that the number of inputs is less than the number of cores, some of the plurality of cores do not have an input.
At step 503, at each core of the plurality of cores, a computation can be performed using corresponding part of a weight matrix and an input loaded onto the core. For example, each of the plurality of cores can perform a computation using the corresponding part of the weight matrix and an input loaded to the core. The weight matrix relates to the neural network under processing. The weight matrix can be divided into a plurality of parts. The plurality of cores can each have a corresponding part of the weight matrix. The number of parts of the weight matrix can be equal to the number of cores. Each core can store a corresponding part of the weight matrix in its local memory. For example, with reference to FIG. 2, CU 208 can load a plurality of parts (e.g., part_a, part_b, part_c and part_d) of a weight matrix into local memories 2022 of a plurality of cores 202 (e.g., core 202a, core 202b, core 202c and core 202d) . Each core 202 (e.g., core 202a, core 202b, core 202c or core 202d) can perform a computation using the corresponding part (e.g., part_a, part_b, part_c or part_d) of the weight matrix and each input loaded to the core.
At step 505, an input on one core can be communicated to another core. In some embodiments, the input is sequentially communicated to another core. For example, with reference to FIG. 2, CU 208 can sequentially communicate input_a from core 202a to core 202d, input_b from core 202b to core 202a, input_c from core 202c to core 202b, and input_d from core 202d to core 202c. In some embodiments, transmission engine 2026 can assist this communication. For example, transmission engine 2026 on core 202a can read input_a from local memory 2022 and transmit it to CU 208.
At step 507, at each core of the plurality of cores, a computation can be performed using corresponding part of the weight matrix and an input communicated to the core. For example, with reference to FIG. 2, core 202a can perform a computation using input_b and part_a of the weight matrix, core 202b can perform a computation using input_c and part_b of the weight matrix, and so on.
At step 509, a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to in step 501. For example, with reference to FIG. 2, CU 208 can load input_a to core 202a, input_b to core 202b, input_c to core 202c, and input_d to core 202d. Results of computations using input_a and part_b, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202a, results of computations using input_b and part_a, part_c and part_d of the weight matrix can be communicated by CU 208 to core 202b, and so on. The communication of a result of computation can be performed in parallel with the next round of computation. For example, with reference to FIG. 2, communication of a result of a computation on core 202a using input_b and part_a of the weight matrix to core 202b can be performed in parallel with computation on core 202a using input_c and part_a of the weight matrix, and so on. In some embodiments, step 509 may be omitted from method 500.
At step 511, whether every input has been circulated through each of the plurality of cores can be determined. If not (e.g., indicated by NO in FIG. 5) , method 500 proceeds back to step 505, and performs another round of computations and communications. At step 505, an input on one core can be communicated to another core. The communication of the input can be performed in parallel with the computation using the input. At step 507, each core can perform another computation using corresponding part of the weight matrix  and an input communicated to the core. In some embodiments, at step 509, a result of the computation using a communicated input can be communicated to the core which the communicated input is initially loaded to. The communication of the result of the computation can be performed in parallel with next round of computation. For example, with reference to FIG. 2, at step 505, CU 208 can communicate input_b from core 202a to core 202d, input_c from core 202b to core 202a, input_d from core 202c to core 202b, and input_a from core 202d to core 202c. At step 507, core 202a can perform a computation using input_c and part_a of the weight matrix, core 202b can perform a computation using input_d and part_b of the weight matrix, and so on. At step 509, the result of computation on core 202a using input_c and part_a of the weight matrix can be communicated to core 202c, the result of computation on core 202b using input_d and part_b of the weight matrix can be communicated to core 202d, and so on.
Method 500 can include a plurality of rounds of communications and computations (e.g., steps 505 and 507) until every input goes through each of the cores. In some embodiments, communication of an input can be performed in parallel with current computations using this input. For example, with reference to FIG. 2, communication of input_b from core 202a to core 202d can be performed in parallel with computation on core 202a using input_b and part_a of the weight matrix, communication of input_c from core 202b to 202a can be performed in parallel with computation on core 202b using input_c and part_b of the weight matrix, and so on.
If every input has been circulated through each of the plurality of cores (e.g., indicated by YES in FIG. 5) , method 500 proceeds to step 513. At step 513, results of the computations can be output. The results can include computation results using each of the inputs and each part of the weight matrix corresponding to the plurality of cores.
FIG. 6 is a schematic diagram illustrating an exemplary process 600 of processing a neural network using a plurality of cores of an HAPU, according to some embodiments of the present disclosure. It is appreciated that process 600 can be implemented by a processing unit, such as HAPU 200 of FIG. 2, HAPU 308 of FIGs. 3A-3B, a computing device, such as machine learning system 300 of FIGs. 3A-3B, or a cloud system, such as cloud system 310 of FIG. 3B. In some embodiments, process 600 can be implemented by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers.
As shown in FIG. 6, the HAPU can include four cores, core_0, core_1, core_2 and core_3. Each core can be associated with (e.g., store) a corresponding part of the weight matrix. A weight matrix can be divided into four parts, w0, w1, w2 and w3, which are distributed to core_0, core_1, core_2 and core_3, respectively. For example, a core can store its corresponding part of the weight matrix in local memory. It is appreciated that while four cores and four parts of the weight matrix are shown, the HAPU can include more or fewer cores and the weight matrix can include more or fewer parts. In some embodiments, a number of parts of the weight matrix can be equal to a number of cores on the HAPU. In some other embodiments, the number of parts of the weight matrix can be less than the number of cores on the HAPU. In this case, some of the cores on the HAPU have no corresponding part of the weight matrix.
A plurality of inputs, e.g., b0, b1, b2 and b3 as shown in FIG. 6, are loaded onto the plurality of cores on the HAPU, e.g., core_0, core_1, core_2 and core_3, respectively. In some embodiments, the number of inputs can be equal to the number of cores on the HAPU each having a part of weight matrix. In some other embodiments, the number of inputs can be less than the number of cores on the HAPU each having a part of weight matrix.
At time t0, each core can perform a first round of computation using an input on the core and a part of the weight matrix corresponding to the core. For example, core_0 can perform a first round of computation using an input b0 on core_0 and w0 of the weight matrix, core_1 can perform a first round of computation using an input b1 on core_1 and w1 of the weight matrix, and so on. Each core can store the result of this round of computation (shown as b0/w0, b1/w1, b2/w2 or b3/w3 in FIG. 6) in its local memory. In some embodiments, each core can also store the result of this round of computation (e.g., b0/w0, b1/w1, b2/w2 and b3/w3) at a corresponding address in an output (e.g., output_0, output_1, output_2 or output_3).
In addition, an input on one core can be communicated to another core, for example, in a sequential order. With reference to FIG. 2, CU 208 can perform the communication with the assistance of transmission engine 2026. For example, transmission engine 2026 can read an input from local memory 2022 and transmit it to CU 208, which communicates it to another core. As shown in FIG. 6, input b0 can be communicated from core_0 to core_3, input b1 can be communicated from core_1 to core_0, input b2 can be communicated from core_2 to core_1, and input b3 can be communicated from core_3 to core_2. The communication of an input can be performed in parallel with the computation on the core using this input.
At time t1, each core can perform a second round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a second round of computation using an input b1 on the core_0 and w0 of the weight matrix, core_1 can perform a second round of computation using an input b2 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the second round of computation (shown as b1/w0, b2/w1, b3/w2 and b0/w3 in FIG. 6) in its local memory.
In addition, a second round of sequential communication of an input on one core to another core can be performed. As shown in FIG. 6, input b1 on core_0 can be communicated to core_3, input b2 on core_1 can be communicated to core_0, input b3 on core_2 can be communicated to core_1, and input b0 on core_3 can be communicated to core_2. The second round of communication of an input can also be performed in parallel with the second round of computation on the core using this input.
At time t2, each core can perform a third round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a third round of computation using an input b2 on the core_0 and w0 of the weight matrix, core_1 can perform a third round of computation using an input b3 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the third round of computation (shown as b2/w0, b3/w1, b0/w2 and b1/w3 in FIG. 6) in its local memory.
In addition, a third round of communication of an input on one core to another core can be performed. As shown in FIG. 6, input b2 on core_0 can be communicated to core_3, input b3 on core_1 can be communicated to core_0, input b0 on core_2 can be communicated to core_1, and input b1 on core_3 can be communicated to core_2. The third round of communication of an input can also be performed in parallel with the third round of computation on the core using the input.
In some embodiments, a result of previous round (e.g., second round) of computation can be communicated to the core which the input is initially loaded to. With reference to FIG. 2, CU 208 can perform the communication of the result with assistance of transmission engine 2026. For example, transmission engine 2026 can transmit the result from the local memory 2022 to CU 208 which communicates it to the corresponding core. As shown in the shaded blocks of FIG. 6, result b1/w0 on core_0 can be communicated to  core_1, result b2/w1 on core_1 can be communicated to core_2, result b3/w2 on core_2 can be communicated to core_3, and result b0/w3 on core_3 can be communicated to core_0. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) . In some embodiments, the communication of the result of previous round (e.g., second round) of computation can be performed in parallel with current round (e.g., third round) of computation.
At time t3, each core can perform a fourth round of computation using an input on the core and the part of the weight matrix corresponding to the core. For example, core_0 can perform a fourth round of computation using an input b3 on the core_0 and w0 of the weight matrix, core_1 can perform a fourth round of computation using an input b0 on the core_1 and w1 of the weight matrix, and so on. Each core can store the result of the fourth round of computation (shown as b3/w0, b0/w1, b1/w2, and b2/w3 in FIG. 6) in its local memory.
In some embodiments, a result of previous round (e.g., third round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to. As shown in FIG. 6, result b2/w0 on core_0 can be communicated to core_2, result b3/w1 on core_1 can be communicated to core_3, result b0/w2 on core_2 can be communicated to core_0, and result b1/w3 on core_3 can be communicated to core_1. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) . In some embodiments, the communication of the result of previous round (e.g., third round) of computation can be performed in parallel with current round (e.g., fourth round) of computation.
In some embodiments, a result of the final round (e.g., fourth round) of computation using an input and a part of the weight matrix can be communicated to the core which the input is initially loaded to. As shown in FIG. 6, result b3/w0 on core_0 can be  communicated to core_3, result b0/w1 on core_1 can be communicated to core_0, result b1/w2 on core_2 can be communicated to core_1, and result b2/w3 on core_3 can be communicated to core_2. The communicated results can be stored at respective addresses in outputs (e.g., output_0, output_1, output_2, and output_3) .
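After the final round in FIG. 6, each output_i gathered on core_i contains the partial results of input b_i against every weight part w0-w3, so stacking the parts in order reproduces what a single core holding the whole weight matrix would have computed. A small check of that property, with hypothetical data standing in for b0-b3 and w0-w3:

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 5))                  # full weight matrix
w_parts = np.split(W, 4, axis=0)                 # w0, w1, w2, w3 (one per core)
b = [rng.standard_normal(5) for _ in range(4)]   # inputs b0..b3

# output_i as assembled on core_i from the partial results communicated back to it
# (shaded blocks of FIG. 6): b_i against w0, w1, w2, w3, stored in order.
outputs = [np.concatenate([w_parts[k] @ b[i] for k in range(4)]) for i in range(4)]

for i in range(4):
    assert np.allclose(outputs[i], W @ b[i])     # matches the monolithic computation
```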
In some embodiments, outputs (e.g., output_0, output_1, output_2, and output_3) can be provided to other components of the HAPU or neural network.
Embodiments of the disclosure can bring many technical advantages. For example, in some embodiments of the disclosure, a plurality of cores can each have a part of, rather than the entire, weight matrix, and can perform parallel computations using parts of the weight matrix and multiple inputs. Some embodiments of the disclosure can provide fast communication of data (e.g., inputs or results of computations) across cores, and perform the communication in parallel with computation, which can significantly reduce time for processing a neural network.
Embodiments of the disclosure can be applied to many products, environments, and scenarios. For example, some embodiments of the disclosure can be applied to Ali-NPU (e.g., Hanguang NPU) , Ali-Cloud, Ali-DAU (Database Acceleration Unit) , Ali-AI platform, GPU, TPU, or the like.
The various example embodiments described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer readable medium may include removable and nonremovable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVDs), etc. Generally, program modules may include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
The embodiments may further be described using the following clauses:
1. A method for processing a neural network, comprising:
receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
receiving a first input of the plurality of inputs at a core of the plurality of cores;
performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
communicating the first input from the core to another core of the plurality of cores.
2. The method according to clause 1, further comprising:
receiving, at the core, a second input from yet another core of the plurality of cores;
performing, at the core, a second computation using the first part of the weight matrix and the second input; and
communicating a result of the second computation from the core to the yet another core.
3. The method according to clause 1 or clause 2, wherein:
communicating the first input is performed in parallel with the first computation.
4. The method according to any of clauses 1-3, further comprising: communicating the second input from the core to the another core.
5. The method according to clause 4, wherein communicating the second input is performed in parallel with the second computation.
6. The method according to any of clauses 1-4, further comprising:
performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
7. The method according to any of clauses 1-6, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
8. The method according to any of clauses 1-7, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
9. The method according to any of clauses 1-8, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
10. The method according to any of clauses 1-9, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
11. A heterogeneous acceleration processing unit (HAPU) , comprising:
a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
receive a first input of a plurality of inputs; and
perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
12. The heterogeneous acceleration processing unit according to clause 11, wherein the core is configured to:
receive a second input from yet another core of the plurality of cores; and
perform a second computation using the first part of the weight matrix and the second input, and
wherein the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
13. The heterogeneous acceleration processing unit according to clauses 11 or 12, wherein the communication of the first input from the core to the another core by the communication unit is performed in parallel with the first computation by the core.
14. The heterogeneous acceleration processing unit according to any of clauses 11-13, wherein the communication unit is configured to communicate the second input from the core to the another core.
15. The heterogeneous acceleration processing unit according to clause 14, wherein the communication of the second input from the core to the another core by the communication unit is performed in parallel with the second computation by the core.
16. The heterogeneous acceleration processing unit according to any of clauses 11-15, wherein the another core is configured to perform a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
17. The heterogeneous acceleration processing unit according to any of clauses 11-16, wherein the communication unit is configured to load the first input from an external memory to the core.
18. The heterogeneous acceleration processing unit according to any of clauses 11-17, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
19. The heterogeneous acceleration processing unit according to any of clauses 11-18, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
20. The heterogeneous acceleration processing unit according to any of clauses 11-19, wherein the core comprises:
a local memory for storing the first part of the weight matrix and a result of the first computation;
at least one computation engine communicatively coupled with the local memory and configured to perform the first computation; and
a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
21. A non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
receiving a plurality of inputs;
receiving a first input of the plurality of inputs at a core of the plurality of cores, wherein a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
communicating the first input from the core to another core of the plurality of cores.
22. The non-transitory computer readable storage media according to clause 21, wherein the method further comprises:
receiving, at the core, a second input from yet another core of the plurality of cores;
performing, at the core, a second computation using the first part of the weight matrix and the second input; and
communicating a result of the second computation from the core to the yet another core.
23. The non-transitory computer readable storage media according to clause 21 or clause 22, wherein:
communicating the first input is performed in parallel with the first computation.
24. The non-transitory computer readable storage media according to any of clauses 21-23, wherein the method further comprises:
communicating the second input from the core to the another core.
25. The non-transitory computer readable storage media according to clause 24, wherein communicating the second input is performed in parallel with the second computation.
26. The non-transitory computer readable storage media according to any of clauses 21-25, wherein the method further comprises:
performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
27. The non-transitory computer readable storage media according to any of clauses 21-26, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
28. The non-transitory computer readable storage media according to any of clauses 21-27, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
29. The non-transitory computer readable storage media according to any of clauses 21-28, wherein the number of the plurality of inputs is equal to or less than the number of the plurality of cores.
30. The non-transitory computer readable storage media according to any of clauses 21-29, wherein the processing unit comprises a heterogeneous acceleration processing unit (HAPU) .
31. A terminal, comprising:
a host unit; and
a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit, comprising:
a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
receive a first input of a plurality of inputs; and
perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to precise forms or embodiments disclosed. Modifications and adaptations of the embodiments will be apparent from consideration of the specification and practice of the disclosed embodiments. For example, the described implementations include hardware, but systems and methods consistent with the present disclosure can be implemented with hardware and software. In addition, while certain components have been described as being coupled to one another, such components may be integrated with one another or distributed in any suitable fashion.
Moreover, while illustrative embodiments have been described herein, the scope includes any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments) , adaptations or alterations based on the present disclosure. The elements in the claims are to be interpreted broadly based on the language employed in the claims and not limited to examples described in the present specification or during the prosecution of the application, which examples are to be construed as nonexclusive. Further, the steps of the disclosed methods can be modified in any manner, including reordering steps and/or inserting or deleting steps.
The features and advantages of the disclosure are apparent from the detailed specification, and thus, it is intended that the appended claims cover all systems and methods falling within the true spirit and scope of the disclosure. As used herein, the indefinite articles  “a” and “an” mean “one or more. ” Similarly, the use of a plural term does not necessarily denote a plurality unless it is unambiguous in the given context. Further, since numerous modifications and variations will readily occur from studying the present disclosure, it is not desired to limit the disclosure to the exact construction and operation illustrated and described, and accordingly, all suitable modifications and equivalents may be resorted to, falling within the scope of the disclosure.
As used herein, unless specifically stated otherwise, the term “or” encompasses all possible combinations, except where infeasible. For example, if it is stated that a component may include A or B, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or A and B. As a second example, if it is stated that a component may include A, B, or C, then, unless specifically stated otherwise or infeasible, the component may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
Other embodiments will be apparent from consideration of the specification and practice of the embodiments disclosed herein. It is intended that the specification and examples be considered as example only, with a true scope and spirit of the disclosed embodiments being indicated by the following claims.

Claims (20)

  1. A method for processing a neural network, comprising:
    receiving a plurality of inputs at a processing unit, wherein the processing unit includes a plurality of cores, and a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
    receiving a first input of the plurality of inputs at a core of the plurality of cores;
    performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    communicating the first input from the core to another core of the plurality of cores.
  2. The method according to claim 1, further comprising:
    receiving, at the core, a second input from yet another core of the plurality of cores;
    performing, at the core, a second computation using the first part of the weight matrix and the second input; and
    communicating a result of the second computation from the core to the yet another core.
  3. The method according to claim 1, wherein:
    communicating the first input is performed in parallel with the first computation.
  4. The method according to claim 1, further comprising:
    communicating the second input from the core to the another core.
  5. The method according to claim 4, wherein communicating the second input is performed in parallel with the second computation.
  6. The method according to claim 1, further comprising:
    performing, at the another core, a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
  7. The method according to claim 1, wherein receiving the first input at the core comprises loading the first input from an external memory to the core.
  8. The method according to claim 1, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
  9. A heterogeneous acceleration processing unit (HAPU) , comprising:
    a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
    receive a first input of a plurality of inputs; and
    perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
  10. The heterogeneous acceleration processing unit according to claim 9, wherein the core is configured to:
    receive a second input from yet another core of the plurality of cores; and
    perform a second computation using the first part of the weight matrix and the second input, and
    wherein the communication unit is configured to communicate a result of the second computation from the core to the yet another core.
  11. The heterogeneous acceleration processing unit according to claim 9, wherein the communication of the first input from the core to the another core by the communication unit is performed in parallel with the first computation by the core.
  12. The heterogeneous acceleration processing unit according to claim 10, wherein the communication unit is configured to communicate the second input from the core to the another core.
  13. The heterogeneous acceleration processing unit according to claim 12, wherein the communication of the second input from the core to the another core by the communication unit is performed in parallel with the second computation by the core.
  14. The heterogeneous acceleration processing unit according to claim 9, wherein the another core is configured to perform a computation using the first input and a second part of the weight matrix, wherein the another core is associated with the second part of the weight matrix.
  15. The heterogeneous acceleration processing unit according to claim 9, wherein the communication unit is configured to load the first input from an external memory to the core.
  16. The heterogeneous acceleration processing unit according to claim 9, wherein the core is configured to receive each of the plurality of inputs at a plurality of different time instances, and perform a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core is configured to use a different input of the plurality of inputs.
  17. The heterogeneous acceleration processing unit according to claim 9, wherein the core comprises:
    a local memory for storing the first part of the weight matrix and a result of the first computation;
    at least one computation engine communicatively coupled with the local memory and configured to perform the first computation; and
    a transmission engine communicatively coupled with the local memory and configured to transmit the first input.
  18. A non-transitory computer readable storage media storing a set of instructions that are executable by one or more processors in a computing device comprising a processing unit including a plurality of cores to cause the processing unit to perform a method comprising:
    receiving a plurality of inputs;
    receiving a first input of the plurality of inputs at a core of the plurality of cores, wherein a weight matrix is divided into a plurality of parts each of which is assigned to one of the plurality of cores;
    performing, at the core, a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    communicating the first input from the core to another core of the plurality of cores.
  19. The non-transitory computer readable storage media according to claim 18, wherein the core receives each of the plurality of inputs at a plurality of different time instances, and the core performs a plurality of rounds of computation using the first part of the weight matrix and one of the plurality of inputs, and wherein in each round of computation, the core uses a different input of the plurality of inputs.
  20. A terminal, comprising:
    a host unit; and
    a heterogeneous acceleration processing unit (HAPU) communicatively coupled to the host unit, comprising:
    a plurality of cores, a weight matrix being divided into a plurality of parts each of which is assigned to one of the plurality of cores, a core of the plurality of cores being configured to:
    receive a first input of a plurality of inputs; and
    perform a first computation using the first input and a first part of the weight matrix, wherein the first part of the weight matrix is associated with the core; and
    a communication unit communicatively coupled with the plurality of cores and configured to communicate the first input from the core to another core of the plurality of cores.
PCT/CN2020/070943 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network WO2021138842A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/070943 WO2021138842A1 (en) 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network

Publications (1)

Publication Number Publication Date
WO2021138842A1 true WO2021138842A1 (en) 2021-07-15

Family

ID=76787679

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/070943 WO2021138842A1 (en) 2020-01-08 2020-01-08 Methods and apparatuses for processing neural network

Country Status (1)

Country Link
WO (1) WO2021138842A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170193368A1 (en) * 2015-12-30 2017-07-06 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
US20170193361A1 (en) * 2015-12-31 2017-07-06 Microsoft Technology Licensing, Llc Neural network training performance optimization framework
US20180046897A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Hardware accelerator for compressed rnn on fpga
US20190362223A1 (en) * 2017-10-20 2019-11-28 Google Llc Parallel processing for signal generation neural networks
US20190303750A1 (en) * 2019-06-17 2019-10-03 Intel Corporation Reconfigurable memory compression techniques for deep neural networks

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220284658A1 (en) * 2021-03-03 2022-09-08 Nvidia Corporation Fully-fused neural network execution
US11610360B2 (en) 2021-03-03 2023-03-21 Nvidia Corporation Real-time neural network radiance caching for path tracing
US11631210B2 (en) * 2021-03-03 2023-04-18 Nvidia Corporation Fully-fused neural network execution
US11935179B2 (en) 2021-03-03 2024-03-19 Nvidia Corporation Fully-fused neural network execution


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20912434

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20912434

Country of ref document: EP

Kind code of ref document: A1