WO2020125092A1 - Computing device and board card - Google Patents

Computing device and board card

Info

Publication number
WO2020125092A1
Authority
WO
WIPO (PCT)
Prior art keywords
gate
result
operator
processing circuit
output
Application number
PCT/CN2019/105932
Other languages
English (en)
French (fr)
Inventor
孟小甫
陈翊辉
蓝思明
齐豪
Original Assignee
中科寒武纪科技股份有限公司
Priority claimed from CN201811560966.5A (CN109711540B)
Priority claimed from CN201811579542.3A (CN109670581B)
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2020125092A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • This application relates to the field of neural networks, in particular to a computing device and a board.
  • LSTM: Long Short-Term Memory network.
  • RNN: recurrent neural network.
  • LSTM is suitable for processing and predicting important events with very long intervals and delays in a time series.
  • the LSTM network shows better performance; it is well suited to learning from experience in order to classify, process, and predict time series in which the interval between important events is of unknown length.
  • LSTM networks are widely used.
  • the existing LSTM network is implemented based on a general-purpose processor, and the existing processor has high energy consumption for performing LSTM operations.
  • This application provides a calculation method and related products, which can increase the processing speed of LSTM and save power consumption.
  • a computing device for performing an LSTM operation is provided, where the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate, and the computing device includes: an arithmetic unit, a controller unit, and a storage unit;
  • the storage unit is used to store the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, output state value Ct;
  • the controller unit is used to obtain the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and to send the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator to the arithmetic unit;
  • the arithmetic unit is used to perform the input gate operation, the forget gate operation, the output gate operation, and the update status gate operation based on the input data Xt, the weight data, the input result ht-1, and the LSTM operator to obtain the output result of each gate, and to obtain the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
  • the arithmetic unit includes: a master processing circuit and a slave processing circuit;
  • the controller unit is specifically configured to construct a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator based on the LSTM operator;
  • the main processing circuit is specifically used to reorder the input data Xt, the weight data, and the input state value according to the sorting operator, where the weight data includes the weight data of each gate; to separate the weight data of each gate according to the split operator and broadcast the weight data of each gate and the multiplication operator to the slave processing circuit; and to split the input data and the input state value into multiple input data blocks and multiple input state data blocks and distribute the multiple input data blocks and the multiple input state data blocks to the slave processing circuit;
  • the slave processing circuit is configured to multiply the multiple input data blocks with the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, to multiply the multiple input state data blocks with the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and to send the intermediate result of each gate and the state intermediate result of each gate to the main processing circuit;
  • the main processing circuit is used to sort the intermediate result of each gate according to the sorting operator to obtain the sorting result of each gate, and to perform an offset operation on the sorting result of each gate according to the addition operator to obtain the calculation result of each gate; to sort the state intermediate result of each gate according to the sorting operator to obtain the state sorting result of each gate, and to perform an offset operation on the state sorting result of each gate according to the addition operator to obtain the state calculation result of each gate; and to add the calculation result of each gate and the state calculation result of each gate correspondingly according to the addition operator and then perform subsequent processing to obtain the output result of each gate (an illustrative sketch of this split, broadcast, and reassembly flow follows this list).
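As a rough illustration of this split-and-broadcast scheme (not the patent's implementation; the batch-wise split, the dense NumPy layout, and names such as num_slaves are assumptions), the sketch below distributes row blocks of the input to "slave" calls that each multiply against the broadcast gate weights, after which the "master" reassembles the partial results, adds the offset, and applies the activation:

```python
import numpy as np

def slave_multiply(x_block, W_gate):
    # Each "slave" multiplies its input data block by the broadcast gate weights.
    return x_block @ W_gate.T            # intermediate result for this block

def master_lstm_gate(x, W_gate, b_gate, activation, num_slaves=4):
    # The "master" splits the input into blocks and distributes them (here: a plain loop).
    blocks = np.array_split(x, num_slaves, axis=0)
    partials = [slave_multiply(blk, W_gate) for blk in blocks]
    # The master reassembles the partial results (the "sorting" step), adds the offset, and activates.
    gate_in = np.concatenate(partials, axis=0) + b_gate
    return activation(gate_in)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy shapes: a batch of 8 input vectors of width 16, hidden size 4.
x_t = np.random.randn(8, 16)
W_f, b_f = np.random.randn(4, 16), np.random.randn(4)
f_t = master_lstm_gate(x_t, W_f, b_f, sigmoid)
print(f_t.shape)   # (8, 4)
```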
  • the main processing circuit is specifically configured to multiply the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain a first result, to multiply the output result gt of the update status gate by the output result it of the input gate according to the multiplication operator to obtain a second result, and to add the first result and the second result to obtain the output state value Ct.
  • the main processing circuit is specifically configured to perform an activation operation on the output state value Ct according to the activation operator to obtain an activation result, and multiply the output result Ot of the output gate and the activation result to obtain the output result ht.
  • the subsequent processing specifically includes:
  • the subsequent processing is an activation operation using the tanh function.
  • the main processing circuit is further configured to use the output data ht as the input result at the next moment and the output state value Ct as the input state value at the next moment.
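This recurrence over time can be made explicit with a minimal sketch (lstm_cell is a hypothetical helper standing in for the gate computations described above; it is not an API of this device):

```python
# Minimal sketch of the recurrence: ht and Ct produced at time t become the
# input result and input state value of time t+1. lstm_cell is a hypothetical
# function implementing the gate computations described above.
def run_lstm(lstm_cell, inputs, h0, c0):
    h, c = h0, c0
    outputs = []
    for x_t in inputs:            # iterate over the time series
        h, c = lstm_cell(x_t, h, c)
        outputs.append(h)
    return outputs, (h, c)
```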
  • the operation unit includes: a tree module, and the tree module includes: a root port and multiple branch ports, where the root port of the tree module is connected to the main processing circuit, and each of the multiple branch ports of the tree module is connected to one of the plurality of slave processing circuits;
  • the tree module is used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
  • the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit,
  • the branch processing circuit is used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
  • the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to an adjacent other slave processing circuit, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column (see the index sketch below);
  • the K slave processing circuits are used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
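As a concrete reading of which K slave processing circuits the master connects to in an m x n array (an illustrative index computation only; 0-based indices are an assumption):

```python
def k_border_slaves(m, n):
    """(row, col) indices of the K slave circuits connected to the master:
    the n circuits of the first row, the n circuits of the m-th row, and
    the m circuits of the first column (shared corners counted once)."""
    first_row = {(0, c) for c in range(n)}
    last_row = {(m - 1, c) for c in range(n)}
    first_col = {(r, 0) for r in range(m)}
    return sorted(first_row | last_row | first_col)

print(len(k_border_slaves(4, 5)))   # K = 2*5 + 4 - 2 shared corners = 12
```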
  • the main processing circuit includes: a conversion processing circuit
  • the conversion processing circuit is configured to perform conversion processing on the data, specifically: performing interchange between the first data structure and the second data structure on the data received by the main processing circuit.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit;
  • the multiplication processing circuit is used to perform a product operation on the element value in the received input data block and the element value in the corresponding position in the weight of each gate to obtain the product result of each gate;
  • the multiplication processing circuit is further used to perform a product operation on the element value in the received input state data block and the element value in the corresponding position in the weight of each gate to obtain another product result of each gate;
  • the accumulation processing circuit is configured to perform an accumulation operation on the product result of each gate to obtain the intermediate result of each gate, and to perform an accumulation operation on the other product result of each gate to obtain the state intermediate result of each gate.
  • the tree module is an n-tree structure, where n is an integer greater than or equal to 2.
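For intuition, the collection direction of such a tree module (here n = 2, i.e. a binary tree) can be pictured as a pairwise combination of the slaves' partial results on the way up to the root port; the pairwise summation below is only an illustrative choice of what gets forwarded:

```python
def tree_reduce(partials, combine=lambda a, b: a + b):
    """Pairwise (binary-tree) combination of the slaves' partial results,
    mirroring leaf values travelling up through branch ports to the root."""
    level = list(partials)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

print(tree_reduce([1, 2, 3, 4, 5]))      # 15
```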
  • an embodiment of the present application provides an LSTM computing device, the LSTM computing device includes one or more computing devices provided in the first aspect, and is used to obtain data to be calculated and control information from other processing devices, perform the specified LSTM operation, and pass the execution result to other processing devices through the I/O interface;
  • the multiple computing devices may be connected and transmit data through a specific structure;
  • a plurality of the computing devices interconnect and transmit data through a peripheral component interconnect express (PCIE) bus to support larger-scale LSTM operations; a plurality of the computing devices share the same control system or have their own control systems; a plurality of the computing devices share memory or have their own memories; and the interconnection mode of the plurality of computing devices is any interconnection topology.
  • a combined processing device including the LSTM computing device of the second aspect, a universal interconnection interface, and other processing devices;
  • the LSTM computing device interacts with the other processing device to jointly complete the calculation operation specified by the user.
  • in a fourth aspect, a neural network chip is provided, which includes the computing device provided in the first aspect, the LSTM computing device provided in the second aspect, or the combined processing device provided in the third aspect.
  • in a fifth aspect, an electronic device is provided, which includes the chip provided in the fourth aspect.
  • a board card including: a storage device, an interface device and a control device, and a neural network chip provided in the fourth aspect;
  • the neural network chip is respectively connected to the storage device, the control device and the interface device;
  • the storage device is used to store data
  • the interface device is used to realize data transmission between the chip and an external device
  • the control device is used for monitoring the state of the chip.
  • an embodiment of the present application further provides an LSTM operation method.
  • the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate, and the computing device includes: an arithmetic unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, and output state value Ct;
  • the method includes the following steps:
  • the controller unit acquires the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and sends the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator to the arithmetic unit;
  • the arithmetic unit performs the input gate operation, the forget gate operation, the output gate operation, and the update status gate operation according to the input data Xt, the weight data, the input result ht-1, and the LSTM operator to obtain the output result of each gate, and obtains the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
  • the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, Cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicle includes an airplane, a ship, and/or a car;
  • the household appliance includes a TV, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and a range hood;
  • the medical device includes a nuclear magnetic resonance instrument, an ultrasound instrument, and/or an electrocardiograph.
  • an operation method of a gated recurrent unit (GRU) is provided, where the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer.
  • the operation method is applied to a computing device,
  • the calculation method includes:
  • the computing device obtains the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights;
  • the computing device calls a pre-constructed GRU operator from a pre-packaged function library;
  • the computing device inputs the input data xt, the output data ht-1, and the weights into the pre-constructed GRU operator to obtain an output result ht.
  • the inputting of the input data xt, the output data ht-1, and the weights into the pre-constructed GRU operator to obtain the output result ht specifically includes:
  • before calling the pre-constructed GRU operator from the pre-packaged function library, the method further includes:
  • the computing device obtains the offset.
  • inputting the input data xt, the output data ht-1, and the weights into the operator corresponding to the reset gate in the GRU operator to obtain the output result rt of the reset gate specifically includes:
  • the activation type of the first activation operator is sigmoid;
  • the first summation result is input into the first activation operator for activation, and the output result rt of the reset gate is obtained.
  • inputting the input data xt, the output data ht-1, and the weights into the operator corresponding to the update gate in the GRU operator to obtain the output result zt of the update gate specifically includes:
  • the activation type of the second activation operator is sigmoid;
  • the second summation result is input into the second activation operator for activation, and the output result zt of the update gate is obtained.
  • inputting the input data xt, the output data ht-1, the weights, and the output result rt of the reset gate into the operator corresponding to the current memory gate in the GRU operator
  • to obtain the output result nt of the current memory gate specifically includes:
  • the output data ht-1, the weight, and the offset are input into the sixth multiplication operator to calculate (Whn*ht-1 + bhn) and obtain a sixth operation result, where Whn and bhn are, respectively, the second weight and the second offset corresponding to the current memory gate among the weights and offsets;
  • the sixth operation result and the output result rt of the reset gate are input into the first vector multiplication operator, and the output result rt of the reset gate is multiplied element-wise with the sixth operation result to obtain a first dot-product result;
  • the third summation result is input into the third activation operator for activation to obtain the output result nt of the current memory gate.
  • inputting the output result zt of the update gate, the output result nt of the current memory gate, and the output data ht-1 into the operator corresponding to the output layer in the GRU operator
  • to obtain the output result ht specifically includes:
  • the output result zt of the update gate and the output result nt of the current memory gate are input into the second vector multiplication operator to perform a dot multiplication, obtaining a second dot-product result;
  • the first difference result and the third dot-product result are input into the fourth addition operator for summation to obtain the output result ht (taken together, these operators realize the GRU cell sketched below).
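The decomposition above corresponds to the standard GRU cell; a minimal NumPy reference sketch is given below (the Wir/Whr/... naming follows the text, but the dense-matrix shapes and layout are assumptions, and this is a reference computation rather than the device's data path):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x_t, h_prev, W, b):
    """One GRU step built from the operators described above.
    W and b are dicts keyed 'ir', 'hr', 'iz', 'hz', 'in', 'hn'."""
    r_t = sigmoid(W['ir'] @ x_t + b['ir'] + W['hr'] @ h_prev + b['hr'])            # reset gate
    z_t = sigmoid(W['iz'] @ x_t + b['iz'] + W['hz'] @ h_prev + b['hz'])            # update gate
    n_t = np.tanh(W['in'] @ x_t + b['in'] + r_t * (W['hn'] @ h_prev + b['hn']))    # current memory gate
    h_t = (n_t - z_t * n_t) + z_t * h_prev        # output layer: nt - zt*nt + zt*ht-1
    return h_t

hidden, inp = 4, 3
W = {k: np.random.randn(hidden, inp if k[0] == 'i' else hidden) for k in ('ir', 'hr', 'iz', 'hz', 'in', 'hn')}
b = {k: np.random.randn(hidden) for k in W}
print(gru_cell(np.random.randn(inp), np.zeros(hidden), W, b).shape)   # (4,)
```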
  • the computing device includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit; and the method specifically includes:
  • the controller unit obtains the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights;
  • the controller unit calls a pre-constructed GRU operator from a pre-packaged function library;
  • the controller unit sends the input data xt, the output data ht-1, the weights, and the GRU operator to the main processing circuit;
  • the master processing circuit splits the input data xt into multiple input data blocks, distributes the multiple input data blocks and the output data ht-1 to the slave processing circuits, and broadcasts the weights and some of the operators in the GRU operator to the slave processing circuits;
  • the slave processing circuit inputs the received input data block, the output data ht-1, and the weights into the operator corresponding to the reset gate among the partial operators to obtain an intermediate result of the reset gate, and sends the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the reset gate in the other part of the operators of the GRU operator to obtain the output result rt of the reset gate;
  • the master processing circuit distributes the output result rt of the reset gate to the slave processing circuit;
  • the slave processing circuit inputs the received input data block, the output data ht-1, the weights, and the output result rt into the operator corresponding to the current memory gate among the partial operators to obtain an intermediate result of the current memory gate,
  • and sends the intermediate result of the current memory gate to the main processing circuit; the main processing circuit inputs the intermediate result of the current memory gate into the operator corresponding to the current memory gate in the other part of the operators to obtain the output result nt of the current memory gate;
  • the main processing circuit inputs the output result zt of the update gate, the output result nt of the current memory gate, and the output data ht-1 into the operator corresponding to the output layer in the other part of the operators to obtain the output result ht.
  • when the controller unit obtains the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights, the method further includes: the controller unit obtains the offset and sends the offset to the master processing circuit, and the master processing circuit broadcasts the offset to the slave processing circuits.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and obtaining the intermediate result of the output of the reset gate specifically includes:
  • the multiplication processing circuit inputs the received input data block, the weights, and the offset into the first multiplication operator, performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and performs a sum operation on the product result and the element value in the corresponding position in the offset to obtain a product result; it inputs the received output data ht-1, the weights, and the offset into the second multiplication operator, performs a product operation on the element value in the output data ht-1 and the element value in the corresponding position in the weights, and sums the product result with another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (Wir*xt + bir) of the reset gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (Whr*ht-1 + bhr) of the reset gate;
  • the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate among the partial operators, and Wir, Whr, bir, and bhr are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate among the weights and offsets.
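The property these multiplication and accumulation circuits rely on, namely that summing the per-block products reproduces the full matrix-vector product, can be checked directly. A toy verification is shown below (block count and shapes are arbitrary, and the offset is added once after accumulation, which is a simplifying assumption about where the bias enters):

```python
import numpy as np

hidden, inp, num_blocks = 4, 12, 3
W_ir, b_ir, x_t = np.random.randn(hidden, inp), np.random.randn(hidden), np.random.randn(inp)

# Split the input (and the matching weight columns) into blocks and multiply per block...
x_blocks = np.array_split(x_t, num_blocks)
W_blocks = np.array_split(W_ir, num_blocks, axis=1)
partial_products = [Wb @ xb for Wb, xb in zip(W_blocks, x_blocks)]

# ...then accumulate the partial products and add the offset, as the accumulation circuit does.
accumulated = np.sum(partial_products, axis=0) + b_ir
assert np.allclose(accumulated, W_ir @ x_t + b_ir)   # equals the reset gate's (Wir*xt + bir)
```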
  • the main processing circuit includes an activation processing circuit and an addition processing circuit; obtaining the output result rt of the reset gate specifically includes:
  • the addition processing circuit inputs the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a first summation result;
  • the activation processing circuit inputs the first summation result into the first activation operator, performs a sigmoid activation operation on the first summation result, and obtains the output result rt of the reset gate;
  • the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; obtaining the output intermediate result of the update gate specifically includes:
  • the multiplication processing circuit inputs the received input data block, the weights, and the offset into the third multiplication operator, performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and sums the product result with the element value in the corresponding position in the offset to obtain a product result; it inputs the received output data ht-1, the weights, and the offset into the fourth multiplication operator, performs a product operation on the element value in the output data ht-1 and the element value in the corresponding position in the weights, and sums the product result with another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (Wiz*xt + biz) of the update gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (Whz*ht-1 + bhz) of the update gate;
  • the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate among the partial operators, and Wiz, Whz, biz, and bhz are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the update gate among the weights and offsets.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit;
  • obtaining the output result zt of the update gate specifically includes:
  • the addition processing circuit inputs the input intermediate result and the output intermediate result of the update gate into the second addition operator, performs a sum operation on the input intermediate result and the output intermediate result, and obtains a second summation result;
  • the activation processing circuit inputs the second summation result into the second activation operator, performs a sigmoid activation operation on the second summation result, and obtains the output result zt of the update gate;
  • the second addition operator and the second activation operator are operators corresponding to the update gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and obtaining the intermediate output result of the current memory gate specifically includes:
  • the multiplication processing circuit inputs the received input data block, the weights, and the offset into the fifth multiplication operator, performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and sums the product result with the element value in the corresponding position in the offset to obtain a product result; it inputs the received output data ht-1, the weights, and the offset into the sixth multiplication operator, performs a product operation on the element value in the output data ht-1 and the element value in the corresponding position in the weights, and sums the product result with another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (Win*xt + bin) of the current memory gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (Whn*ht-1 + bhn) of the current memory gate;
  • the multiplication processing circuit inputs the output result rt of the reset gate into the first vector multiplication operator, and performs a dot product operation on the output result rt of the reset gate and the output intermediate result of the current memory gate to obtain a first dot-product result;
  • the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate among the partial operators, and Win, Whn, bin, and bhn are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the current memory gate among the weights and offsets.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit; obtaining the output result nt of the current memory gate specifically includes:
  • the addition processing circuit inputs the input intermediate result of the current memory gate and the first dot-product result into the third addition operator, and performs a sum operation on the input intermediate result of the current memory gate and the first dot-product result to obtain a third summation result;
  • the activation processing circuit inputs the third summation result into the third activation operator, performs a tanh activation operation on the third summation result, and obtains the output result nt of the current memory gate;
  • the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
  • the master processing circuit includes an addition processing circuit
  • the slave processing circuit includes a multiplication processing circuit
  • determining the output result of the output layer specifically includes:
  • the main processing circuit sends the output result zt of the update gate, the output result nt of the current memory gate, and the output data ht-1 to the slave processing circuit;
  • the multiplication processing circuit inputs the received output result zt of the update gate and the output result nt of the current memory gate into the second vector multiplication operator and performs a dot product operation on zt and nt to obtain a second dot-product result; it inputs the received output result zt of the update gate and the output data ht-1 into the third vector multiplication operator and performs a dot product operation on zt and ht-1 to obtain a third dot-product result; and it sends the second dot-product result and the third dot-product result to the main processing circuit;
  • the addition processing circuit inputs the output result nt of the current memory gate and the second dot-product result into the first subtraction operator and performs a subtraction operation on them to obtain a first difference result; it then inputs the third dot-product result and the first difference result into the fourth addition operator and performs a sum operation on them to obtain the output result ht;
  • the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer among the partial operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators (a two-line check of this combination follows).
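In other words, the output layer realizes ht = (nt - zt*nt) + zt*ht-1, the usual element-wise convex combination (1 - zt)*nt + zt*ht-1 of the current memory gate output and the previous hidden state; a quick check with toy vectors (not the device's data path):

```python
import numpy as np

z_t, n_t, h_prev = np.random.rand(4), np.random.randn(4), np.random.randn(4)
h_t = (n_t - z_t * n_t) + z_t * h_prev      # first difference result + third dot-product result
assert np.allclose(h_t, (1 - z_t) * n_t + z_t * h_prev)
```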
  • the main processing circuit includes a conversion processing circuit
  • the conversion processing circuit inputs the output result ht into the shaping operator and the split operator in the other part of the operators, and adjusts the data format of the output result ht to a preset format to obtain the final output result (an illustrative sketch follows).
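A hedged sketch of what such a format adjustment could look like; the target layout (reshape to batch x hidden, then split per sample) is purely an assumption, since the text only specifies "a preset format":

```python
import numpy as np

def to_preset_format(h_t, batch, hidden):
    # Shaping operator: reshape the flat result; split operator: one array per sample.
    reshaped = np.asarray(h_t).reshape(batch, hidden)
    return np.split(reshaped, batch, axis=0)

print(len(to_preset_format(np.arange(8.0), batch=2, hidden=4)))   # 2
```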
  • a computing device is provided, the computing device is used to perform GRU operations, the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer;
  • the computing device is used to obtain the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights;
  • the computing device is used to call a pre-constructed GRU operator from a pre-packaged function library;
  • the computing device is configured to input the input data xt, the output data ht-1, and the weights into the pre-constructed GRU operator to obtain an output result ht.
  • the computing device includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit;
  • the controller unit is used to obtain the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights;
  • the controller unit is used to call a pre-constructed GRU operator from a pre-packaged function library;
  • the controller unit is configured to send the input data xt, the output data ht-1, the weights, and the GRU operator to the main processing circuit;
  • the main processing circuit is used to split the input data xt into multiple input data blocks, distribute the multiple input data blocks and the output data ht-1 to the slave processing circuits, and broadcast the weights and some of the operators in the GRU operator to the slave processing circuits;
  • the slave processing circuit is used to input the received input data block, the output data ht-1, and the weights into the operator corresponding to the reset gate among the partial operators to obtain the intermediate result corresponding to the reset gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the reset gate in the other part of the operators of the GRU operator to obtain the output result rt of the reset gate;
  • the slave processing circuit is used to input the received input data block, the output data ht-1, and the weights into the operator corresponding to the update gate among the partial operators to obtain the intermediate result of the update gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the update gate in the other part of the operators to obtain the output result zt of the update gate;
  • the master processing circuit is used to distribute the output result rt of the reset gate to the slave processing circuit;
  • the slave processing circuit inputs the received input data block, the output data ht-1, the weights, and the output result rt into the operator corresponding to the current memory gate among the partial operators to obtain the intermediate result of the current memory gate,
  • and sends the intermediate result of the current memory gate to the main processing circuit; the main processing circuit inputs the intermediate result of the current memory gate into the operator corresponding to the current memory gate in the other part of the operators to obtain the output result nt of the current memory gate;
  • the main processing circuit is used to input the output result zt of the update gate, the output result nt of the current memory gate, and the output data ht-1 into the operator corresponding to the output layer in the other part of the operators to obtain the output result ht.
  • when acquiring the input data xt of the input layer at time t, the output data ht-1 of the hidden layer of the previous GRU, and the weights, the controller unit is also used to obtain the offset and send the offset to the master processing circuit; the master processing circuit is also used to broadcast the offset to the slave processing circuit.
  • the operation unit includes: a tree module, and the tree module includes: a root port and multiple branch ports, where the root port of the tree module is connected to the master processing circuit, and each of the multiple branch ports of the tree module is connected to one of the slave processing circuits;
  • the tree module is configured to forward the input data blocks, the output data ht-1, the weights, the offset, and the intermediate results between the master processing circuit and the plurality of slave processing circuits.
  • the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit;
  • the branch processing circuit is configured to forward the input data blocks, the output data ht-1, the weights, the offset, and the intermediate results between the master processing circuit and the plurality of slave processing circuits.
  • the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to an adjacent other slave processing circuit, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
  • the K slave processing circuits are used to forward the input data blocks, the output data ht-1, the weights, the offset, and the intermediate results between the master processing circuit and the plurality of slave processing circuits.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; when the output intermediate result of the reset gate is obtained,
  • the multiplication processing circuit is configured to input the received input data block, the weights, and the offset into the first multiplication operator, perform a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and perform a sum operation on the product result and the element value in the corresponding position in the offset to obtain a product result; and to input the received output data ht-1, the weights, and the offset into the second multiplication operator, perform a product operation on the element value in the received output data ht-1 and the element value in the corresponding position in the weights, and perform a sum operation on the product result and another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain an input intermediate result (Wir*xt + bir) of the reset gate, and to perform an accumulation operation on the other product result to obtain an output intermediate result (Whr*ht-1 + bhr) of the reset gate;
  • the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate among the partial operators, and Wir, Whr, bir, and bhr are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate among the weights and offsets.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit; when the output result rt of the reset gate is obtained,
  • the addition processing circuit is configured to input the input intermediate result and the output intermediate result of the reset gate into the first addition operator, perform a sum operation on the input intermediate result and the output intermediate result, and obtain a first summation result;
  • the activation processing circuit is configured to input the first summation result into the first activation operator, perform a sigmoid activation operation on the first summation result, and obtain the output result rt of the reset gate;
  • the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; when the output intermediate result of the update gate is obtained,
  • the multiplication processing circuit is configured to input the received input data block, the weights, and the offset into the third multiplication operator, perform a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and perform a sum operation on the product result and the element value in the corresponding position in the offset to obtain a product result; and to input the received output data ht-1, the weights, and the offset into the fourth multiplication operator, perform a product operation on the element value in the received output data ht-1 and the element value in the corresponding position in the weights, and perform a sum operation on the product result and another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain an input intermediate result (Wiz*xt + biz) of the update gate, and to perform an accumulation operation on the other product result to obtain an output intermediate result (Whz*ht-1 + bhz) of the update gate;
  • the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate among the partial operators, and Wiz, Whz, biz, and bhz are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the update gate among the weights and offsets.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit; when the output result zt of the update gate is obtained,
  • the addition processing circuit is configured to input the input intermediate result and the output intermediate result of the update gate into the second addition operator, and perform a sum operation on the input intermediate result and the output intermediate result to obtain a second summation result;
  • the activation processing circuit is configured to input the second summation result into the second activation operator, perform a sigmoid activation operation on the second summation result, and obtain the output result zt of the update gate;
  • the second addition operator and the second activation operator are operators corresponding to the update gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; when the output intermediate result of the current memory gate is obtained,
  • the multiplication processing circuit is configured to input the received input data block, the weights, and the offset into the fifth multiplication operator, perform a product operation on the element value in the received input data block and the element value in the corresponding position in the weights, and perform a sum operation on the product result and the element value in the corresponding position in the offset to obtain a product result; and to input the received output data ht-1, the weights, and the offset into the sixth multiplication operator, perform a product operation on the element value in the received output data ht-1 and the element value in the corresponding position in the weights, and perform a sum operation on the product result and another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit is used to accumulate the product result to obtain the input intermediate result (Win*xt + bin) of the current memory gate, and to accumulate the other product result to obtain the output intermediate result (Whn*ht-1 + bhn) of the current memory gate;
  • the multiplication processing circuit is configured to input the output result rt of the reset gate into the first vector multiplication operator, and perform a dot product operation on the output result rt of the reset gate and the output intermediate result of the current memory gate to obtain a first dot-product result;
  • the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate among the partial operators, and Win, Whn, bin, and bhn are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the current memory gate among the weights and offsets.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit; when the output result nt of the current memory gate is obtained,
  • the addition processing circuit is configured to input the input intermediate result of the current memory gate and the first dot-product result into the third addition operator, and perform a sum operation on the input intermediate result of the current memory gate and the first dot-product result to obtain a third summation result;
  • the activation processing circuit is configured to input the third summation result into the third activation operator, perform a tanh activation operation on the third summation result, and obtain the output result nt of the current memory gate;
  • the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
  • the master processing circuit includes an addition processing circuit
  • the slave processing circuit includes a multiplication processing circuit
  • the main processing circuit is used to send the output result zt of the update gate, the output result nt of the current memory gate, and the output data ht-1 to the slave processing circuit;
  • the multiplication processing circuit is used to input the received output result zt of the update gate and the output result nt of the current memory gate into the second vector multiplication operator and perform a dot product operation on zt and nt to obtain a second dot-product result; to input the received output result zt of the update gate and the output data ht-1 into the third vector multiplication operator and perform a dot product operation on zt and ht-1 to obtain a third dot-product result; and to send the second dot-product result and the third dot-product result to the main processing circuit;
  • the addition processing circuit is configured to input the output result nt of the current memory gate and the second dot-product result into the first subtraction operator and perform a subtraction operation on them to obtain a first difference result, and to input the third dot-product result and the first difference result into the fourth addition operator and perform a sum operation on them to obtain the output result ht;
  • the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer among the partial operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators.
  • the main processing circuit includes a conversion processing circuit
  • the conversion processing circuit is used to input the output result ht into a shaping operator and a split operator in the other part of the operators, adjust the data format of the output result ht to a preset format, and obtain a final output result.
  • a neural network chip is provided, wherein the neural network chip includes the computing device provided in the ninth aspect.
  • an electronic device including the chip provided in the tenth aspect.
  • FIG. 1-1 is a schematic diagram of an LSTM structure;
  • FIG. 1-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 1-2a is a schematic structural diagram of an arithmetic unit provided by an embodiment of the present application;
  • FIG. 1-3 is a schematic structural diagram of another computing device provided by this application;
  • FIG. 1-3a is a schematic structural diagram of a main processing circuit provided by this application;
  • FIG. 1-4a is a schematic structural diagram of a sending end of a tree module provided by this application;
  • FIG. 1-4b is a schematic structural diagram of a receiving end of a tree module provided by this application;
  • FIG. 1-4c is a schematic diagram of a binary tree structure provided by this application;
  • FIG. 1-5 is a structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 1-6 is a schematic flowchart of an LSTM calculation method provided by an embodiment of the present application;
  • FIG. 1-7 is a structural diagram of a combined processing device provided by an embodiment of the present application;
  • FIG. 1-8 is a structural diagram of another combined processing device provided by an embodiment of the present application;
  • FIG. 1-9 is a schematic structural diagram of a board card provided by an embodiment of the present application;
  • FIG. 2-1 is a schematic diagram of a GRU structure;
  • FIG. 2-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 2-2a is a schematic structural diagram of an arithmetic unit provided by an embodiment of the present application;
  • FIG. 2-3 is a schematic structural diagram of another computing device provided by the present application;
  • FIG. 2-3a is a schematic structural diagram of a main processing circuit provided by the present application;
  • FIG. 2-3b is a schematic structural diagram of a slave processing circuit provided by the present application;
  • FIG. 2-4a is a schematic structural diagram of a sending end of a tree module provided by this application;
  • FIG. 2-4b is a schematic structural diagram of a receiving end of a tree module provided by this application;
  • FIG. 2-4c is a schematic diagram of a binary tree structure provided by this application;
  • FIG. 2-5 is a structural diagram of a computing device provided by an embodiment of the present application;
  • FIG. 2-6 is a schematic flowchart of a GRU calculation method provided by an embodiment of the present application;
  • FIG. 2-7 is a structural diagram of a combined processing device provided by an embodiment of the present application;
  • FIG. 2-8 is a structural diagram of another combined processing device provided by an embodiment of the present application;
  • FIG. 2-9 is a schematic structural diagram of a board card provided by an embodiment of the present application.
  • FIG. 1-1 is a schematic diagram of an LSTM. As shown in FIG. 1-1, the LSTM includes: an input gate, a forget gate, an update status unit, and an output gate.
  • the corresponding calculation formulas are as follows:
  • xt is the input data at time t
  • ht-1 represents the output data at time t-1
  • Wf, Wi, Wg, and Wo represent the weight vectors corresponding to the forget gate, input gate, update status unit, and output gate, respectively.
  • bf, bi, bc, and bo represent the offsets corresponding to the forget gate, input gate, update state unit, and output gate, respectively;
  • ft represents the output of the forget gate, which is multiplied element-wise with the state unit value at time t-1 to selectively forget part of the past state unit value;
  • it represents the output of the input gate, which is multiplied with the candidate state value obtained at time t to selectively add the candidate state value at time t into the update state unit;
  • gt represents the candidate state value calculated at time t;
  • ct represents the new state value obtained by selectively forgetting the state value at time t-1 and selectively adding the candidate state value at time t; ct is used when calculating the final output and is transmitted to the next moment;
  • Ot represents the selection condition for the part of the state unit at time t that needs to be output as the result;
  • ht represents the output at time t, which is also transmitted to the next moment (that is, moment t+1);
  • ⊙ denotes the element-wise product of vectors; σ is the sigmoid function, whose calculation formula is σ(x) = 1/(1+e^(-x)).
  • this application puts Wf, Wi, Wg and Wo into a matrix W, and bf, bi, bc and bo into a matrix b.
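The symbol definitions above describe the standard LSTM cell; a NumPy reference sketch is given below. The concatenated [ht-1, xt] layout, the per-gate matrices, and the toy shapes are assumptions based on the standard formulation, not a description of the device's internal data path:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step over the combined vector [h_{t-1}, x_t].
    W and b are dicts keyed 'f', 'i', 'g', 'o' for the four gates."""
    v = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W['f'] @ v + b['f'])      # forget gate output ft
    i_t = sigmoid(W['i'] @ v + b['i'])      # input gate output it
    g_t = np.tanh(W['g'] @ v + b['g'])      # candidate state value gt
    o_t = sigmoid(W['o'] @ v + b['o'])      # output gate output Ot
    c_t = f_t * c_prev + i_t * g_t          # new state value ct
    h_t = o_t * np.tanh(c_t)                # output ht at time t
    return h_t, c_t

hidden, inp = 4, 3
W = {k: np.random.randn(hidden, hidden + inp) for k in 'figo'}
b = {k: np.random.randn(hidden) for k in 'figo'}
h, c = lstm_cell(np.random.randn(inp), np.zeros(hidden), np.zeros(hidden), W, b)
print(h.shape, c.shape)   # (4,) (4,)
```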
  • Refer to FIG. 1-2, which shows a computing device provided by this application.
  • a computing device is provided for performing LSTM operations.
  • the computing device includes: a controller unit 11, an arithmetic unit 12, and a storage unit 10, wherein the controller unit 11 is connected to the arithmetic unit 12 and the storage unit 10, and the arithmetic unit 12 includes: a master processing circuit 101 and a slave processing circuit 102 (there may be one or more slave processing circuits; multiple slave processing circuits are preferred);
  • the above-mentioned main processing circuit itself includes a memory (for example, a cache or a register);
  • the memory can store some data of the main processing circuit, and the slave processing circuit can optionally also carry such a memory.
  • LSTM includes: input gate, forget gate, output gate and update status gate;
  • the storage unit 10 is used to store the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, output state value Ct;
  • the controller unit 11 is used to obtain the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and to send the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator to the arithmetic unit;
  • the arithmetic unit 12 is used to perform the input gate operation, the forget gate operation, the output gate operation, and the update status gate operation based on the input data Xt, the weight data, the input result ht-1, and the LSTM operator to obtain the output result of each gate, and to obtain the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
  • the aforementioned controller unit is specifically configured to construct a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator based on the LSTM operator;
  • the main processing circuit is specifically used to reorder the input data Xt, the weight data, and the input state value according to the sorting operator, where the weight data includes the weight data of each gate; to separate the weight data of each gate according to the split operator and broadcast the weight data of each gate and the multiplication operator to the slave processing circuit; and to split the input data and the input state value into multiple input data blocks and multiple input state data blocks and distribute the multiple input data blocks and the multiple input state data blocks to the slave processing circuit;
  • the slave processing circuit is configured to multiply the multiple input data blocks with the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, to multiply the multiple input state data blocks with the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and to send the intermediate result of each gate and the state intermediate result of each gate to the main processing circuit;
  • each of the above gates is relatively independent, and its calculation result is also relatively independent; that is, each gate has its own weight data, for example, Wf, Wi, Wg, and Wo represent the weight data of the four gates, respectively.
  • the foregoing multiplying of the multiple input data blocks with the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate may specifically include:
  • the state intermediate result of each gate described above is obtained in a similar way to the intermediate result of each gate and will not be repeated here.
  • the main processing circuit is used to sort the intermediate result of each gate according to the sorting operator to obtain the sorting result of each gate, and to perform an offset operation on the sorting result of each gate according to the addition operator to obtain the calculation result of each gate; to sort the state intermediate result of each gate according to the sorting operator to obtain the state sorting result of each gate, and to perform an offset operation on the state sorting result of each gate according to the addition operator to obtain the state calculation result of each gate; and to add the calculation result of each gate and the state calculation result of each gate correspondingly according to the addition operator and then perform subsequent processing to obtain the output result of each gate.
  • the technical solution provided in this application sets the operation unit to a master-slave structure.
  • the input data at this moment and the output data of the forget gate are split and processed in parallel, so that the master processing circuit and the slave processing circuit can perform parallel operations on the computation-intensive parts, thereby increasing the operation speed, saving operation time, and reducing power consumption.
  • the main processing circuit is specifically configured to multiply the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain a first result, to multiply the output result gt of the update status gate by the output result it of the input gate according to the multiplication operator to obtain a second result, and to add the first result and the second result to obtain the output state value Ct.
  • the main processing circuit is specifically configured to perform an activation operation on the output state value Ct according to the activation operator to obtain an activation result, and multiply the output result Ot of the output gate and the activation result to obtain the output result ht.
•   the subsequent processing is specifically an activation operation, namely the tanh function.
  • the main processing circuit is further configured to use the output data ht as the input result at the next moment and the output state value Ct as the input state value at the next moment.
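•   For reference, the element-wise combination described above (the first result, the second result, the output state value Ct and the output data ht) can be written as the following minimal NumPy sketch; the gate outputs ft, it, gt and ot are assumed to have already been computed and activated.

```python
import numpy as np

def lstm_combine(f_t, i_t, g_t, o_t, c_prev):
    """Minimal sketch of the state update described above:
    Ct = ft * Ct-1 + it * gt and ht = ot * tanh(Ct)."""
    first_result = f_t * c_prev          # forget gate scales the previous state
    second_result = i_t * g_t            # input gate scales the update-status-gate output
    c_t = first_result + second_result   # output state value Ct
    h_t = o_t * np.tanh(c_t)             # output data ht
    return h_t, c_t                      # fed back as ht-1 and Ct-1 at the next moment
```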
•   the above LSTM may contain multiple hidden layers; h is an integer greater than or equal to 2, and the h-th hidden layer may be any intermediate hidden layer in the LSTM, so the LSTM involves multiple LSTM operations. The implementation process is as follows: in the forward operation, the forward operation at the previous time t-1 produces the output result of time t-1; at the current time t, the output result of time t-1 is used as part of the input of the forget gate, and the forget gate uses a sigmoid to determine the passing rate of the output result of time t-1, thereby obtaining the forget-gate output at time t; the input data of the input layer at time t is taken as the other part and is also input to the neuron; the two parts of the input neuron are then respectively multiplied with the weights to obtain two operation results, and the two operation results are added to obtain the output result at time t, which in turn serves as part of the input at the next moment.
  • the above computing device may further include: a direct memory access unit 50, and the storage unit 10 may include: one or any combination of a register and a cache, specifically, the cache is used to store a calculation operator; the register , Used to store the input data and scalar; the cache is a high-speed temporary cache.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an operator storage unit 110, an operator processing unit 111, and a storage queue unit 113;
  • the operator storage unit 110 is configured to store the calculation operator associated with the LSTM operation
  • the operator processing unit 111 is configured to analyze the calculation operator to obtain multiple operation operators
  • the storage queue unit 113 is used to store an operator queue.
  • the operator queue includes: a plurality of operation operators or a plurality of calculation operators to be executed in the order of the queue.
  • controller unit may further include:
•   the dependency processing unit 108 is configured to determine, when there are multiple operation operators, whether the first operation operator has an association relationship with the zeroth operation operator preceding the first operation operator; if the first operation operator has an association relationship with the zeroth operation operator, the first operation operator is cached in the operator storage unit, and after the execution of the zeroth operation operator is completed, the first operation operator is extracted from the operator storage unit and transmitted to the arithmetic unit;
•   said determining whether there is an association relationship between the first operation operator and the zeroth operation operator preceding the first operation operator includes:
•   extracting, according to the first operation operator, the first storage address interval of the data (such as a matrix) required by the first operation operator, and extracting, according to the zeroth operation operator, the zeroth storage address interval of the data required by the zeroth operation operator; if the first storage address interval and the zeroth storage address interval have an overlapping area, it is determined that the first operation operator has an association relationship with the zeroth operation operator; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first operation operator and the zeroth operation operator do not have an association relationship, as sketched below.
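•   A minimal sketch of this address-interval dependency check (using hypothetical half-open [start, end) intervals rather than the patent's exact encoding):

```python
def has_dependency(first_interval, zeroth_interval):
    """Return True when the two storage address intervals overlap, i.e. the first
    operation operator must wait for the zeroth operation operator to finish."""
    f_start, f_end = first_interval    # addresses required by the first operator
    z_start, z_end = zeroth_interval   # addresses required by the zeroth operator
    return f_start < z_end and z_start < f_end

# example: an overlapping area implies an association relationship
assert has_dependency((0x1000, 0x1400), (0x1200, 0x1800))
assert not has_dependency((0x1000, 0x1400), (0x1400, 0x1800))
```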
  • the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
•   the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. It should be noted that the K slave processing circuits shown in Figure 1-3 include only the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column; that is, the K slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
•   the K slave processing circuits are used to forward data (the data may be input data blocks, input state data blocks, intermediate results, state intermediate results, etc.) and operators between the master processing circuit and the multiple slave processing circuits.
  • the main processing circuit may further include: one or any combination of conversion processing circuit 110, activation processing circuit 111, and addition processing circuit 112;
•   the conversion processing circuit 110 is used for data conversion processing, specifically: performing, on the data received by the main processing circuit (including but not limited to the input data Xt, the weight data (the weights of each gate), the input state value Ct-1 and the input result ht-1), the exchange between a first data structure and a second data structure (for example, conversion between continuous data and discrete data, or conversion between floating-point data and fixed-point data).
  • the activation processing circuit 111 is used to execute the activation operation of the data in the main processing circuit
  • the addition processing circuit 112 is used to perform an addition operation or an accumulation operation.
  • the operation operator is a matrix multiplying matrix operator, an accumulation operator, an activation operator, and other calculation operators.
•   the operation unit includes a tree module 40, and the tree module includes a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and the multiple branch ports of the tree module are respectively connected to one slave processing circuit of the multiple slave processing circuits;
•   the tree module has a sending and receiving function; that is, the tree module can perform both a sending function and a receiving function.
  • the tree module is used to forward data between the master processing circuit and the plurality of slave processing circuits (the data may be an input data block, an input state data block, an intermediate result, an intermediate state result, etc.).
•   the tree module is an optional component of the computing device; it may include at least one layer of nodes, each node being a line structure with a forwarding function, and the node itself may not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
•   the tree module may have an n-ary tree structure, for example the binary tree structure shown in Figure 1-4c, or of course a ternary tree structure, where n may be an integer greater than or equal to 2.
  • the specific implementation of the present application does not limit the specific value of the above-mentioned n.
  • the above-mentioned number of layers may also be 2.
•   the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in Figure 1-4c; a sketch of such tree-based gathering follows.
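•   As a rough illustration (assuming n = 2 and purely forwarding nodes), the sketch below shows how partial results from the slave processing circuits could be gathered level by level toward the root port:

```python
def tree_gather(partial_results):
    """Binary-tree gather: leaves hold the slave circuits' partial results and each
    inner node only merges and forwards the payloads of its two children upward."""
    level = [[r] for r in partial_results]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            pair = level[i] + (level[i + 1] if i + 1 < len(level) else [])
            merged.append(pair)          # node forwards the merged payload to its parent
        level = merged
    return level[0]                      # payload delivered to the root port (master)

print(tree_gather(["s0", "s1", "s2", "s3", "s4"]))   # ['s0', 's1', 's2', 's3', 's4']
```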
•   the operation unit may carry a separate buffer, as shown in Figure 1-2a, and may include a neuron buffer unit; the neuron buffer unit 63 buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
  • the operation unit may further include: a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
  • the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIGS. 1-5, where,
•   the branch processing circuit 103 may include a memory. As shown in Figure 1-5, the size of the memory of the branch processing circuit 103 may be between 2 and 2.5 times R, where R is the maximum data capacity that a single slave processing circuit needs to store; in that case the slave processing circuits do not need to be provided with a memory. With a branch processing circuit, only about 2.5*R of memory needs to be provided; without a branch processing circuit, about 4*R would need to be provided and the utilization rate of that memory would still be low, so this structure can effectively reduce the total memory capacity and reduce the cost.
  • the branch processing circuit is used to forward the data between the master processing circuit and the plurality of slave processing circuits (the data may be an input data block, an input state data block, an intermediate result, an intermediate state result, etc.).
•   for an intermediate result, w is the row value of the input data block from which it was computed and i is the column value of the weight column with which the input data block was computed, and the main processing circuit determines that the position of that intermediate result in the operation result of the corresponding gate is (w, i). For example, the intermediate result computed from input data block 1,1 and the first column of the weights is intermediate result 1,1, and the main processing circuit arranges this intermediate result 1,1 in the first row and first column of the operation result of the corresponding gate, as illustrated in the sketch below.
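•   A small sketch of this (w, i) placement rule, under the assumption that the input data blocks are row vectors and that the weights are taken column by column:

```python
import numpy as np

def place_intermediate_results(x_row_blocks, W_gate):
    """The intermediate result computed from input row block w and weight column i
    is placed at position (w, i) of the corresponding gate's operation result."""
    n_rows, n_cols = len(x_row_blocks), W_gate.shape[1]
    result = np.zeros((n_rows, n_cols))
    for w, block in enumerate(x_row_blocks):       # block: one row of the input data
        for i in range(n_cols):                    # i-th column of the gate weights
            result[w, i] = float(block @ W_gate[:, i])
    return result

blocks = [np.random.randn(16) for _ in range(3)]   # 3 input data blocks (rows)
W = np.random.randn(16, 4)                         # gate weights with 4 columns
assert place_intermediate_results(blocks, W).shape == (3, 4)
```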
  • the present application also provides an LSTM operation method.
  • the method is applied to a computing device.
  • the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate.
•   the computing device includes an arithmetic unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, and output state value Ct; the method includes the following steps:
•   Step S601: the controller unit obtains the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and sends the input data Xt, the weight data, the input state value Ct-1, the input result ht-1 and the LSTM operator to the arithmetic unit;
  • Step S602 The arithmetic unit performs input gate operation, forget gate operation, output gate operation, and update status gate operation based on input data Xt, weight data, input result ht-1, and LSTM operator to obtain the output of each gate As a result, the output data ht and the output state value Ct are obtained based on the input state value Ct-1 and the output result of each gate.
•   the arithmetic unit includes: a master processing circuit and a slave processing circuit; the arithmetic unit performing the input gate operation, the forget gate operation, the output gate operation and the update status gate operation according to the input data Xt, the weight data, the input result ht-1 and the LSTM operator to obtain the output result of each gate specifically includes:
  • the controller unit constructs a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator according to the LSTM operator;
•   the main processing circuit reorders the input data Xt, the weight data (which includes the weight data of each gate) and the input state value according to the sorting operator, then separates the weight data of each gate according to the split operator, broadcasts the weight data of each gate and the multiplication operator to the slave processing circuit, splits the input data and the input state value into multiple input data blocks and multiple input state data blocks, and distributes the multiple input data blocks and the multiple input state data blocks to the slave processing circuit;
•   the slave processing circuit multiplies the multiple input data blocks with the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate, multiplies the multiple input state data blocks with the weight data of each gate according to the multiplication operator to obtain the state intermediate result of each gate, and sends the intermediate result of each gate and the state intermediate result of each gate to the main processing circuit;
•   the main processing circuit sorts the intermediate results of each gate according to the sorting operator to obtain the sorting result of each gate, performs an offset operation on the sorting result of each gate according to the addition operator to obtain the operation result of each gate, sorts the state intermediate results of each gate according to the sorting operator to obtain the state sorting result of each gate, and performs an offset operation on the state sorting result of each gate according to the addition operator to obtain the state operation result of each gate; the operation result of each gate and the state operation result of each gate are added correspondingly according to the addition operator and then subjected to subsequent processing to obtain the output result of each gate.
•   obtaining the output state value Ct according to the input state value Ct-1 and the output result of each gate specifically includes:
•   the main processing circuit multiplies the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain the first result, multiplies the output result gt of the update status gate by the output result it of the input gate according to the multiplication operator to obtain the second result, and adds the first result and the second result to obtain the output state value Ct.
•   obtaining the output data ht according to the input state value Ct-1 and the output result of each gate specifically includes:
  • the main processing circuit performs an activation operation on the output state value Ct according to an activation operator to obtain an activation result, and multiplies the output result Ot of the output gate by the activation result to obtain an output result ht.
•   This application also discloses an LSTM device, which includes one or more computing devices mentioned in this application and is used to obtain data to be calculated and control information from other processing devices and perform a specified LSTM operation, and the execution result is passed to peripheral devices through an I/O interface.
  • Peripheral equipment such as camera, monitor, mouse, keyboard, network card, wifi interface, server.
  • the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support the operation of larger-scale convolutional neural network training.
  • the interconnection method can be any interconnection topology.
  • the LSTM device has high compatibility and can be connected to various types of servers through the PCIE interface.
  • the present application also discloses a combined processing device, which includes the above-mentioned LSTM device, universal interconnection interface, and other processing devices.
  • the LSTM computing device interacts with other processing devices to complete the operation specified by the user.
  • Figures 1-7 are schematic diagrams of combined processing devices.
  • processing devices include one or more types of general-purpose/special-purpose processors such as central processing unit CPU, graphics processor GPU, neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the LSTM computing device and external data and control, including data handling, and complete the basic control of starting and stopping the LSTM computing device; other processing devices can also cooperate with the LSTM computing device to complete the computing task.
  • the universal interconnection interface is used to transfer data and control operators between the LSTM device and other processing devices.
•   the LSTM device obtains the required input data from other processing devices and writes it to the storage device on the LSTM device chip; it can obtain control operators from other processing devices and write them to the control buffer on the LSTM device chip; it can also read the data in the storage module of the LSTM device and transmit it to other processing devices.
  • the structure may further include a storage device, which is respectively connected to the LSTM device and the other processing device.
  • the storage device is used to store data stored in the LSTM device and the other processing device, and is particularly suitable for data that cannot be saved in the internal storage of the LSTM device or other processing device.
  • the combined processing device can be used as an SOC on-chip system for mobile phones, robots, drones, video surveillance equipment, etc., effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • the general interconnection interface of the combined processing device is connected to some components of the device. Some components such as camera, monitor, mouse, keyboard, network card, wifi interface.
•   a chip is also disclosed, which includes the above-mentioned LSTM device or combined processing device.
•   a chip packaging structure is also disclosed, which includes the above chip.
•   a board card is also disclosed, which includes the above chip packaging structure.
  • FIG. 1-9 provides a board card.
  • the board card may also include other supporting components.
•   the supporting components include, but are not limited to, a storage device 390, an interface device 391 and a control device 392;
  • the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
  • the storage device may include multiple sets of storage units 393. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of the storage units may be DDR SDRAM (English: Double Data Rate SDRAM, double rate synchronous dynamic random access memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
•   the storage device may include 4 groups of the storage units. Each group of the storage units may include multiple DDR4 granules (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers; of the 72 bits of each controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of the storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
  • each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
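•   A quick back-of-the-envelope check of the 25600 MB/s figure mentioned above (assuming DDR4-3200, i.e. 3200 mega-transfers per second thanks to the double data rate, and the 64-bit data width):

```python
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 MT/s (two transfers per clock cycle)
data_width_bytes = 64 // 8            # 64-bit data path; the 8 ECC bits are excluded
bandwidth_mb_per_s = transfers_per_second * data_width_bytes // 10**6
print(bandwidth_mb_per_s)             # 25600 MB/s per group of storage units
```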
  • a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip packaging structure.
  • the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific expressions of the other interfaces described above, and the interface unit may implement the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
•   the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads; therefore, the chip may be in different working states such as heavy load and light load.
•   the control device can regulate the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.
•   an electronic device is also disclosed, which includes the above-mentioned board card.
•   Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, cloud servers, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
  • the vehicles include airplanes, ships, and/or vehicles;
  • the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and
  • the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
•   Figure 2-1 is a schematic diagram of a GRU provided by an embodiment of the present application.
•   the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate and an output layer; the input layer is connected to the reset gate, the update gate and the current memory gate.
•   the hidden layer of the previous GRU unit is connected to the reset gate, update gate, current memory gate and output layer of the current GRU unit.
•   GRU is a variant of LSTM (Long Short-Term Memory network).
•   the output result z t of the update gate in Figure 2-1 is used to control the degree to which the state information of the previous moment is brought into the current state, and the output result r t of the reset gate is used to control the degree to which the state information of the previous moment is written into the current memory gate.
  • FIG. 2-2 is a computing device provided by an embodiment of the present application.
  • the computing device is used to perform GRU operations.
•   the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate and an output layer;
•   the computing device is used to obtain the input data x t input at the input layer at time t, the output data h t-1 of the hidden layer of the previous GRU, and the weights;
  • the computing device is used to call a pre-constructed GRU operator from a pre-packaged function library
  • the computing device is configured to input input data x t , output data h t-1 , and weights into the pre-constructed GRU operator to obtain an output result h t .
•   the technical solution provided by this application pre-compiles the GRU calculation process into corresponding operators, so that the GRU operation is performed on the MLU without the CPU having to perform instruction decoding and data memory accesses, which increases the GRU operation speed and improves operation efficiency.
  • the computing device when inputting the input data x t , the output data h t-1 and the weight value into the pre-constructed GRU operator, and obtaining the output result h t is specifically used for:
•   for the first GRU unit, the input output data h t-1 is a preset initial value; when the GRU is a multi-layer GRU, the input output data h t-1 is an initialized vector.
•   the main processing circuit inputs the output result h t of the current layer into the shaping operator and the split operator to obtain the final output result; therefore, the output data h t-1 of the hidden layer of the previous GRU that is received by the GRU of the current layer has essentially already been split into a plurality of output data blocks. The main processing circuit therefore does not need to perform a split operation on the output data h t-1; it only needs to distribute the received output data h t-1 to the corresponding slave processing circuits to carry out the calculation process of the current layer of the GRU.
  • the operator is a mapping from one function space to another function space.
  • the machine learning processor MLU is applied to machine learning operations, where machine learning operations include neural network operations, k-means operations, support vector machine operations, etc.
•   the machine learning processor MLU may specifically include one or a combination of an NPU (Neural-network Processing Unit), a DSP (Digital Signal Processor), and a field programmable gate array (FPGA) chip.
•   each operation process of the GRU is compiled into its corresponding operator in advance; the multiple compiled operators are pre-packaged in a function library.
•   the corresponding GRU operator is retrieved from the pre-packaged function library through a function interface, the input data is input into the retrieved GRU operator, the operation process corresponding to the GRU operator is executed, and the output result is obtained, as sketched below.
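•   A simplified sketch of such a pre-packaged function library (a plain Python dict stands in for the MLU-side library; the names and parameter dict are illustrative assumptions, and the weights follow the GRU equations listed next):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# each operation process is "compiled" into an operator ahead of time
FUNCTION_LIBRARY = {
    "gru_reset_gate": lambda x, h, p: sigmoid(x @ p["W_ir"] + p["b_ir"] + h @ p["W_hr"] + p["b_hr"]),
    "gru_update_gate": lambda x, h, p: sigmoid(x @ p["W_iz"] + p["b_iz"] + h @ p["W_hz"] + p["b_hz"]),
}

def call_operator(name, *args):
    """Function interface: retrieve a pre-constructed operator by name and execute it."""
    return FUNCTION_LIBRARY[name](*args)
```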
•   the operation of the GRU is as follows:
•   r t = sigmoid(W ir *x t + b ir + W hr *h t-1 + b hr );
•   z t = sigmoid(W iz *x t + b iz + W hz *h t-1 + b hz );
•   n t = tanh(W in *x t + b in + r t ⊙(W hn *h t-1 + b hn ));
•   h t = (1-z t )⊙n t + z t ⊙h t-1 .
  • x t is the input data at time t
  • h t-1 is the output data of the hidden layer input of the previous GRU
  • r t is the output of the reset gate
  • z t is the output of the update gate
  • n t is the current memory gate
•   h t represents the output at time t
•   W r , W z and W n represent the weights corresponding to the reset gate, update gate and current memory gate respectively
•   b r , b z and b n represent the offsets corresponding to the reset gate, update gate and current memory gate respectively
•   W ir , W hr , b ir and b hr are the first weight, the second weight, the first offset and the second offset corresponding to the reset gate respectively
•   W iz , W hz , b iz and b hz are the first weight, the second weight, the first offset and the second offset corresponding to the update gate respectively
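•   Putting the four formulas above together, a minimal NumPy sketch of one GRU step could look as follows (weight shapes are assumed such that W*x is written as x @ W; this is an illustration, not the MLU implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU time step following the equations above; p holds the per-gate weights
    and offsets W_ir, b_ir, W_hr, b_hr, W_iz, b_iz, W_hz, b_hz, W_in, b_in, W_hn, b_hn."""
    r_t = sigmoid(x_t @ p["W_ir"] + p["b_ir"] + h_prev @ p["W_hr"] + p["b_hr"])          # reset gate
    z_t = sigmoid(x_t @ p["W_iz"] + p["b_iz"] + h_prev @ p["W_hz"] + p["b_hz"])          # update gate
    n_t = np.tanh(x_t @ p["W_in"] + p["b_in"] + r_t * (h_prev @ p["W_hn"] + p["b_hn"]))  # current memory gate
    return (1.0 - z_t) * n_t + z_t * h_prev                                              # output h_t
```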
•   each step of the GRU operation process is implemented in this application by constructing operators. If the weights and offsets were spliced (concatenated) into vectors, then every time a calculation operator is called it would be necessary to split the spliced weights and offsets to obtain the weights and offsets required by that operator; this unnecessary splicing and splitting process affects the operation speed. Therefore, in this application, after the weights and offsets are acquired, they are split in advance into the weight and offset blocks corresponding to the reset gate, the update gate and the current memory gate, and identification information corresponding to the respective gate, the input data x t and the output data h t-1 is added to each weight and offset block. When calculating the output of each gate, the weight and offset corresponding to that gate are queried according to the identification information and are directly calculated with the corresponding input data and output data, which ensures that the GRU operation is performed on the MLU while increasing the GRU operation speed and improving operation efficiency.
•   constructing the operator corresponding to the reset gate specifically includes: constructing the first multiplication operator (W ir *x t +b ir ), the second multiplication operator (W hr *h t-1 +b hr ), the first addition operator, which is used to sum the output results of the first multiplication operator and the second multiplication operator, and the first activation operator, which is used to activate the output result of the first addition operator to obtain the output result r t of the reset gate, the activation type of the first activation operator being sigmoid;
•   constructing the operator corresponding to the update gate specifically includes: constructing the third multiplication operator (W iz *x t +b iz ), the fourth multiplication operator (W hz *h t-1 +b hz ), the second addition operator, which is used to sum the outputs of the third multiplication operator and the fourth multiplication operator, and the second activation operator, which is used to activate the output result of the second addition operator to obtain the output result z t of the update gate, the activation type of the second activation operator being sigmoid;
•   constructing the operator corresponding to the current memory gate specifically includes: constructing the fifth multiplication operator (W in *x t +b in ), the sixth multiplication operator (W hn *h t-1 +b hn ), the first vector multiplication operator r t ⊙(W hn *h t-1 +b hn ), which is used to perform a dot product of the output result of the sixth multiplication operator with r t , the third addition operator, which is used to sum the outputs of the fifth multiplication operator and the first vector multiplication operator, and the third activation operator, which is used to activate the output result of the third addition operator to obtain the output result n t of the current memory gate, the activation type of the third activation operator being tanh;
•   constructing the operator corresponding to the output layer specifically includes: constructing the second vector multiplication operator, which performs a dot product on z t and n t to calculate z t ⊙n t , the first subtraction operator, which is used to subtract the second dot product result from the output result n t of the current memory gate to obtain the first difference result, the third vector multiplication operator, which performs a dot product on z t and the output data h t-1 , and the fourth addition operator, which is used to sum the first difference result and the third dot product result to obtain the output result h t ; an example of composing such operators for the reset gate is sketched below.
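•   To illustrate the operator-composition idea (each arithmetic step as its own small operator; the Python names are illustrative only), the reset gate from the list above could be assembled like this:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# each step of the reset gate as its own operator, as enumerated above
first_mult  = lambda x_t, W_ir, b_ir: x_t @ W_ir + b_ir        # (W_ir*x_t + b_ir)
second_mult = lambda h_prev, W_hr, b_hr: h_prev @ W_hr + b_hr  # (W_hr*h_t-1 + b_hr)
first_add   = lambda a, b: a + b                               # first addition operator
first_act   = sigmoid                                          # activation type: sigmoid

def reset_gate(x_t, h_prev, W_ir, b_ir, W_hr, b_hr):
    """Compose the four operators to obtain the reset-gate output r_t."""
    return first_act(first_add(first_mult(x_t, W_ir, b_ir),
                               second_mult(h_prev, W_hr, b_hr)))
```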
•   before calling the pre-constructed GRU operator from the pre-packaged function library, the computing device is also used to obtain the offsets.
•   the calculation device is specifically configured to: acquire the first multiplication operator, the second multiplication operator, the first addition operator and the first activation operator corresponding to the reset gate in the GRU operator, the activation type of the first activation operator being sigmoid; input the input data x t , the weight and the offset into the first multiplication operator and calculate (W ir *x t +b ir ) to obtain the first operation result, where W ir and b ir are respectively the first weight and the first offset corresponding to the reset gate among the weights and the offsets; input the output data h t-1 , the weight and the offset into the second multiplication operator and calculate (W hr *h t-1 +b hr ) to obtain the second operation result, where W hr and b hr are respectively the second weight and the second offset corresponding to the reset gate among the weights and the offsets; input the first operation result and the second operation result into the first addition operator for summation to obtain the first summation result; and input the first summation result into the first activation operator for activation to obtain the output result r t of the reset gate.
•   the calculation device is specifically configured to: obtain the third multiplication operator, the fourth multiplication operator, the second addition operator and the second activation operator corresponding to the update gate in the GRU operator, the activation type of the second activation operator being sigmoid; input the input data x t , the weight and the offset into the third multiplication operator and calculate (W iz *x t +b iz ) to obtain the third operation result, where W iz and b iz are respectively the first weight and the first offset corresponding to the update gate among the weights and the offsets; input the output data h t-1 , the weight and the offset into the fourth multiplication operator and calculate (W hz *h t-1 +b hz ) to obtain the fourth operation result, where W hz and b hz are respectively the second weight and the second offset corresponding to the update gate among the weights and the offsets; input the third operation result and the fourth operation result into the second addition operator for summation to obtain the second summation result; and input the second summation result into the second activation operator for activation to obtain the output result z t of the update gate.
•   the computing device is specifically configured to: obtain the fifth multiplication operator, the sixth multiplication operator, the first vector multiplication operator, the third addition operator and the third activation operator corresponding to the current memory gate in the GRU operator, the activation type of the third activation operator being tanh; input the input data x t , the weight and the offset into the fifth multiplication operator and calculate (W in *x t +b in ) to obtain the fifth operation result, where W in and b in are respectively the first weight and the first offset corresponding to the current memory gate among the weights and the offsets; input the output data h t-1 , the weight and the offset into the sixth multiplication operator and calculate (W hn *h t-1 +b hn ) to obtain the sixth operation result, where W hn and b hn are respectively the second weight and the second offset corresponding to the current memory gate among the weights and the offsets; input the sixth operation result and the output result r t of the reset gate into the first vector multiplication operator to perform a dot product and obtain the first dot product result; input the fifth operation result and the first dot product result into the third addition operator for summation to obtain the third summation result; and input the third summation result into the third activation operator for activation to obtain the output result n t of the current memory gate.
•   the computing device is specifically configured to: obtain the second vector multiplication operator, the first subtraction operator, the third vector multiplication operator and the fourth addition operator corresponding to the output layer in the GRU operator; input the output result z t of the update gate and the output result n t of the current memory gate into the second vector multiplication operator to perform a dot product and obtain the second dot product result; input the output result n t of the current memory gate and the second dot product result into the first subtraction operator and perform a subtraction operation to obtain the first difference result; input the output result z t of the update gate and the output data h t-1 into the third vector multiplication operator and perform a dot product operation to obtain the third dot product result; and input the first difference result and the third dot product result into the fourth addition operator for summation to obtain the output result h t .
  • the foregoing computing device specifically includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit;
  • the controller unit is used to obtain the input data x t of the input layer at time t , the output data h t-1 of the hidden layer input of the previous GRU, and the weight value;
  • the controller unit is used to call a pre-constructed GRU operator from a pre-packaged function library
  • the controller unit is configured to send input data x t , output data h t-1 , weight and GRU operator to the main processing circuit;
•   the main processing circuit is used to split the input data x t into multiple input data blocks, split the output data h t-1 into multiple output data blocks, distribute the multiple input data blocks and output data blocks to the slave processing circuit, and broadcast the weights and some of the operators in the GRU operator to the slave processing circuit;
•   the slave processing circuit is used to input the received input data block, output data h t-1 and weight into the operator corresponding to the reset gate among some of the operators to obtain the intermediate result corresponding to the reset gate, and send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the reset gate in another part of the operators in the GRU operator to obtain the output result r t of the reset gate;
•   the slave processing circuit is used to input the received input data block, output data h t-1 and weight into the operator corresponding to the update gate among some of the operators to obtain the intermediate result of the update gate, and send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the update gate in another part of the operators to obtain the output result z t of the update gate;
  • the master processing circuit is used to distribute the output result r t of the reset gate to the slave processing circuit;
•   the slave processing circuit inputs the received input data block, output data h t-1 , weight and output result r t into the operator corresponding to the current memory gate among some of the operators to obtain the intermediate result of the current memory gate;
•   the intermediate result of the current memory gate is sent to the main processing circuit, and the main processing circuit inputs the intermediate result of the current memory gate into the operator corresponding to the current memory gate in another part of the operators to obtain the output result n t of the current memory gate;
•   the main processing circuit is used to input the output result z t of the update gate, the output result n t of the current memory gate and the output data h t-1 into the operators corresponding to the output layer in another part of the operators to obtain the output result h t .
  • the foregoing computing device may further include: a storage unit 10 and a direct memory access unit 50.
  • the storage unit 10 may include: one or any combination of registers and caches. Specifically, the cache is used to store calculation instructions; The register is used to store the input data and scalar; the cache is a high-speed temporary storage cache.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
  • the instruction storage unit 110 is used to store GRU operators associated with GRU operations
  • the instruction processing unit 111 is configured to parse the GRU operator to obtain multiple GRU operators;
  • the storage queue unit 113 is used to store an instruction queue including multiple GRU operators to be executed in the order of the queue.
  • the above register may be an off-chip memory. Of course, in actual application, it may also be an on-chip memory for storing data.
  • the data may specifically be multi-dimensional (more than 2 dimensions) data.
  • controller unit may further include:
•   the dependency processing unit 108 is configured to determine, when there are multiple GRU operators, whether the first GRU operator has an association relationship with the zeroth GRU operator preceding the first GRU operator; if the first GRU operator has an association relationship with the zeroth GRU operator, the first GRU operator is cached in the instruction storage unit, and after the execution of the zeroth GRU operator is completed, the first GRU operator is extracted from the instruction storage unit and transmitted to the arithmetic unit;
•   the determining whether the first GRU operator has an association relationship with the zeroth GRU operator preceding the first GRU operator includes:
•   extracting, according to the first GRU operator, the first storage address interval of the data (such as a matrix) required by the first GRU operator, and extracting, according to the zeroth GRU operator, the zeroth storage address interval of the data required by the zeroth GRU operator; if the first storage address interval overlaps with the zeroth storage address interval, it is determined that the first GRU operator and the zeroth GRU operator have an association relationship; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first GRU operator and the zeroth GRU operator do not have an association relationship.
  • the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
•   the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. It should be noted that the K slave processing circuits shown in Figure 2-3 include only the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column; that is, the K slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
  • K slave processing circuits are used for forwarding of input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
  • the main processing circuit 101 may further include one or any combination of a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
•   the conversion processing circuit 110 is used to perform conversion processing on data. Specifically, before the GRU operation is performed, the conversion processing circuit 110 is used to obtain the shaping operator and the split operator received by the main processing circuit 101, and to adjust the input data x t , the output data h t-1 , the weight and the offset received by the main processing circuit 101 into a preset four-dimensional tensor format, that is, to perform the exchange between the first data structure and the second data structure (for example, conversion between continuous data and discrete data); when the output result h t is obtained, the output result h t is input into the shaping operator and the split operator in another part of the operators, and the output result h t is adjusted into the preset format (i.e. the four-dimensional tensor format) to obtain the final output result.
  • the activation processing circuit 111 is used to execute the activation operation of the data in the main processing circuit
  • the addition processing circuit 112 is used to perform an addition operation or an accumulation operation.
  • the slave processing circuit 102 may further include: one or any combination of the multiplication processing circuit 120 and the accumulation processing circuit 121;
•   the multiplication processing circuit 120 is used to perform multiplication operations on the data of the slave processing circuit, such as vector-vector dot multiplication, matrix-matrix dot multiplication, matrix-matrix convolution, matrix-vector convolution, and the like;
  • the accumulation processing circuit 121 is used to perform an accumulation operation.
•   the calculation instructions to be executed in the GRU operator include a matrix-by-matrix multiplication instruction, an accumulation instruction, an activation instruction, and other calculation instructions.
•   the operation unit includes a tree module 40, and the tree module includes a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and the multiple branch ports of the tree module are respectively connected to one slave processing circuit of the multiple slave processing circuits;
  • the above tree module has a sending and receiving function, as shown in Figure 2-4a, the tree module is the sending function, and as shown in Figure 2-4b, the tree module is the receiving function.
  • the tree module is configured to forward the input data block, output data h t-1 , weight, offset, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
•   the tree module is an optional component of the computing device; it may include at least one layer of nodes, each node being a line structure with a forwarding function, and the node itself may not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
•   the tree module may have an n-ary tree structure, for example the binary tree structure shown in Figure 2-4c, or of course a ternary tree structure, where n may be an integer greater than or equal to 2.
  • the specific implementation of the present application does not limit the specific value of n, and the number of layers may also be 2.
•   the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example the nodes of the last layer shown in Figure 2-4c.
•   the arithmetic unit 12 may carry a separate buffer, as shown in Figure 2-2a, and may include a neuron buffer unit; the neuron buffer unit 63 buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
  • the operation unit may further include: a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
  • the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIGS. 2-5, where,
•   the branch processing circuit 103 may include a memory. As shown in Figure 2-5, the size of the memory of the branch processing circuit 103 may be between 2 and 2.5 times R, where R is the maximum data capacity that a single slave processing circuit needs to store; in that case the slave processing circuits do not need to be provided with a memory. With a branch processing circuit, only about 2.5*R of memory needs to be provided; without a branch processing circuit, about 4*R would need to be provided and the utilization rate of that memory would still be low, so this structure can effectively reduce the total memory capacity and reduce the cost.
  • the branch processing circuit is configured to forward input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
•   for an intermediate result, w is the row value of the input data block from which it was computed and i is the column value of the weight column with which the input data block was computed, and the main processing circuit determines that the position of that intermediate result in the hidden layer output result is (w, i). For example, the intermediate result computed from input data block 1,1 and the first column of the weights is intermediate result 1,1, and the main processing circuit arranges this intermediate result 1,1 in the first row and first column of the hidden layer output result.
•   the multiplication processing circuit 120 is used to input the received input data block, the weight and the offset into the first multiplication operator, perform a product operation between the element values in the received input data block and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the element values at the corresponding positions in the offset to obtain a product result; it is also used to input the received output data h t-1 , the weight and the offset into the second multiplication operator, perform a product operation between the element values in the received output data h t-1 and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the other element values at the corresponding positions in the offset to obtain another product result;
•   the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result (W ir *x t +b ir ) of the reset gate, and to accumulate the other product result to obtain the output intermediate result (W hr *h t-1 +b hr ) of the reset gate;
•   the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate among some of the operators, and W ir , W hr , b ir and b hr are respectively the first weight, the second weight, the first offset and the second offset corresponding to the reset gate among the weights and the offsets.
•   the addition processing circuit 112 is used to input the input intermediate result and the output intermediate result of the reset gate into the first addition operator and perform a summation operation on the input intermediate result and the output intermediate result to obtain the first summation result;
  • the activation processing circuit 111 is used to input the first summation result into the first activation operator, perform a sigmoid activation operation on the first summation result, and get a reset The output r t of the gate;
  • the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
•   the multiplication processing circuit 120 is used to input the received input data block, the weight and the offset into the third multiplication operator, perform a product operation between the element values in the received input data block and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the element values at the corresponding positions in the offset to obtain a product result; it is also used to input the received output data h t-1 , the weight and the offset into the fourth multiplication operator, perform a product operation between the element values in the received output data h t-1 and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the other element values at the corresponding positions in the offset to obtain another product result;
•   the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result (W iz *x t +b iz ) of the update gate, and to accumulate the other product result to obtain the output intermediate result (W hz *h t-1 +b hz ) of the update gate;
•   the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate among some of the operators, and W iz , W hz , b iz and b hz are respectively the first weight, the second weight, the first offset and the second offset corresponding to the update gate among the weights and the offsets.
•   the addition processing circuit 112 is configured to input the input intermediate result and the output intermediate result of the update gate into the second addition operator and perform a summation operation on the input intermediate result and the output intermediate result to obtain the second summation result;
  • the activation processing circuit 111 is used to input the second summation result into the second activation operator, perform a sigmoid activation operation on the second summation result, and obtain the output result of the update gate z t
  • the second addition operator and the second activation operator are the operators corresponding to the update gate in another part of the operators.
•   the multiplication processing circuit 120 is used to input the received input data block, the weight and the offset into the fifth multiplication operator, perform a product operation between the element values in the received input data block and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the element values at the corresponding positions in the offset to obtain a product result; it is also used to input the received output data h t-1 , the weight and the offset into the sixth multiplication operator, perform a product operation between the element values in the received output data h t-1 and the element values at the corresponding positions in the weight, and perform a sum operation between the products and the other element values at the corresponding positions in the offset to obtain another product result;
•   the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result (W in *x t +b in ) of the current memory gate, and to accumulate the other product result to obtain the output intermediate result (W hn *h t-1 +b hn ) of the current memory gate;
•   the fifth multiplication operator, the sixth multiplication operator and the first vector multiplication operator are the operators corresponding to the current memory gate among some of the operators, and W in , W hn , b in and b hn are respectively the first weight, the second weight, the first offset and the second offset corresponding to the current memory gate among the weights and the offsets.
•   the addition processing circuit 112 is used to input the input intermediate result of the current memory gate and the first dot product result into the third addition operator and perform a sum operation on the input intermediate result of the current memory gate and the first dot product result to obtain the third summation result;
•   the activation processing circuit 111 is used to input the third summation result into the third activation operator, perform a tanh activation operation on the third summation result, and obtain the output result n t of the current memory gate;
  • the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
•   the multiplication processing circuit 120 is used to input the output result z t of the update gate and the output result n t of the current memory gate into the second vector multiplication operator, perform a dot product on the output result z t of the update gate and the output result n t of the current memory gate to obtain the second dot product result, input the received output result z t of the update gate and the output data h t-1 into the third vector multiplication operator, perform a dot product on the output result z t of the update gate and the output data h t-1 to obtain the third dot product result, and send the second dot product result and the third dot product result to the main processing circuit 101; the addition processing circuit 112 is used to input the output result n t of the current memory gate and the second dot product result into the first subtraction operator, perform a subtraction operation on the output result n t of the current memory gate and the second dot product result to obtain the first difference result, input the third dot product result and the first difference result into the fourth addition operator, and perform a summation to obtain the output result h t .
•   the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer among some of the operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in another part of the operators.
  • the present application also provides a GRU calculation method.
  • the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer.
•   the calculation method is applied to a computing device, and the calculation method includes:
•   Step S601: the computing device obtains the input data x t input at the input layer at time t, the output data h t-1 of the hidden layer of the previous GRU, and the weights.
  • Step S602 The computing device calls a pre-constructed GRU operator from a pre-packaged function library.
  • Step S603 The computing device inputs the input data x t , the output data h t-1 , and the weight value into the pre-constructed GRU operator to obtain an output result h t .
  • the input data x t , output data h t-1 , and weights are input into the pre-constructed GRU operator, and the output result h t specifically includes:
  • the method before calling the pre-constructed GRU operator from the pre-packaged function library, the method further includes:
  • the computing device obtains the offset.
  • the input data x t , output data h t-1 , and weights are input to the operator corresponding to the reset gate in the GRU operator, and the output result r t of the reset gate specifically includes :
  • the activation type of the first activation operator is sigmoid
  • the first summation result is input into the first activation operator for activation, and the output result r t of the reset gate is obtained.
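The reset-gate steps listed above are abridged in this extract; taken together with the corresponding description, the sequence is first multiplication operator → second multiplication operator → first addition operator → first activation operator (sigmoid). The toy "pre-packaged function library" below is only meant to illustrate that call pattern; the dictionary keys and lambdas are assumptions, not the library's actual API.

```python
import numpy as np

# a toy pre-packaged "function library" of pre-constructed operators (illustrative only)
operator_library = {
    "first_multiplication": lambda x, W, b: W @ x + b,       # W_ir*x_t + b_ir
    "second_multiplication": lambda h, W, b: W @ h + b,      # W_hr*h_{t-1} + b_hr
    "first_addition": lambda a, b: a + b,
    "first_activation": lambda s: 1.0 / (1.0 + np.exp(-s)),  # activation type: sigmoid
}

rng = np.random.default_rng(2)
x_t, h_prev = rng.standard_normal(4), rng.standard_normal(3)
W_ir, W_hr = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))
b_ir, b_hr = rng.standard_normal(3), rng.standard_normal(3)

first = operator_library["first_multiplication"](x_t, W_ir, b_ir)
second = operator_library["second_multiplication"](h_prev, W_hr, b_hr)
r_t = operator_library["first_activation"](operator_library["first_addition"](first, second))
print(r_t)  # output result of the reset gate
```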
  • the input data x t , output data h t-1 , and the weight value are input to the operator corresponding to the update gate in the GRU operator, and the output result z t of the update gate specifically includes:
  • the activation type of the second activation operator is sigmoid
  • the second summation result is input into the second activation operator for activation, and the output result z t of the update gate is obtained.
  • the input data x t , the output data h t-1 , the weight value, and the output result r t of the reset gate are input to the operator corresponding to the current memory gate in the GRU operator to obtain the output result n t of the current memory gate, which specifically includes:
  • the output data h t-1 , the weight and the offset are input to the sixth multiplication operator, and (W hn *h t-1 +b hn ) is calculated to obtain a sixth operation result, where W hn and b hn are respectively the second weight and the second offset corresponding to the current memory gate among the weights and offsets;
  • the sixth operation result and the output result r t of the reset gate are input to the first vector multiplication operator, and a dot product is performed on the output result r t of the reset gate and the sixth operation result to obtain the first dot product result;
  • the third summation result is input to the third activation operator for activation to obtain the current memory gate output result n t .
  • inputting the output result z t of the update gate, the output result n t of the current memory gate, and the output data h t-1 into the operator corresponding to the output layer in the GRU operator to obtain the output result h t specifically includes the following steps (also sketched in code below):
  • the output result z t of the update gate and the output result n t of the current memory gate are input into the second vector multiplication operator to perform a dot product, obtaining a second dot product result;
  • the output result n t of the current memory gate and the second dot product result are input into the first subtraction operator to perform a subtraction operation, obtaining a first difference result;
  • the output result z t of the update gate and the output data h t-1 are input into the third vector multiplication operator to perform a dot product, obtaining a third dot product result;
  • the first difference result and the third dot product result are input into the fourth addition operator for summation to obtain the output result h t .
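As a check that these four output-layer operators compose to the familiar GRU blend h t = (1 - z t)·n t + z t·h t-1, here is a minimal sketch with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
z_t, n_t, h_prev = rng.random(3), rng.random(3), rng.random(3)

second_dot = z_t * n_t            # second vector multiplication operator
first_diff = n_t - second_dot     # first subtraction operator: n_t - z_t*n_t = (1 - z_t)*n_t
third_dot = z_t * h_prev          # third vector multiplication operator
h_t = first_diff + third_dot      # fourth addition operator

assert np.allclose(h_t, (1.0 - z_t) * n_t + z_t * h_prev)
print(h_t)
```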
  • the computing device specifically includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit; and the method specifically includes:
  • the controller unit obtains the input data x t of the input layer at time t , the output data h t-1 of the hidden layer input of the previous GRU, and the weight value;
  • the controller unit calls a pre-constructed GRU operator from a pre-packaged function library
  • the controller unit sends input data x t , output data h t-1 , weight and GRU operator to the main processing circuit;
  • the main processing circuit splits the input data x t into multiple input data blocks, splits the output data h t-1 into multiple output data blocks, distributes the multiple input data blocks and the multiple output data blocks to the slave processing circuit, and broadcasts the weights and some operators in the GRU operator to the slave processing circuit; the slave processing circuit inputs the received input data block, output data h t-1 and weights into the operator corresponding to the reset gate in some operators to obtain an intermediate result of the reset gate, and sends the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the reset gate in another part of the operators in the GRU operator to obtain the output result r t of the reset gate (a simplified sketch of this split–distribute–broadcast flow is given below);
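A simplified sketch of the split–distribute–broadcast flow just described: the master splits x t into blocks, each simulated slave multiplies its block by the matching columns of a broadcast weight and returns a partial result, and the master accumulates the partials and adds the offset. The even split and the function names are assumptions for illustration only.

```python
import numpy as np

def split_into_blocks(vec, num_slaves):
    # master side: split the input data x_t into one block per slave processing circuit
    return np.array_split(vec, num_slaves)

def slave_partial(w_cols, x_block):
    # slave side: multiply the received block with the matching weight columns and accumulate
    return w_cols @ x_block

rng = np.random.default_rng(1)
hidden, in_dim, num_slaves = 3, 8, 4
W_ir = rng.standard_normal((hidden, in_dim))   # broadcast to every slave
b_ir = rng.standard_normal(hidden)
x_t = rng.standard_normal(in_dim)

x_blocks = split_into_blocks(x_t, num_slaves)
col_blocks = np.array_split(np.arange(in_dim), num_slaves)

# each slave returns a partial product; the master sums them and adds the offset
partials = [slave_partial(W_ir[:, cols], blk) for cols, blk in zip(col_blocks, x_blocks)]
input_intermediate = np.sum(partials, axis=0) + b_ir

assert np.allclose(input_intermediate, W_ir @ x_t + b_ir)
print(input_intermediate)  # intermediate result of the reset gate, (W_ir*x_t + b_ir)
```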
  • the master processing circuit distributes the output result r t of the reset gate to the slave processing circuit
  • the slave processing circuit inputs the received input data block, output data h t-1 , weight value, and output result r t into the operator corresponding to the current memory gate in some operators to obtain the intermediate result of the current memory gate.
  • the intermediate result of the current memory gate is sent to the main processing circuit, and the main processing circuit inputs the intermediate result of the current memory gate to the operator corresponding to the current memory gate in another part of the operators to obtain the output result of the current memory gate n t ;
  • the main processing circuit inputs the output result z t of the update gate, the output result n t of the current memory gate, and the output data h t-1 into the operator corresponding to the output layer in another part of the operators to obtain the output result h t .
  • the method further includes: the controller unit obtains the offset and sends the offset to the master processing circuit; the master processing circuit broadcasts the offset to the slave processing circuit.
  • the input output data h -1 is a preset initialization value
  • the GRU is a multi-layer GRU
  • the input output data h t-1 is an initialized vector.
  • the operation unit includes: a tree module, and the tree module includes: a root port and multiple branch ports; the root port of the tree module is connected to the main processing circuit, and the multiple branch ports of the tree module are respectively connected to one of the multiple slave processing circuits;
  • the tree module forwards input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
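To make the forwarding role of the tree module concrete, the following is a small sketch of a tree interconnect whose root attaches to the master processing circuit and whose leaves stand in for slave processing circuits. The node class, the binary branching, and the toy slave functions are all illustrative assumptions; the document only requires an n-ary tree of forwarding-only nodes.

```python
class TreeNode:
    """A forwarding-only node: it has no compute capability of its own."""
    def __init__(self, children=None, slave_fn=None):
        self.children = children or []   # branch ports
        self.slave_fn = slave_fn         # set only on leaf nodes (slave circuits)

    def broadcast(self, payload):
        # forward the payload (weights, offsets, data blocks) toward the leaves,
        # then gather the intermediate results back toward the root port
        if self.slave_fn is not None:
            return [self.slave_fn(payload)]
        results = []
        for child in self.children:
            results.extend(child.broadcast(payload))
        return results

# four slave circuits hanging off a two-level binary tree; each just scales its input
leaves = [TreeNode(slave_fn=lambda p, k=k: k * p) for k in range(4)]
root = TreeNode(children=[TreeNode(children=leaves[:2]), TreeNode(children=leaves[2:])])
print(root.broadcast(10))  # [0, 10, 20, 30]
```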
  • the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit;
  • the branch processing circuit forwards the input data block, output data h t-1 , weight, offset, and intermediate result between the master processing circuit and the plurality of slave processing circuits.
  • the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to an adjacent other slave processing circuit, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
  • the K slave processing circuits forward the input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the multiple slave processing circuits (a small enumeration of these K circuits follows below).
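The following snippet simply enumerates, for an m x n array of slave processing circuits, the K circuits described above as directly connected to the master (the first row, the m-th row, and the first column); m and n are free illustrative parameters.

```python
def directly_connected_slaves(m, n):
    # K slave circuits: row 0, row m-1, and column 0 of the m x n array
    connected = set()
    for col in range(n):
        connected.add((0, col))        # n slave circuits in the first row
        connected.add((m - 1, col))    # n slave circuits in the m-th row
    for row in range(m):
        connected.add((row, 0))        # m slave circuits in the first column
    return sorted(connected)

k_circuits = directly_connected_slaves(m=4, n=5)
print(len(k_circuits), k_circuits)
# any other slave circuit reaches the master by hopping through adjacent slaves
```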
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and the obtaining the intermediate output of the reset gate specifically includes:
  • the multiplication processing circuit inputs the received input data block and the weight and offset into the first multiplication operator, and performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weight , And perform the sum operation on the product result and the element value of the corresponding position in the offset to obtain the product result; input the received output data h t-1 and the weight and offset into the second multiplication operator, and The element value in the output data h t-1 and the element value at the corresponding position in the weight perform a product operation, and the product result is summed with another element value at the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result of the reset gate (W ir *x t +b ir ), and accumulates another product result to obtain an output intermediate result of the reset gate (W hr *h t-1 +b hr );
  • the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate in some operators, and W ir , W hr , b ir and b hr are, among the weights and offsets, the first weight, the second weight, the first offset and the second offset corresponding to the reset gate, respectively.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit;
  • the obtaining the output result r t of the reset gate specifically includes:
  • the addition processing circuit inputs the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a first summation result;
  • the activation processing circuit inputs the first summation result into the first activation operator, performs a sigmoid activation operation on the first summation result, and obtains an output result r t of the reset gate;
  • the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and the intermediate output result of the update gate specifically includes:
  • the multiplication processing circuit inputs the received input data block and the weight and offset to the third multiplication operator, and performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weight, The product result and the element value of the corresponding position in the offset are summed to obtain the product result; the received output data h t-1 and the weight and offset are input to the fourth multiplication operator. Perform the product operation on the element value in the output data h t-1 and the element value in the corresponding position in the weight, and perform a sum operation on the product result and another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result of the update gate (W iz *x t +b iz ), and performs an accumulation operation on another product result to obtain an output intermediate result of the update gate (W hz *h t-1 +b hz );
  • the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate in some operators, and W iz , W hz , b iz and b hz are, among the weights and offsets, the first weight, the second weight, the first offset and the second offset corresponding to the update gate, respectively.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit;
  • the obtained output result z t of the update gate specifically includes:
  • the addition processing circuit inputs the input intermediate result and the output intermediate result of the update gate into the second addition operator, performs a sum operation on the input intermediate result and the output intermediate result, and obtains a second summation result;
  • the activation processing circuit inputs the second summation result into the second activation operator, performs a sigmoid activation operation on the second summation result, and obtains the output result z t of the update gate ;
  • the second addition operator and the second activation operator are operators corresponding to the update gate in another part of the operators.
  • the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; the obtaining the intermediate result of the current memory gate output specifically includes:
  • the multiplication processing circuit inputs the received input data block, weight and offset to the fifth multiplication operator, and performs a product operation on the element value in the received input data block and the element value in the corresponding position in the weight, And sum the product result with the element value of the corresponding position in the offset to obtain the product result; input the received output data h t-1 , the weight and the offset into the sixth multiplication operator, and Perform the product operation on the element value in the output data h t-1 and the element value in the corresponding position in the weight, and perform a sum operation on the product result and another element value in the corresponding position in the offset to obtain another product result;
  • the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (W in *x t +b in ) of the current memory gate, and performs an accumulation operation on another product result to obtain an output intermediate result of the current memory gate (W hn *h t-1 +b hn );
  • the multiplication processing circuit inputs the output result r t of the reset gate into the first vector multiplication operator, and performs a dot product operation on the output result r t of the reset gate and the output intermediate result of the current memory gate to obtain the first dot product result;
  • the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate in some operators, and W in , W hn , b in and b hn are, among the weights and offsets, the first weight, the second weight, the first offset and the second offset corresponding to the current memory gate, respectively.
  • the main processing circuit includes an activation processing circuit and an addition processing circuit;
  • the obtaining the output result n t of the current memory gate specifically includes:
  • the addition processing circuit inputs the input intermediate result of the current memory gate and the first dot product result into the third addition operator, and performs a sum operation on the input intermediate result of the current memory gate and the first dot product result to obtain a third summation result;
  • the activation processing circuit inputs the third summation result into the third activation operator, performs a tanh activation operation on the third summation result, and obtains the output result n t of the current memory gate;
  • the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
  • the master processing circuit includes an addition processing circuit
  • the slave processing circuit includes a multiplication processing circuit
  • determining the output result of the output layer specifically includes:
  • the main processing circuit sends the output result z t of the update gate, the output result n t of the current memory gate, and the output data h t-1 to the slave processing circuit;
  • the multiplication processing circuit inputs the output result z t of the update gate and the output result n t of the current memory gate into the second vector multiplication operator, performs a dot product on the output result z t of the update gate and the output result n t of the current memory gate to obtain a second dot product result, inputs the received update gate output result z t and output data h t-1 into the third vector multiplication operator, performs a dot product on the update gate output result z t and the output data h t-1 to obtain a third dot product result, and sends the second dot product result and the third dot product result to the main processing circuit;
  • the addition processing circuit inputs the current memory gate output result n t and the second dot product result into the first subtraction operator, performs a subtraction operation on the current memory gate output result n t and the second dot product result to obtain a first difference result, inputs the third dot product result and the first difference result into the fourth addition operator, and performs a summation on the third dot product result and the first difference result to obtain the output result h t ;
  • the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer in some operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in another part of the operators.
  • the main processing circuit includes a conversion processing circuit
  • the conversion processing circuit inputs the output result h t to the shaping operator and the split operator in another part of the operators, and adjusts the data format of the output result h t to a preset format to obtain the final output result.
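A hedged sketch of the conversion processing circuit's reshape/split step: the later description mentions adjusting data to a preset four-dimensional tensor format, so the target shape below assumes an (N, C, H, W)-style layout, which is an illustrative choice rather than the documented preset format.

```python
import numpy as np

def to_preset_format(h_t, batch=1, channels=1):
    # reshape the flat output result h_t into a 4-D tensor (N, C, H, W)-style layout
    h_t = np.asarray(h_t)
    return h_t.reshape(batch, channels, 1, h_t.size)

h_t = np.arange(6, dtype=np.float32)      # toy output result of the output layer
print(to_preset_format(h_t).shape)        # (1, 1, 1, 6)
```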
  • This application also discloses a GRU device, which includes one or more of the computing devices mentioned in this application, and is used to obtain data to be computed and control information from other processing devices, perform the specified GRU operation, and pass the execution result to peripheral devices through an I/O interface.
  • Peripheral equipment such as camera, monitor, mouse, keyboard, network card, wifi interface, server.
  • the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support larger-scale convolutional neural network training operations.
  • the interconnection method can be any interconnection topology.
  • the GRU device has high compatibility, and is connected with various types of servers through a PCIE interface.
  • the present application also discloses a combined processing device, which includes the above-mentioned GRU device, general interconnection interface, and other processing devices.
  • the GRU computing device interacts with other processing devices to complete the operation specified by the user.
  • Figure 2-7 is a schematic diagram of the combined processing device.
  • Other processing devices include one or more types of general-purpose/special-purpose processors such as central processing unit CPU, graphics processor GPU, neural network processor.
  • the number of processors included in other processing devices is not limited.
  • Other processing devices serve as the interface between the GRU computing device and external data and control, including data handling, and complete the basic control of starting and stopping the GRU computing device; other processing devices can also cooperate with the GRU computing device to complete the computing task.
  • the general interconnection interface is used to transfer data and control instructions between the GRU device and other processing devices.
  • the GRU device obtains required input data from other processing devices and writes it to the storage device on the GRU device chip; it can obtain control instructions from other processing devices and write them to the control buffer on the GRU device chip; it can also read the data in the storage module of the GRU device and transmit it to other processing devices.
  • the structure may further include a storage device, which is respectively connected to the GRU device and the other processing device.
  • the storage device is used to store data stored in the GRU device and the other processing device, and is particularly suitable for data that cannot be saved in the internal storage of the GRU device or other processing device.
  • the combined processing device can be used as an SOC on-chip system for mobile phones, robots, drones, video surveillance equipment, etc., effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
  • the general interconnection interface of the combined processing device is connected to some components of the device. Some components such as camera, monitor, mouse, keyboard, network card, wifi interface.
  • In some embodiments, a chip is also disclosed, which includes the above-mentioned GRU device or combined processing device.
  • In some embodiments, a chip packaging structure is disclosed, which includes the above chip.
  • In some embodiments, a board card is disclosed, which includes the above chip packaging structure.
  • FIG. 2-9 provides a board card.
  • the board card may also include other supporting components.
  • the supporting components include but are not limited to: a storage device 390, an interface device 391, and a control device 392;
  • the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
  • the storage device may include multiple groups of storage units 393. Each group of the storage units is connected to the chip through a bus. It can be understood that each group of the storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
  • DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
  • the storage device may include 4 sets of the storage unit. Each group of the memory cells may include multiple DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers. Among the 72-bit DDR4 controllers, 64 bits are used for data transmission, and 8 bits are used for ECC check. It can be understood that when DDR4-3200 particles are used in each group of the memory cells, the theoretical bandwidth of data transmission can reach 25600MB/s.
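The 25600 MB/s figure follows directly from the stated configuration, as the short check below shows; it assumes "DDR4-3200" means 3200 mega-transfers per second and that only the 64 data bits (not the 8 ECC bits) count toward payload bandwidth.

```python
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 MT/s per storage unit group
data_bits_per_transfer = 64           # 64 of the 72 controller bits carry data; 8 are ECC
bytes_per_transfer = data_bits_per_transfer // 8
bandwidth_mb_s = transfers_per_second * bytes_per_transfer / 10**6
print(bandwidth_mb_s)  # 25600.0 MB/s, matching the theoretical figure quoted above
```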
  • each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transfer data twice in one clock cycle.
  • a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
  • the interface device is electrically connected to the chip in the chip packaging structure.
  • the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
  • the interface device may also be other interfaces.
  • the present application does not limit the specific expressions of the other interfaces described above, and the interface unit may implement the transfer function.
  • the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
  • the control device is electrically connected to the chip.
  • the control device is used to monitor the state of the chip.
  • the chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a microcontroller (Micro Controller Unit, MCU).
  • MCU Micro Controller Unit
  • the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the chip may be in different working states such as multiple loads and light loads.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the chip.

Abstract

This application provides a computing device and a board card. The computing device is used to perform LSTM operations. The board card includes: a storage device, an interface device, a control device, and a neural network chip, where the neural network chip includes the computing device; the storage device is used to store data; the interface device is used to implement data transmission between the chip and external devices; and the control device is used to monitor the state of the chip. The computing device provided by this application has the advantage of low power consumption.

Description

计算装置及板卡 技术领域
本申请涉及神经网络领域,尤其涉及一种计算装置及板卡。
背景技术
长短时间记忆网络(LSTM)是一种时间递归神经网络(RNN),由于网络本身独特的结构设计,LSTM适合于处理和预测时间序列中间隔和延时非常长的重要事件。相比于传统的递归神经网络,LSTM网络表现出更好的性能,它非常适合从经验中学习,以便在重要事件之间具有未知大小时间之后时,对时间序列进行分类、处理和预测。目前,在语音识别、视频描述、机器翻译和音乐自动合成等诸多领域,LSTM网络被广泛应用。
现有的LSTM网络基于通用处理器实现,现有的处理器执行LSTM运算的能耗高。
发明内容
本申请提供一种计算方法及相关产品,可提升LSTM的处理速度,节省功耗。
第一方面,提供一种所述计算装置用于执行LSTM运算,所述LSTM包括:输入门、忘记门、输出门和更新状态门,所述计算装置包括:运算单元、控制器单元、存储单元;
所述存储单元,用于存储LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;
所述控制器单元,用于获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
所述运算单元,用于依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
可选的,所述运算单元包括:主处理电路以及从处理电路;
所述控制器单元,具体用于根据LSTM算子构建多个拆分算子、多个排序算子、乘法算子、激活算子以及加法算子;
所述主处理电路,具体用于依据排序算子将输入数据Xt、权值数据以及输入状态值进行重排序,所述权值数据包括:各个门的权值数据,然后依据拆分算法将各个门的权值数据以及乘法算子广播至从处理电路,将输入数据以及输入状态值拆分成多个输入数据块以及多个输入状态数据块,将多个输入数据块以及多个输入状态数据块分发给所述从处理电路;
所述从处理电路,用于依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果,依据乘法算子将所述多个输入状态数据块与各个门的权值数据执行乘法运算得到各个门的状态中间结果,将各个门的中间结果以及各个门的状态中间结果发送至主处理电路;
所述主处理电路,用于依据排序算子将每个门的中间结果排序得到各个门的排序结果,依据加法算子将各个门的排序结果执行偏置运算得到各个门的运算结果,依据排序算子将每个状态中间结果排序得到各个门的状态排序结果,依据加法算子将各个门的状态排序结果执行偏置运算得到各个门的状态运算结果;依据加法算子将各个门的运算结果以及各个门的状态运算结果对应相加后进行后续处理得到各个门的输出结果。
可选的,所述主处理电路,具体用于依据乘法算子将输入状态值Ct-1与忘记门的输出结果ft相乘得到第一结果,依据乘法算子将更新状态门的输出结果gt与输入门的输出结果it相乘得到第二结果,将第一结果与第二结果相加得到输出状态值Ct。
可选的,所述主处理电路,具体用于依据激活算子对输出状态值Ct执行激活运算得到激活结果,将输出门的输出结果Ot与激活结果相乘得到输出结果ht。
可选的,所述后续处理具体包括:
如为忘记门、输入门和输出门,所述后续处理为sigmoid运算;
如为更新状态门,所述后续处理为运算激活tanh函数。
可选的,所述主处理电路,还用于将输出数据ht作为下一时刻的输入结果,将输出状态值Ct作为下一时刻的输入状态值。
可选的,如所述从处理电路的数量为多个,所述运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
可选的,如所述从处理电路的数量为多个,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路,
所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
可选的,如所述从处理电路的数量为多个,所述多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,所述主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个基础电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路;
所述K个从处理电路,用于转发所述主处理电路以及多个从处理电路之间的数据以及算子。
可选的,所述主处理电路包括:转换处理电路;
所述转换处理电路,用于对数据执行转换处理,具体为:将主处理电路接收的数据执行第一数据结构与第二数据结构之间的互换。
可选的,所述从处理电路包括:乘法处理电路和累加处理电路;
所述乘法处理电路,用于对接收到的输入数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的乘积结果;接收到的输入状态数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的另一乘积结果;
所述累加处理电路,用于对该各个门的乘积结果执行累加运算得到各个门的中间结果,将该各个门的另一乘积结果执行累加运算得到各个门的状态中间结果。
可选的,所述树型模块为n叉树结构,所述n为大于等于2的整数。
第二方面,本申请实施例提供了一种LSTM运算装置,所述LSTM运算装置包括一个或多个第一方面提供的计算装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的LSTM运算,将执行结果通过I/O接口传递给其他处理装置;
当所述LSTM装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行连接并传输数据;
其中,多个所述计算装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的LSTM的运算;多个所述计算装置共享同一控制系统或拥有各自的控制系统;多个所述计算装置共享内存或者拥有各自的内存;多个所述计算装置的互联方式是任意互联拓扑。
第三方面,提供一种组合处理装置,所述组合处理装置包括第二方面的LSTM运算装置,通用互联接口和其他处理装置;
所述LSTM运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作。
第四方面,提供一种神经网络芯片,神经网络芯片包括第一方面提供的计算装置或第二方面提供的LSTM运算装置或第三方面提供的组合处理装置。
第五方面,提供一种电子设备,所述电子设备包括如第四方面提供的芯片。
第六方面,提供一种板卡,所述板卡包括:存储器件、接口装置和控制器件以及第四方面提供的神经网络芯片;
其中,所述神经网络芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
所述存储器件,用于存储数据;
所述接口装置,用于实现所述芯片与外部设备之间的数据传输;
所述控制器件,用于对所述芯片的状态进行监控。
第七方面,本申请实施例还提供一种LSTM运算方法,所述LSTM包括:所述LSTM包括:输入门、忘记门、输出门和更新状态门,所述计算装置包括:运算单元、控制器单元、存储单元;所述存储单元存储:LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;
所述方法包括如下步骤:
所述控制器单元获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
所述运算单元依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
在一些实施例中,所述电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
在一些实施例中,所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
第八方面,提供一种门控循环单元GRU的运算方法,所述GRU包括:输入层、隐层、重置门、更新门、当前记忆门和输出层,所述运算方法应用于计算装置,所述运算方法包括:
所述计算装置获取输入层t时刻输入的输入数据x t、前一个GRU的隐层输入的输出数据h t-1和权值;
所述计算装置从预先封装的函数库中调用预先构造的GRU算子;
所述计算装置将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t
在一种可选的方案中,所述将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t具体包括:
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t
将输入数据x t、输出数据h t-1、权值以及重置门的的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t
在一种可选的方案中,在从预先封装的函数库中调用预先构造的GRU算子之前,所述方法还包括:
所述计算装置获取偏置。
在一种可选的方案中,所述将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t具体包括:
获取所述GRU算子中与重置门对应的第一乘法算子、第二乘法算子、第一加法算子以及第一激活算子,所述第一激活算子的激活类型为sigmoid;
将输入数据x t、权值以及偏置输入到所述第一乘法算子中,计算(W ir*x t+b ir),得到第一运算结果,W ir和b ir为权值和偏置中分别与重置门对应的第一权值和第一偏置;
将输出数据h t-1、权值以及偏置输入到所述第二乘法算子中,计算(W hr*h t-1+b hr),得到第二运算结果,W hr和b hr为权值和偏置中分别与重置门对应的第二权值和第二偏置;
将所述第一运算结果和所述第二运算结果输入到所述第一加法算子中求和,得到第一求和结果;
将所述第一求和结果输入到所述第一激活算子中激活,得到重置门的输出结果r t
在一种可选的方案中,所述将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t具体包括:
获取所述GRU算子中与更新门对应的第三乘法算子、第四乘法算子、第二加法算子以及第二激活算子,所述第二激活算子的激活类型为sigmoid;
将输入数据x t、权值以及偏置输入到所述第三乘法算子中,计算(W iz*x t+b iz),得到第三运算结果,其中,W ir和b ir为权值和偏置中分别与更新门对应的第一权值和第一偏置;
将输出数据h t-1、权值以及偏置输入到所述第四乘法算子中,计算(W hz*h t-1+b hz),得到第四运算结果,其中,W hz和b hz为权值和偏置中分别与更新门对应的第二权值和第二偏置;
将所述第三运算结果和所述第四运算结果输入到所述第二加法算子中,得到第二求和结果;
将所述第二求和结果输入到所述第二激活算子中激活,得到更新门的输出结果z t
在一种可选的方案中,所述将输入数据x t、输出数据h t-1、权值以及重置门的的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t具体包括:
获取所述GRU算子中与当前记忆门对应的第五乘法算子、第六乘法算子、第一向量乘法算子、第三加法算子、第三激活算子,所述第三激活算子的激活类型为tanh;
将输入数据x t、权值以及偏置输入到所述第五乘法算子,计算(W in*x t+b in),得到第五运算结果,其中,W in和b in为权值和偏置中分别与当前记忆门对应的第一权值和第一偏置;
将输出数据h t-1、权值以及偏置输入到所述第六乘法算子,计算和(W hn*h t-1+b hn),得到第六运算结果,其中,W hn和b hn为权值和偏置中分别与当前记忆门对应的第二权值和第二偏置;
将所述第六运算结果以及重置门的输出结果r t输入到所述第一向量乘法算子,对重置门的输出数据r t与所述第六运算结果进行点乘,得到第一点乘结果;
将所述第一点乘结果与所述第五运算结果输入到所述第三加法算子中求和,得到第三求和结果;
将所述第三求和结果输入到所述第三激活算子激活,得到当前记忆门的输出结果n t
在一种可选的方案中,所述将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t具体包括:
获取所述GRU算子中与输出层对应的第二向量乘法算子、第一减法算子、第三向量乘法算子、第四加法算子;
将更新门的输出结果z t以及当前记忆门的输出结果n t输入到所述第二向量乘法算子,进行点乘运算,得到第二点乘结果;
将所述当前记忆门的输出结果n t以及所述第二点乘结果输入到所述第一减法算子,执行减法运算,得到第一差值结果;
将更新门的输出结果z t和输出数据h t-1输入到所述第三向量乘法算子,进行点乘运算,得到第三点乘结果;
将所述第一差值结果和所述第三点乘结果输入到所述第四加法算子求和,得到输出结果h t
在一种可选的方案中,所述计算装置包括:运算单元以及控制器单元;所述运算单元包括:主处理电路和至少一个从处理电路;所述方法具体包括:
所述控制器单元获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值;
所述控制器单元从预先封装的函数库中调用预先构造的GRU算子;
所述控制器单元将输入数据x t、输出数据h t-1、权值以及GRU算子发送给所述主处理电路;
所述主处理电路将输入数据x t拆分为多个输入数据块,将多个输入数据块、输出数据h t-1分发给从处理电路,将权值以及GRU算子中的部分算子广播给从处理电路;
从处理电路将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与重置门对应的算子中,得到重置门的中间结果,将该中间结果发送给主处理电路,主处理电路将该中间结果输入到GRU算子中的另一部分算子中与重置门对应的算子中,得到重置门的输出结果r t
从处理电路将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中重置门对应的算子中,得到重置门的中间结果,将该中间结果发送给主处理电路,所述主处理电路将该中间结果输入到另一部分算子中与重置门对应的算子中,得到重置门的输出结果r t
所述主处理电路将重置门的输出结果r t分发给从处理电路;
从处理电路将接收到的输入数据块、输出数据h t-1、权值、输出结果r t输入到部分算子中与当前记忆门对应的算子中,得到当前记忆门的中间结果,将当前记忆门的中间结果发送给主处理电路,所述主处理电路将当前记忆门的中间结果输入到另一部分算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
所述主处理电路将更新门的输出结果z t、当前记忆门的输出结果n t、输出数据h t-1输入到另一部分算子与输出层对应的算子中,得到输出结果h t
在一种可选的方案中,如所述控制器单元获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值时,所述方法还包括:所述控制器单元获取偏置,将偏置发送给所述主处理电路;所述主处理电路将偏置广播给从处理电路。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到重置门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块以及权值和偏置输入到第一乘法算子中,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第二乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将所述乘积结果进行累加运算,得到重置门的输入中间结果(W ir*x t+b ir),将另一乘积结果进行累加运算,得到重置门的输出中间结果(W hr*h t-1+b hr);
其中,第一乘法算子、第二乘法算子为部分算子中与重置门对应的算子,W ir、W hr、b ir、和b hr为权值和偏置中分别与重置门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;所述得到重置门的输出结果r t具体包括:
所述加法处理电路将重置门的输入中间结果和输出中间结果输入到第一加法算子中,对输入中间结果和输出中间结果执行求和运算,得到第一求和结果;
所述激活处理电路将第一求和结果输入到第一激活算子中,对第一求和结果执行sigmoid激活运算,得到重置门的输出结果r t
第一加法算子、第一激活算子为另一部分算子中与重置门对应的算子。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到更新门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块以及权值和偏置输入到第三乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第四乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将该乘积结果进行累加运算,得到更新门的输入中间结果(W iz*x t+b iz),将另一乘积结果进行累加运算,得到重置门的输出中间结果(W hz*h t-1+b hz);
其中,第三乘法算子、第四乘法算子为部分算子中与更新门对应的算子,W ir、W hz、b ir和b hz为权值和偏置中分别与更新门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;所述得到更新门的输出结果z t具体包括:
所述加法处理电路将更新门的输入中间结果和输出中间输入到第二加法算子中,对该输入中间结果和输出中间执行求和运算,得到第二求和结果;
所述激活处理电路将第二求和结果输入到第二激活算子中,对第二求和结果执行sigmoid激活运算,得到更新门的输出结果z t
第二加法算子、第二激活算子为另一部分算子中与更新门对应的算子。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到当前记忆门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块、权值和偏置输入到第五乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1、权值和偏置输入到第六乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将该乘积结果进行累加运算,得到当前记忆门的输入中间结果(W in*x t+b in),将另一乘积结果进行累加运算,得到当前记忆门的输出中间结果(W nz*h t-1+b nz);
所述乘法处理电路将重置门的输出结果r t输入到第一向量乘法算子中,对重置门的输出结果r t与当前记忆门的输出中间结果执行点乘运算,得到第一点乘结果;
其中,第五乘法算子、第六乘法算子、第一向量乘法算子为部分算子中与当前记忆门对应的算子,W in、W hn、b in和b hn为权值和偏置中分别与当前记忆门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;所述得到当前记忆门的输出结果n t具体包括:
所述加法处理电路将当前记忆门的输入中间结果和第一点乘结果输入到第三加法算子中,对当前记忆门的输入中间结果和第一点乘结果执行求和运算,得到第三求和结果;
所述激活处理电路将第三求和结果输入到第三激活算子中,对第三求和结果执行tanh激活运算,得到当前记忆门的输出结果n t
第三加法算子、第三激活算子为另一部分算子中与当前记忆门对应的算子。
在一种可选的方案中,所述主处理电路包括加法处理电路,所述从处理电路包括乘法处理电路;所述确定输出层的输出结果具体包括:
所述主处理电路将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1发送给从处理电路;
所述乘法处理电路将更新门的输出结果z t以及当前记忆门的输出结果n t输入到第二向量乘法算子中,对更新门的输出结果z t以及当前记忆门的输出结果n t执行点乘运算,得到第二点乘结果,将接收到的更新门的输出结果z t以及输出数据h t-1输入到第三向量乘法算子中,对更新门的输出结果z t以及输出数据h t-1执行点乘运算,得到第三点乘结果,将第二点乘结果和第三点乘结果发送给主处理电路;
所述加法处理电路将当前记忆门的输出结果n t以及第二点乘结果输入到第一减法算子中,对当前记忆门的输出结果n t以及第二点乘结果执行减法运算,得到第一差值结果,将第三点乘结果以及第一差值结果输入到第四加法算子,对第三点乘结果以及第一差值结果执行求和运算,得到输出结果h t
其中,第二向量乘法算子、第三向量乘法算子为部分算子中与输出层对应的算子,第一减法算子、第四加法算子为另一部分算子中与输出层对应的算子。
在一种可选的方案中,所述主处理电路包括转换处理电路;
所述转换处理电路将输出结果h t输入到另一部分算子中的整形算子和拆分算子中,将输出结果h t的数据格式调整为预设格式,得到最终输出结果。
第九方面,提供一种计算装置,所述计算装置用于执行GRU的运算,所述GRU包括:输入层、隐层、重置门、更新门、当前记忆门和输出层;
所述计算装置,用于获取输入层t时刻输入的输入数据x t、前一个GRU的隐层输入的输出数据h t-1和权值;
所述计算装置,用于从预先封装的函数库中调用预先构造的GRU算子;
所述计算装置,用于将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t
在一种可选的方案中,所述计算装置,在将输入数据x t、输出数据h t-1、权值输入到所述预先构造 的GRU算子中,得到输出结果h t时,具体用于:
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t
将输入数据x t、输出数据h t-1、权值以及重置门的的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t
在一种可选的方案中,所述计算装置包括:运算单元以及控制器单元;所述运算单元包括:一个主处理电路和至少一个从处理电路;
所述控制器单元,用于获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值;
所述控制器单元,用于从预先封装的函数库中调用预先构造的GRU算子;
所述控制器单元,用于将输入数据x t、输出数据h t-1、权值以及GRU算子发送给所述主处理电路;
所述主处理电路,用于将输入数据x t拆分为多个输入数据块、将输出数据h t-1拆分为多个输出数据h t-1,将多个输入数据块、输出数据h t-1分发给从处理电路,将权值以及GRU算子中的部分算子广播给从处理电路;
从处理电路,用于将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与重置门对应的算子中,得到重置门对应的中间结果,将该中间结果发送给主处理电路,主处理电路将该中间结果输入到GRU算子中的另一部分算子中与重置门对应的算子中,得到重置门的输出结果r t
从处理电路,用于将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与更新门对应的算子中,得到更新门的中间结果,将该中间结果发送给主处理电路,所述主处理电路将该中间结果输入到另一部分算子中与更新门对应的算子中,得到更新门的输出结果z t
所述主处理电路,用于将重置门的输出结果r t分发给从处理电路;
从处理电路将接收到的输入数据块、输出数据h t-1、权值、输出结果r t输入到部分算子中与当前记忆门对应的算子中,得到当前记忆门的中间结果,将当前记忆门的中间结果发送给主处理电路,所述主处理电路将当前记忆门的中间结果输入到另一部分算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
所述主处理电路,用于将更新门的输出结果z t、当前记忆门的输出结果n t、输出数据h t-1输入到另一部分算子与输出层对应的算子中,得到输出结果h t
在一种可选的方案中,所述控制器单元,如获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值时,所述控制器单元,还用于获取偏置,将偏置发送给所述主处理电路;所述主处理电路,还用于将偏置广播给从处理电路。
在一种可选的方案中,如从处理电路的数量为多个,所述运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
在一种可选的方案中,如从处理电路的数量为多个,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路;
所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
在一种可选的方案中,如从处理电路的数量为多个,所述多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,所述主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个基础电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路;
所述K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果的转发。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;如得到重置门的输出中间结果,
所述乘法处理电路,用于将接收到的输入数据块以及权值和偏置输入到第一乘法算子中,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1、权值和偏置输入到第二乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路,用于将所述乘积结果进行累加运算,得到重置门的输入中间结果(W ir*x t+b ir),将另一乘积结果进行累加运算,得到重置门的输出中间结果(W hr*h t-1+b hr);
其中,第一乘法算子、第二乘法算子为部分算子中与重置门对应的算子,W ir、W hr、b ir、和b hr为权值和偏置中分别与重置门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;在得到重置门的输出结果r t时,
所述加法处理电路,用于将重置门的输入中间结果和输出中间结果输入到第一加法算子中,对输入中间结果和输出中间结果执行求和运算,得到第一求和结果;
所述激活处理电路,用于将第一求和结果输入到第一激活算子中,对第一求和结果执行sigmoid激活运算,得到重置门的输出结果r t
第一加法算子、第一激活算子为另一部分算子中与重置门对应的算子。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;在得到更新门的输出中间结果时,
所述乘法处理电路,用于将接收到的输入数据块以及权值和偏置输入到第三乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第四乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路,用于将该乘积结果进行累加运算,得到更新门的输入中间结果(W iz*x t+b iz),将另一乘积结果进行累加运算,得到重置门的输出中间结果(W hz*h t-1+b hz);
其中,第三乘法算子、第四乘法算子为部分算子中与更新门对应的算子,W ir、W hz、b ir和b hz为权值和偏置中分别与更新门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;在得到更新门的输出结果z t时,
所述加法处理电路,用于将更新门的输入中间结果和输出中间结果输入到第二加法算子中,对该输入中间结果和输出中间结果执行求和运算,得到第二求和结果;
所述激活处理电路,用于将第二求和结果输入到第二激活算子中,对第二求和结果执行sigmoid激活运算,得到更新门的输出结果z t
第二加法算子、第二激活算子为另一部分算子中与更新门对应的算子。
在一种可选的方案中,所述从处理电路包括:乘法处理电路和累加处理电路;在得到当前记忆门的 输出中间结果时,
所述乘法处理电路,用于将接收到的输入数据块、权值和偏置输入到第五乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第六乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路,用于将该乘积结果进行累加运算,得到当前记忆门的输入中间结果(W in*x t+b in),将另一乘积结果进行累加运算,得到当前记忆门的输出中间结果(W nz*h t-1+b nz);
所述乘法处理电路,用于将重置门的输出结果r t输入到第一向量乘法算子中,对重置门的输出结果r t与当前记忆门的输出中间结果执行点乘运算,得到第一点乘结果;
其中,第五乘法算子、第六乘法算子、第一向量乘法算子为部分算子中与当前记忆门对应的算子,W in、W hn、b in和b hn为权值和偏置中分别与当前记忆门对应的第一权值、第二权值、第一偏置和第二偏置。
在一种可选的方案中,所述主处理电路包括激活处理电路和加法处理电路;在得到当前记忆门的输出结果n t时,
所述加法处理电路,用于将当前记忆门的输入中间结果和第一点乘结果输入到第三加法算子中,对当前记忆门的输入中间结果和点乘结果执行求和运算,得到第三求和结果;
所述激活处理电路,用于将第三求和结果输入到第三激活算子中,对第三求和结果执行tanh激活运算,得到当前记忆门的输出结果n t
第三加法算子、第三激活算子为另一部分算子中与当前记忆门对应的算子。
在一种可选的方案中,所述主处理电路包括加法处理电路,所述从处理电路包括乘法处理电路;在得到输出层的输出结果h t时,
所述主处理电路,用于将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1发送给从处理电路;
所述乘法处理电路,用于将更新门的输出结果z t以及当前记忆门的输出结果n t输入到第二向量乘法算子,对更新门的输出结果z t以及当前记忆门的输出结果n t执行点乘,得到第二点乘结果,将接收到的更新门的输出结果z t以及输出数据h t-1输入到第三向量乘法算子,对更新门的输出结果z t以及输出数据h t-1执行点乘运算,得到第三点乘结果,将第二点乘结果和第三点乘结果发送给主处理电路;
所述加法处理电路,用于将当前记忆门的输出结果n t以及第二点乘结果输入到第一减法算子中,对当前记忆门的输出结果n t以及点乘结果执行减法运算,得到第一差值结果,将第三点乘结果以及第一差值结果输入到第四加法算子,对第三点乘结果以及第一差值结果执行求和,得到输出结果ht;
其中,第二向量乘法算子、第三向量乘法算子为部分算子中与输出层对应的算子,第一减法算子、第四加法算子为另一部分算子中与输出层对应的算子。
在一种可选的方案中,所述主处理电路包括转换处理电路;
所述转换处理电路,用于将输出结果h t输入到另一部分算子中的整形算子和拆分算子,将输出结果h t的数据格式调整为预设格式,得到最终输出结果。
第十方面,提供一种神经网络芯片,其特征在于,所述神经网络芯片包括第九方面提供的计算装置。
第十一方面,提供一种电子设备,所述电子设备包括第十方面提供的芯片。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1-1为一种LSTM的结构示意图
图1-2是本申请实施例提供的一种计算装置的结构示意图。
图1-2a是本申请实施例提供的一种运算单元的结构示意图。
图1-3是本申请提供的另一种计算装置的结构示意图。
图1-3a是本申请提供的主处理电路的结构示意图。
图1-4a是本申请提供的一种树型模块发送端的结构示意图。
图1-4b是本申请提供的一种树型模块接收端的结构示意图。
图1-4c是本申请提供的二叉树结构示意图。
图1-5是本申请一个实施例提供的计算装置的结构图。
图1-6是本申请一个实施例提供的LSTM运算方法的流程示意图。
图1-7是本申请实施例提供的一种组合处理装置的结构图。
图1-8是本申请实施例提供的另一种组合处理装置的结构图。
图1-9是本申请实施例提供的一种板卡的结构示意图。
图2-1为一种GRU的结构示意图
图2-2是本申请实施例提供的一种计算装置的结构示意图。
图2-2a是本申请实施例提供的一种运算单元的结构示意图。
图2-3是本申请提供的另一种计算装置的结构示意图。
图2-3a是本申请提供的主处理电路的结构示意图。
图2-3b是本申请提供的从处理电路的结构示意图。
图2-4a是本申请提供的一种树型模块发送端的结构示意图。
图2-4b是本申请提供的一种树型模块接收端的结构示意图。
图2-4c是本申请提供的二叉树结构示意图。
图2-5是本申请一个实施例提供的计算装置的结构图。
图2-6是本申请一个实施例提供的GRU的运算方法的流程示意图。
图2-7是本申请实施例提供的一种组合处理装置的结构图。
图2-8是本申请实施例提供的另一种组合处理装置的结构图。
图2-9是本申请实施例提供的一种板卡的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
参阅图1-1,图1-1为一种LSTM的示意图,如图1-1所示,该LSTM包括:输入门、忘记门、更新状态单元和输出门,其对应的计算公式如下:
f t=σ(W f[h t-1,x t]+b f)
i t=σ(W i[h t-1,x t]+b i)
g t=tanh(W g[h t-1,x t]+b g)
O t=σ(W o[h t-1,x t]+b o)
h t=O t⊙tanh(Ct)
C t=C t-1⊙f t+g t⊙i t
其中,xt为第t时刻的输入数据,ht-1表示t-1时刻的输出数据,Wf、Wi、Wg和Wo分别表示遗忘门、输入门、更新状态单元和输出门所对应的权值向量,bf、bi、bc和bo分别表示忘记门、输入门、更新状态单元和输出门所对应偏置;ft表示忘记门的输出,与t-1时刻的状态单元进行点乘来有选择的
遗忘过去的状态单元值;it表示输入门的输出,与t时刻的得到的候选状态值点乘来有选择地将t时刻的候选状态值加入到更新状态单元中;gt表示t时刻计算得到的候选状态值;ct表示通过将t-1时刻的状态值有选择的遗忘和将t时刻的状态值有选择的加入得到的新的状态值,ct将在计算最终输出时刻被使用并传输到下一时刻;Ot表示t时刻更新状态单元中需要作为结果部分输出的选择条件;ht表示t时刻的输出,同时它还将被传输到下一时刻(即t+1时刻);⊙为向量按元素运算的乘积;σ为sigmoid函数,计算公式为:
σ(x)=1/(1+e^(-x))
激活函数tanh函数的计算公式为
tanh(x)=(e^x-e^(-x))/(e^x+e^(-x))
在具体计算的时候,本申请将Wf、Wi、Wg和Wo拼成一个矩阵W,bf、bi、bc和bo拼成一个矩阵b。
参阅图1-2,图1-2为本申请提供的计算装置。参阅图1-2,提供了一种计算装置,该计算装置用于执行LSTM运算,该计算装置包括:控制器单元11、运算单元12和存储单元10,其中,控制器单元11与运算单元12、存储单元10连接,该运算单元12包括:一个主处理电路101和从处理电路102(可以为一个或多个从处理电路,优先选择多个从处理电路);
需要说明的,上述主处理电路自身包含有存储器(例如内存或寄存器),该存储器可以存储主处理电路的一些数据,从处理电路可以选择携带存储器。
LSTM包括:输入门、忘记门、输出门和更新状态门;
存储单元10,用于存储LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;
控制器单元11,用于获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
运算单元12,用于依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
可选的,上述所述控制器单元,具体用于根据LSTM算子构建多个拆分算子、多个排序算子、乘法算子、激活算子以及加法算子;
所述主处理电路,具体用于依据排序算子将输入数据Xt、权值数据以及输入状态值进行重排序,所述权值数据包括:各个门的权值数据,然后依据拆分算法将各个门的权值数据以及乘法算子广播至从处理电路,将输入数据以及输入状态值拆分成多个输入数据块以及多个输入状态数据块,将多个输入数据块以及多个输入状态数据块分发给所述从处理电路;
所述从处理电路,用于依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果,依据乘法算子将所述多个输入状态数据块与各个门的权值数据执行乘法运算得到各个门的状态中间结果,将各个门的中间结果以及各个门的状态中间结果发送至主处理电路;
需要说明的是,上述各个门中每个门的运算是相对独立的,计算结果也相对独立,即每个门均具有各自的权值数据,例如Wf、Wi、Wg和Wo分别代表4个门的权值数据。
上述依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果具体可以包括:
将多个输入数据块与输入门权值数据执行乘法运算得到输入门的中间结果,多个输入数据块与输出门权值数据执行乘法运算得到输出门的中间结果,多个输入数据块与忘记门权值数据执行乘法运算得到 忘记门的中间结果,多个输入数据块与更新状态门权值数据执行乘法运算得到更新状态门的中间结果。上述各个门的状态中间结果与各个门的中间结果类似,这里不再赘述。
所述主处理电路,用于依据排序算子将每个门的中间结果排序得到各个门的排序结果,依据加法算子将各个门的排序结果执行偏置运算得到各个门的运算结果,依据排序算子将每个状态中间结排序得到各个门的状态排序结果,依据加法算子将各个门的状态排序结果执行偏置运算得到各个门的状态运算结果;依据加法算子将各个门的运算结果以及各个门的状态运算结果对应相加后进行后续处理得到各个门的输出结果。
本申请提供的技术方案将运算单元设置成主从结构,对于LSTM的正向运算,将本时刻的输入数据以及忘记门的输出数据拆分并行处理,这样通过主处理电路以及从处理电路即能够对计算量较大的部分进行并行运算,从而提高运算速度,节省运算时间,进而降低功耗。
可选的,所述主处理电路,具体用于依据乘法算子将输入状态值Ct-1与忘记门的输出结果ft相乘得到第一结果,依据乘法算子将更新状态门的输出结果gt与输入门的输出结果it相乘得到第二结果,将第一结果与第二结果相加得到输出状态值Ct。
可选的,所述主处理电路,具体用于依据激活算子对输出状态值Ct执行激活运算得到激活结果,将输出门的输出结果Ot与激活结果相乘得到输出结果ht。
可选的,所述后续处理具体包括:
如为忘记门、输入门和输出门,所述后续处理为sigmoid运算;
如为更新状态门,所述后续处理为激活运算tanh函数。
可选的,所述主处理电路,还用于将输出数据ht作为下一时刻的输入结果,将输出状态值Ct作为下一时刻的输入状态值。
上述LSTM可以包含多个隐层,h为大于等于2的整数,对于第h个隐层可以为LSTM中的任意一个中间隐层的运算,多个LSTM运算,其实现过程是,在正向运算中,当上一时刻t-1执行完成正向运算之后得到输出结果t-1,当前时刻t的运算算子会将上一时刻输出结果t-1作为下一时刻的忘记门的输入数据,忘记门通过sigmoid来确定以上时刻输出结果t-1的通过率,这样即得到了忘记门t时刻的输出结果t,将输出结果t与权值进行运算,另一部分运算为时刻t输入层的输入数据作为另一部分输入神经元,然后将两部分输入神经元分别与权值执行乘积运算得到两个运算结果,将两个运算结果相加即得到时刻t的输出结果,然后将时刻t的输出结果作为下一时刻t+1忘记门的输入数据,这样即能够有选择的确定上一时刻的结果的通过率。
可选的,上述计算装置还可以包括:直接内存访问单元50,存储单元10可以包括:寄存器、缓存中的一个或任意组合,具体的,所述缓存,用于存储计算算子;所述寄存器,用于存储所述输入数据和标量;所述缓存为高速暂存缓存。直接内存访问单元50用于从存储单元10读取或存储数据。
可选的,该控制器单元包括:算子存储单元110、算子处理单元111和存储队列单元113;
算子存储单元110,用于存储所述LSTM运算关联的计算算子;
所述算子处理单元111,用于对所述计算算子解析得到多个运算算子;
存储队列单元113,用于存储算子队列,该算子队列包括:按该队列的前后顺序待执行的多个运算算子或多个计算算子。
可选的,该控制器单元还可以包括:
所述依赖关系处理单元108,用于在具有多个运算算子时,确定第一运算算子与所述第一运算算子之前的第零运算算子是否存在关联关系,如所述第一运算算子与所述第零运算算子存在关联关系,则将所述第一运算算子缓存在所述算子存储单元内,在所述第零运算算子执行完毕后,从所述算子存储单元提取所述第一运算算子传输至所述运算单元;
所述确定该第一运算算子与第一运算算子之前的第零运算算子是否存在关联关系包括:
依据所述第一运算算子提取所述第一运算算子中所需数据(例如矩阵)的第一存储地址区间,依据 所述第零运算算子提取所述第零运算算子中所需矩阵的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,则确定所述第一运算算子与所述第零运算算子具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一运算算子与所述第零运算算子不具有关联关系。
在另一种可选实施例中,运算单元12如图1-3所示,可以包括一个主处理电路101和多个从处理电路102。在一个实施例里,如图1-3所示,多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,需要说明的是,如图1-3所示的K个从处理电路仅包括第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,即该k个从处理电路为多个从处理电路中直接与主处理电路连接的从处理电路。
K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的数据(该数据可以为输入数据块、输入状态数据块、中间结果、状态中间结果等等)以及算子转发。
可选的,如图1-3a所示,该主处理电路还可以包括:转换处理电路110、激活处理电路111、加法处理电路112中的一种或任意组合;
转换处理电路110,用于数据执行转换处理,具体为:将主处理电路接收的数据(包括但不限于:输入数据Xt、权值数据(各个门的权值)、输入状态值Ct-1、输入结果ht-1)执行第一数据结构与第二数据结构之间的互换(例如连续数据与离散数据的转换,例如浮点数据与定点数据的转换)。
激活处理电路111,用于执行主处理电路内数据的激活运算;
加法处理电路112,用于执行加法运算或累加运算。
另一个实施例里,该运算算子为矩阵乘以矩阵的算子、累加算子、激活算子等等计算算子。
在一种可选的实施方案中,如图1-4a所示,所述运算单元包括:树型模块40,所述树型模块包括:一个根端口401和多个支端口404,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
上述树型模块具有收发功能,例如如图1-4a所示,该树型模块即为发送功能,如图1-4b所示,该树型模块即为接收功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据(该数据可以为输入数据块、输入状态数据块、中间结果、状态中间结果等等)。
可选的,该树型模块为计算装置的可选择结果,其可以包括至少1层节点,该节点为具有转发功能的线结构,该节点本身可以不具有计算功能。如树型模块具有零层节点,即无需该树型模块。
可选的,该树型模块可以为n叉树结构,例如,如图1-4c所示的二叉树结构,当然也可以为三叉树结构,该n可以为大于等于2的整数。本申请具体实施方式并不限制上述n的具体取值,上述层数也可以为2,从处理电路可以连接除倒数第二层节点以外的其他层的节点,例如可以连接如图1-4c所示的倒数第一层的节点。
可选的,上述运算单元可以携带单独的缓存,如图1-2a所示,可以包括:神经元缓存单元,该神经元缓存单元63缓存该从处理电路的输入神经元向量数据和输出神经元值数据。
如图1-2a,该运算单元还可以包括:权值缓存单元64,用于缓存该从处理电路在计算过程中需要的权值数据。
在一种可选实施例中,运算单元12如图1-5所示,可以包括分支处理电路103;其具体的连接结构如图1-5所示,其中,
上述分支处理电路103可以包括存储器,如图1-5所示,分支处理电路103的存储器的大小可以为在单个从处理电路需要存储的最大数据容量的2到2.5倍之间,这样设置以后,从处理电路即无需设置存储器,相对于一个分支处理电路,其只用设置2.5*R(单个从处理器电路所需的容量值),如果没有分支处理电路,那么需要设置4*R,并且其寄存器的利用率还低,因此该结构可以有效的降低存储器的总容量,降低成本。
所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的(该数据可以为输入数据块、输入状态数据块、中间结果、状态中间结果等等)。
下面通过一个实例的例子来说明上述输入数据的拆分的方式(上述输入状态数据的拆分也可以参见输入数据的拆分),对于输出结果与输入数据因为数据类型相同,其拆分的方式基本相同,假设该数据类型为矩阵,该矩阵为H*W,则拆分的方式可以为,如H的数值较小(小于设定阈值,例如100),那么在沿H方向将矩阵H*W拆分成H个向量(每个向量为矩阵H*W的一行),每个向量即为一个输入数据块,并对输入数据块的第一元素的位置标记在输入数据块,即输入数据块h,w,其中,h、w分别为输入数据块h,w的第一元素在H方向以及W方向的值,例如第一输入数据块,该h=1.w=1。从处理电路接收到输入数据块h,w后,将输入数据块h,w与权值每列元素一一对应相乘和累加运算得到中间结果w,i,中间结果的w为输入数据块的w值,i为与输入数据块计算的列元素的列数值,主处理电路确定中间结果在对应门的运算结果的位置为w、i。例如,输入数据块1,1与权值第一列计算得到的输入中间结果1,1,主处理电路将输入中间结果1,1排列在对应门的运算结果第一行第一列。
本申请还提供一种LSTM运算方法,所述方法应用于计算装置,所述LSTM包括:输入门、忘记门、输出门和更新状态门,所述计算装置包括:运算单元、控制器单元、存储单元;所述存储单元存储:LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;所述方法包括如下步骤:
步骤S601、控制器单元获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
步骤S602、运算单元依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
可选的,所述运算单元包括:主处理电路以及从处理电路;所述运算单元依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果具体包括:
所述控制器单元根据LSTM算子构建多个拆分算子、多个排序算子、乘法算子、激活算子以及加法算子;
所述主处理电路依据排序算子将输入数据Xt、权值数据以及输入状态值进行重排序,所述权值数据包括:各个门的权值数据,然后依据拆分算法将各个门的权值数据以及乘法算子广播至从处理电路,将输入数据以及输入状态值拆分成多个输入数据块以及多个输入状态数据块,将多个输入数据块以及多个输入状态数据块分发给所述从处理电路;
所述从处理电路依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果,依据乘法算子将所述多个输入状态数据块与各个门的权值数据执行乘法运算得到各个门的状态中间结果,将各个门的中间结果以及各个门的状态中间结果发送至主处理电路;
所述主处理电路依据排序算子将每个门的中间结果排序得到各个门的排序结果,依据加法算子将各个门的排序结果执行偏置运算得到各个门的运算结果,依据排序算子将每个状态中间结排序得到各个门的状态排序结果,依据加法算子将各个门的状态排序结果执行偏置运算得到各个门的状态运算结果;依据加法算子将各个门的运算结果以及各个门的状态运算结果对应相加后进行后续处理得到各个门的输出结果。
可选的,依据输入状态值Ct-1以及各个门的输出结果得到输出状态值Ct具体包括:
所述主处理电路依据乘法算子将输入状态值Ct-1与忘记门的输出结果ft相乘得到第一结果,依据乘法算子将更新状态门的输出结果gt与输入门的输出结果it相乘得到第二结果,将第一结果与第二结果相加得到输出状态值Ct。
可选的,所述依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht具体包括:
所述主处理电路依据激活算子对输出状态值Ct执行激活运算得到激活结果,将输出门的输出结果 Ot与激活结果相乘得到输出结果ht。
本申请还揭露了一个LSTM装置,其包括一个或多个在本申请中提到的计算装置,用于从其他处理装置中获取待运算数据和控制信息,执行指定的LSTM运算,执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口,服务器。当包含一个以上计算装置时,计算装置间可以通过特定的结构进行链接并传输数据,譬如,通过PCIE总线进行互联并传输数据,以支持更大规模的卷积神经网络训练的运算。此时,可以共享同一控制系统,也可以有各自独立的控制系统;可以共享内存,也可以每个加速器有各自的内存。此外,其互联方式可以是任意互联拓扑。
该LSTM装置具有较高的兼容性,可通过PCIE接口与各种类型的服务器相连接。
本申请还揭露了一个组合处理装置,其包括上述的LSTM装置,通用互联接口,和其他处理装置。LSTM运算装置与其他处理装置进行交互,共同完成用户指定的操作。图1-7为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为LSTM运算装置与外部数据和控制的接口,包括数据搬运,完成对本LSTM运算装置的开启、停止等基本控制;其他处理装置也可以和LSTM运算装置协作共同完成运算任务。
通用互联接口,用于在所述LSTM装置与其他处理装置间传输数据和控制算子。该LSTM装置从其他处理装置中获取所需的输入数据,写入LSTM装置片上的存储装置;可以从其他处理装置中获取控制算子,写入LSTM装置片上的控制缓存;也可以读取LSTM装置的存储模块中的数据并传输给其他处理装置。
可选的,该结构如图1-8所示,还可以包括存储装置,存储装置分别与所述LSTM装置和所述其他处理装置连接。存储装置用于保存在所述LSTM装置和所述其他处理装置的数据,尤其适用于无法全部保存的所需要运算的数据在本LSTM装置或其他处理装置的内部存储中无法全部保存的数据。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上系统,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一些实施例里,还申请了一种芯片,其包括了上述LSTM装置或组合处理装置。
在一些实施例里,申请了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,申请了一种板卡,其包括了上述芯片封装结构。参阅图1-9,图1-9提供了一种板卡,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的,所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路,可以带动多个负载。因此,所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制装置可以实现对所述芯片中多个处理芯片、多个处理和或多个处理电路的工作状态的调控。
在一些实施例里,申请了一种电子设备,其包括了上述板卡。
电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
对于上述的LSTM,在实际应用中,还可以产生很多的变形,例如如图2-1所示,即为一种LSTM的变形,图2-1为本申请实施例提供的一种GRU的示意图,如图2-1所示,该GRU(门控循环单元,Gated Recurrent Unit)包括:输入层、隐层、重置门、更新门、当前记忆门和输出层,其中,输入层分别与重置门、更新门和当前记忆门连接,上一个GRU单元的隐层分别与当前GRU单元重置门、更新门、当前记忆门以及输出层连接,GRU为LSTM(长短期记忆网络,Long Short-Term Memory)的一种变形体,图2-1中重置门的输出结果z t用于控制前一时刻的状态信息被带入到当前状态中的程度,重置门的输出结果r t用于控制前一状态有多少信息被写入到当前记忆门的输出结果n t上,重置的输出结果r t门越小,前一状态的信息被写入的越少,通过重置门和更新门的控制,合适的信息将会被写入输出结果h t中,并保存到GRU的隐层中,并传递到下一个GRU单元,这样就解决了循环神经网络随着时间衰减的问题。
参阅图2-2,图2-2为本申请实施例提供的一种计算装置,该计算装置用于执行GRU运算,该GRU包括:输入层、隐层、重置门、更新门、当前记忆门和输出层;
所述计算装置,用于获取输入层t时刻输入的输入数据x t、前一个GRU的隐层输入的输出数据h t-1和权值;
所述计算装置,用于从预先封装的函数库中调用预先构造的GRU算子;
所述计算装置,用于将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t
本申请提供的技术方案将GRU的运算过程预先编译成对应的算子,从而实现在MLU上执行GRU的运算,无需CPU对指令译码以及数据内存的访问,提高了GRU的运算速度,提高了运行效率。
可选的,所述计算装置,在将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t时,具体用于:
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t
将输入数据x t、输出数据h t-1、权值以及重置门的的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t
在一可能的示例中,在t=0时,即x t为第零时刻的输入数据时,输入的输出数据h -1为预先设置的一个初始化值,且在GRU为多层GRU时,输入的输出数据h t-1为一个初始化的向量,主处理电路在将输入数据x t拆分为多个输入数据块时,需将该输出数据h t-1拆分为多个输出数据块,并将该多个输出数 据块分发到与每层GRU的对应的从处理电路中,以保证计算每层GRU在t=0的输出结果h 0时,接收到的输出数据h -1不同;当t>0时,计算本层的GRU在t时刻的输出时,在接收到上一个GRU的隐层输入的输出数据h t-1后,由于在得到每层GRU的输出结果h t,主处理电路会将该层的输出结果h t输入到整形算子和拆分算子中,得到最终输出结果,故本层GRU接收到的上一个GRU的隐层输入的输出数据h t-1本质上为已经拆分好的多个输出数据块,所以,主处理电路无需对输出数据h t-1进行数据的拆分操作,只需将接收到的输出数据h t-1分发到对应的从处理电路,即可执行本层GRU的运算过程。
其中,算子是一个函数空间到另一个函数空间上的映射。
其中,预先构造算子的理由是:要在机器学习处理器MLU(Machine Learning processor Unit,MLU)实现GRU的运算。该机器学习处理器MLU应用于机器学习运算,其中,机器学习运算包括神经网络运算、k-means运算、支持向量机运算等,该机器学习处理器MLU具体可以包括NPU(Neural-Network Processing Unit,神经网络处理器单元)、DSP(Digital Signal Process,数字信号处理)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)芯片的一种或组合。而MLU的数据是封装好的,无法获取到数据的存储地址,故无法像在CPU上通过对指令译码的方式,使用for循环移动指针来实现GRU的运算。可以理解,MLU执行的运算往往需要构造算子来完成,MLU上的算子较难复用,故预先将GRU的每个运算过程编译成与其对应的算子,得到多个算子,将编译好的多个算子预先封装在函数库中,在执行GRU运算时,通过函数接口从预先封装的函数库中调取相应的GRU算子,将输入数据输入到调取的GRU算子中,执行与GRU算子对应的运算过程,得到输出结果。例如,在MLU上执行a和b的加法操作得到c时,需提前构造一个加法算子,将a和b输入到该加法算子中,执行求和操作,得到c,如果需要执行另外一个加法操作,需要再构造一个加法算子。
可选的,GRU的运算如下所示:
r t=sigmoid(W ir*x t+b ir+W hr*h t-1+b hr);
z t=sigmoid(W iz*x t+b iz+W hz*h t-1+b hz);
n t=tanh(W in*x t+b in+r t·(W hn*h t-1+b hn));
h t=(1-z t)·n t+z t·h t-1。
其中,x t为t时刻的输入数据,h t-1为前一个GRU的隐层输入的输出数据,r t表示重置门的输出,z t表示更新门的输出,n t表示当前记忆门的输出,h t表示t时刻的输出结果,W r、W z和W n分别表示与重置门、更新门、当前记忆门对应的权值,b r、b z和b n分别表示重置门、更新门、当前记忆门所对应偏置,W ir、W hr、b ir、b hr为分别与重置门对应的第一权值、第二权值、第一偏置、第二偏置,W iz、W hz、b iz、b hz为分别与更新门对应的第一权值、第二权值、第一偏置、第二偏置,W in、W hn、b in、b hn为分别与当前记忆门对应的第一权值、第二权值、第一偏置、第二偏置。
现有技术中,在执行GRU的运算时,需要先将W ir和W hr拼接为W r,将W iz和W hz拼接为W z,将W in和W hn拼接为W n,以及将b ir和b hr拼接为b r,将b iz和b hz拼接为b z,将b in和b hn拼接为b n,即W r=[W ir,W hr],W z=[W iz,W hz],W n=[W in,W hn],b r=[b ir,b hr],b z=[b iz,b hz],b n=[b in,b hn],其中,[]表示向量拼接的拼接算法,得到与重置门、更新门以及当前记忆门对应的权值W r、W z、W n以及偏置b r、b z、b n后,再与输入数据x t和输出数据h t-1进行相应的运算,得到输出结果h t,由于本申请中通过构造算子的方式来实现GRU的每一步运算过程,如进行向量拼接,则在调用每个算子进行运算时,需将拼接好的权值与偏置拆分,得到每个算子需要的权值与偏置,进行了无效的拼接和拆分过程,影响运算速度,所以本申请在获取输入的权值和偏置后,将权值和偏置预先拆分为成与重置门、更新门以及当前记忆门对应的权值和偏置块,并对各个权值和偏置块添加与各个门、输入数据h t以及输出数据h t-1对应的标识信息,在计算每个门的输出结果时依据标识信息查询到与该门对应的权值和偏置后,直接与对应的输入数据和输出数据进行运算,保证在MLU上执行GRU运算的同时,提高GRU的运算速度,提高运算效率。
可以理解的,基于上述GRU的运算过程,构造下面GRU算子实现在MLU上执行GRU的运算:
具体来讲,构造与重置门对应的算子,具体为:构造第一乘法算子(W ir*x t+b ir)、第二乘法算子 (W hr*h t-1+b hr)、第一加法算子,用于对第一乘法算子和第二乘法算子的输出结果求和,第一激活算子,用于激活第一加法算子的输出结果,得到重置门的输出r t,第一激活算子的激活类型为sigmoid;构造与更新门对应的算子,具体为:构造第三乘法算子W iz*x t+b iz、第四乘法算子W hz*h t-1+b hz,第二加法算子,用于对第三乘法算子和第四乘法算子的输出结果求和,第二激活算子,用于激活第二加法算子的输出结果,得到更新门的输出z t,第二激活算子的激活类型为sigmoid;构造与当前记忆门对应的算子,具体为:构造第五乘法算子W in*x t+b in和第六乘法算子W hn*h t-1+b hn、第一向量乘法算子r t·(W hn*h t-1+b hn),即用于对第六乘法算子的输出结果与r t执行点乘,第三加法算子,用于对第五乘法算子和第一向量乘法算子的输出结果求和,第三激活算子,用于激活第三加法算子的输出结果,得到当前记忆门的输出结果n t,第三激活算子的激活类型为tanh;构造与输出层对应的算子,具体为:构造第二向量乘法算子,对z t和n t执行点乘,计算z t·n t,第一减法算子,用于对n t和第二向量乘法算子的输出结果执行减法法,计算(n t-z t·n t),即(1-z t)·n t,第三向量乘法算子,对z t和h t-1执行点乘,计算z t·h t-1,第四加法算子,对第三向量乘法算子的输出结果和第一减法算子的输出结果求和,计算(1-zt)·n t+z t·h t-1,得到t时刻的输出结果h t
可选的,在从预先封装的函数库中调用预先构造的GRU算子之前,所述计算装置还用于获取偏置。
可选的,在得到重置门的输出结果时,所述计算装置,具体用于:获取所述GRU算子中与重置门对应的第一乘法算子、第二乘法算子、第一加法算子以及第一激活算子,所述第一激活算子的激活类型为sigmoid;将输入数据x t、权值以及偏置输入到所述第一乘法算子中,计算(W ir*x t+b ir),得到第一运算结果,其中,W ir和b ir为权值和偏置中分别与重置门对应的第一权值和第一偏置;将输出数据h t-1、权值以及偏置输入到所述第二乘法算子中,计算(W hr*h t-1+b hr),得到第二运算结果,其中,W hr和b hr为权值和偏置中分别与重置门对应的第二权值和第二偏置;将所述第一运算结果和所述第二运算结果输入到所述第一加法算子中求和,得到第一求和结果;将所述第一求和结果输入到所述第一激活算子中激活,得到重置门的输出结果r t
可选的,在得到更新门的输出结果时,所述计算装置,具体用于:获取所述GRU算子中与更新门对应的第三乘法算子、第四乘法算子、第二加法算子以及第二激活算子,所述第二激活算子的激活类型为sigmoid;将输入数据x t、权值以及偏置输入到所述第三乘法算子中,计算(W iz*x t+b iz),得到第三运算结果,其中,W ir和b ir为权值和偏置中分别与更新门对应的第一权值和第一偏置;将输出数据h t-1、权值以及偏置输入到所述第四乘法算子中,计算(W hz*h t-1+b hz),得到第四运算结果,其中,W hz和b hz为权值和偏置中分别与更新门对应的第二权值和第二偏置;将所述第三运算结果和所述第四运算结果输入到所述第二加法算子中,得到第二求和结果;将所述第二求和结果输入到所述第二激活算子中激活,得到更新门的输出结果z t
可选的,在得到当前记忆门的输出结果n t时,所述计算装置,具体用于:获取所述GRU算子中与当前记忆门对应的第五乘法算子、第六乘法算子、第一向量乘法算子、第三加法算子、第三激活算子,所述第三激活算子的激活类型为tanh;将输入数据x t、权值以及偏置输入到所述第五乘法算子,计算(W in*x t+b in),得到第五运算结果,其中,W in和b in为权值和偏置中分别与当前记忆门对应的第一权值和第一偏置;将输出数据h t-1、权值以及偏置输入到所述第六乘法算子,计算(W hn*h t-1+b hn),得到第六运算结果,其中,W hn和b hn为权值和偏置中分别与当前记忆门对应的第二权值和第二偏置;将所述第六运算结果以及重置门的输出结果r t输入到所述第一向量乘法算子,对重置门的输出数据r t与所述第六运算结果进行点乘,得到第一点乘结果;将所述第一点乘结果与所述第五运算结果输入到所述第三加法算子中求和,得到第三求和结果;将所述第三求和结果输入到所述第三激活算子激活,得到当前记忆门的输出结果n t
可选的,所述计算装置,具体用于:获取所述GRU算子中与输出层对应的第二向量乘法算子、第一减法算子、第三向量乘法算子、第四加法算子;将更新门的输出结果z t以及当前记忆门的输出结果n t输入到所述第二向量乘法算子,进行点乘运算,得到第二点乘结果;将所述当前记忆门的输出结果n t以及所述第二点乘结果输入到所述第一减法算子,执行减法运算,得到第一差值结果;将更新门的输出 结果z t和输出数据h t-1输入到所述第三向量乘法算子,进行点乘运算,得到第三点乘结果;将所述第一差值结果和所述第三点乘结果输入到所述第四加法算子求和,得到输出结果h t
可选的,如图2-2所示,上述计算装置具体包括:运算单元以及控制器单元;所述运算单元包括:一个主处理电路和至少一个从处理电路;
所述控制器单元,用于获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值;
所述控制器单元,用于从预先封装的函数库中调用预先构造的GRU算子;
所述控制器单元,用于将输入数据x t、输出数据h t-1、权值以及GRU算子发送给所述主处理电路;
所述主处理电路,用于将输入数据x t拆分为多个输入数据块、将输出数据h t-1拆分为多个输出数据h t-1,将多个输入数据块、输出数据h t-1分发给从处理电路,将权值以及GRU算子中的部分算子广播给从处理电路;
从处理电路,用于将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与重置门对应的算子中,得到重置门对应的中间结果,将该中间结果发送给主处理电路,主处理电路将该中间结果输入到GRU算子中的另一部分算子中与重置门对应的算子中,得到重置门的输出结果r t
从处理电路,用于将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与更新门对应的算子中,得到更新门的中间结果,将该中间结果发送给主处理电路,所述主处理电路将该中间结果输入到另一部分算子中与更新门对应的算子中,得到重置门的输出结果r t
所述主处理电路,用于将重置门的输出结果r t分发给从处理电路;
从处理电路将接收到的输入数据块、输出数据h t-1、权值、输出结果r t输入到部分算子中与当前记忆门对应的算子中,得到当前记忆门的中间结果,将当前记忆门的中间结果发送给主处理电路,所述主处理电路将当前记忆门的中间结果输入到另一部分算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
所述主处理电路,用于将更新门的输出结果z t、当前记忆门的输出结果n t、输出数据h t-1输入到另一部分算子与输出层对应的算子中,得到输出结果h t
可选的,上述计算装置还可以包括:存储单元10和直接内存访问单元50,存储单元10可以包括:寄存器、缓存中的一个或任意组合,具体的,所述缓存,用于存储计算指令;所述寄存器,用于存储所述输入数据和标量;所述缓存为高速暂存缓存。直接内存访问单元50用于从存储单元10读取或存储数据。
可选的,该控制器单元包括:指令存储单元110、指令处理单元111和存储队列单元113;
指令存储单元110,用于存储GRU运算关联的GRU算子;
所述指令处理单元111,用于对所述GRU算子解析得到多个GRU算子;
存储队列单元113,用于存储指令队列,该指令队列包括:按该队列的前后顺序待执行的多个GRU算子。
上述寄存器可以为片外存储器,当然在实际应用中,也可以为片内存储器,用于存储数据,该数据具体可以为多维(2维以上)数据。
可选的,该控制器单元还可以包括:
所述依赖关系处理单元108,用于在具有多个GRU算子时,确定第一GRU算子与所述第一GRU算子之前的第零GRU算子是否存在关联关系,如所述第一GRU算子与所述第零GRU算子存在关联关系,则将所述第一GRU算子缓存在所述指令存储单元内,在所述第零GRU算子执行完毕后,从所述指令存储单元提取所述第一GRU算子传输至所述运算单元;
所述确定该第一GRU算子与第一GRU算子之前的第零运算指令是否存在关联关系包括:
依据所述第一GRU算子提取所述第一GRU算子中所需数据(例如矩阵)的第一存储地址区间,依据所述第零GRU算子提取所述第零GRU算子中所需矩阵的第零存储地址区间,如所述第一存储地址区 间与所述第零存储地址区间具有重叠的区域,则确定所述第一GRU算子与所述第零GRU算子具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,则确定所述第一GRU算子与所述第零GRU算子不具有关联关系。
在另一种可选实施例中,运算单元12如图2-3所示,可以包括一个主处理电路101和多个从处理电路102。在一个实施例里,如图2-3所示,多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个从处理电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,需要说明的是,如图2-3所示的K个从处理电路仅包括第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路,即该k个从处理电路为多个从处理电路中直接与主处理电路连接的从处理电路。
K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果的转发。
可选的,如图2-3a所示,主处理电路101还可以包括:转换处理电路110、激活处理电路111、加法处理电路112中的一种或任意组合;
转换处理电路110,用于数据执行转换处理,具体为:在执行GRU运算之前,转换处理电路110,具体用于:获取主处理电路101接收的整形算子和拆分算子,将主处理电路101接收的输入数据x t、输出数据h t-1权值以及偏置调整为预设的四维张量格式,即执行第一数据结构与第二数据结构之间的互换(例如连续数据与离散数据的转换);在得到输出结果h t时,将输出结果h t输入到另一部分算子中的整形算子和拆分算子,将输出结果h t调整为预设格式(即四维张量格式),得到最终输出结果。
激活处理电路111,用于执行主处理电路内数据的激活运算;
加法处理电路112,用于执行加法运算或累加运算。
可选的,如图2-3b所示,从处理电路102还可以包括:乘法处理电路120和累加处理电路121中的一种或任意组合;
乘法处理电路120,用于执行从处理电路内数据的乘法运算,如向量和向量的点乘运算、矩阵和矩阵点乘运算、矩阵和矩阵的卷积运算、矩阵和向量的卷积运算,等等;
累加处理电路121,用于执行累加运算。
另一个实施例里,该GRU算子中所要执行的计算指令为矩阵乘以矩阵的指令、累加指令、激活指令等等计算指令。
在一种可选的实施方案中,如图2-4a所示,所述运算单元包括:树型模块40,所述树型模块包括:一个根端口401和多个支端口404,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
上述树型模块具有收发功能,如图2-4a所示,该树型模块即为发送功能,如图2-4b所示,该树型模块即为接收功能。
所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
可选的，该树型模块为计算装置的可选择结构，其可以包括至少1层节点，该节点为具有转发功能的线结构，该节点本身可以不具有计算功能。如树型模块具有零层节点，即无需该树型模块。
可选的,该树型模块可以为n叉树结构,例如,如图2-4c所示的二叉树结构,当然也可以为三叉树结构,该n可以为大于等于2的整数。本申请具体实施方式并不限制上述n的具体取值,上述层数也可以为2,从处理电路可以连接除倒数第二层节点以外的其他层的节点,例如可以连接如图2-4c所示的倒数第一层的节点。
可选的,运算单元12可以携带单独的缓存,如图2-2a所示,可以包括:神经元缓存单元,该神经元缓存单元63缓存该从处理电路的输入神经元向量数据和输出神经元值数据。
如图2-2a,该运算单元还可以包括:权值缓存单元64,用于缓存该从处理电路在计算过程中需要的权值数据。
在一种可选实施例中,运算单元12如图2-5所示,可以包括分支处理电路103;其具体的连接结构如图2-5所示,其中,
上述分支处理电路103可以包括存储器，如图2-5所示，分支处理电路103的存储器的大小可以在单个从处理电路需要存储的最大数据容量的2到2.5倍之间，这样设置以后，从处理电路即无需设置存储器，相对于一个分支处理电路，其只用设置2.5*R(单个从处理电路所需的容量值)，如果没有分支处理电路，那么需要设置4*R，并且其寄存器的利用率还低，因此该结构可以有效地降低存储器的总容量，降低成本。
所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
下面通过一个实际的例子来说明上述输入数据的拆分方式，由于输出数据与输入数据的数据类型相同，其拆分方式基本相同。假设该数据类型为矩阵，该矩阵为H*W，则拆分的方式可以为：如H的数值较小(小于设定阈值，例如100)，那么沿H方向将矩阵H*W拆分成H个向量(每个向量为矩阵H*W的一行)，每个向量即为一个输入数据块，并以输入数据块的第一元素的位置对该输入数据块进行标记，即输入数据块h,w，其中，h、w分别为输入数据块h,w的第一元素在H方向以及W方向的值，例如第一输入数据块，该h=1，w=1。从处理电路接收到输入数据块h,w后，将输入数据块h,w与权值每列元素一一对应相乘并累加，得到输入中间结果h,i，其中，h为输入数据块的h值，i为参与计算的权值列元素的列数值，主处理电路确定该中间结果在隐层输出结果中的位置为h、i。例如，输入数据块1,1与权值第一列计算得到输入中间结果1,1，主处理电路将输入中间结果1,1排列在隐层输出结果的第一行第一列。
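The splitting example in the preceding paragraph can be pictured with the following numpy sketch (illustrative only; the function names are assumptions): each row of the H*W input matrix becomes one input data block, a slave circuit multiplies its block with every column of the weight and accumulates, and the master places the accumulated value obtained with column i at row h, column i of the hidden-layer output.

import numpy as np

def split_into_row_blocks(x):
    # Split the H x W input matrix along H: block h is row h of the input.
    return [x[h, :] for h in range(x.shape[0])]

def assemble_hidden_output(x, weight):
    # weight has W rows; each of its columns produces one element per input data block.
    H, N = x.shape[0], weight.shape[1]
    out = np.zeros((H, N))
    for h, block in enumerate(split_into_row_blocks(x)):
        for i in range(N):
            # Slave: multiply the block with weight column i element by element and accumulate.
            out[h, i] = np.dot(block, weight[:, i])
    return out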
下面详细叙述在MLU上运算GRU的过程:
可选的，在得到重置门的输出中间结果时：乘法处理电路120，用于将接收到的输入数据块以及权值和偏置输入到第一乘法算子中，对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的元素值执行求和运算，得到乘积结果；将接收到的输出数据h t-1以及权值和偏置输入到第二乘法算子，对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的另一元素值执行求和运算，得到另一乘积结果；累加处理电路121，用于将所述乘积结果进行累加运算，得到重置门的输入中间结果(W ir*x t+b ir)，将另一乘积结果进行累加运算，得到重置门的输出中间结果(W hr*h t-1+b hr)；
其中,第一乘法算子、第二乘法算子为部分算子中与重置门对应的算子,W ir、W hr、b ir、和b hr为权值和偏置中分别与重置门对应的第一权值、第二权值、第一偏置和第二偏置。
可选的,在得到重置门的输出结果r t时,加法处理电路112,用于将重置门的输入中间结果和输出中间结果输入到第一加法算子中,对输入中间结果和输出中间结果执行求和运算,得到第一求和结果;激活处理电路111,用于将第一求和结果输入到第一激活算子中,对第一求和结果执行sigmoid激活运算,得到重置门的输出结果r t
第一加法算子、第一激活算子为另一部分算子中与重置门对应的算子。
可选的，在得到更新门的输出中间结果时，乘法处理电路120，用于将接收到的输入数据块以及权值和偏置输入到第三乘法算子，对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的元素值执行求和运算，得到乘积结果；将接收到的输出数据h t-1以及权值和偏置输入到第四乘法算子，对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的另一元素值执行求和运算，得到另一乘积结果；累加处理电路121，用于将该乘积结果进行累加运算，得到更新门的输入中间结果(W iz*x t+b iz)，将另一乘积结果进行累加运算，得到更新门的输出中间结果(W hz*h t-1+b hz)；
其中，第三乘法算子、第四乘法算子为部分算子中与更新门对应的算子，W iz、W hz、b iz和b hz为权值和偏置中分别与更新门对应的第一权值、第二权值、第一偏置和第二偏置。
可选的，在得到更新门的输出结果z t时，加法处理电路112，用于将更新门的输入中间结果和输出中间结果输入到第二加法算子中，对该输入中间结果和输出中间结果执行求和运算，得到第二求和结果；激活处理电路111，用于将第二求和结果输入到第二激活算子中，对第二求和结果执行sigmoid激活运算，得到更新门的输出结果z t；第二加法算子、第二激活算子为另一部分算子中与更新门对应的算子。
可选的，在得到当前记忆门的输出中间结果时，乘法处理电路120，用于将接收到的输入数据块、权值和偏置输入到第五乘法算子，对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的元素值执行求和运算，得到乘积结果；将接收到的输出数据h t-1以及权值和偏置输入到第六乘法算子，对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算，并将乘积结果与偏置中对应位置的另一元素值执行求和运算，得到另一乘积结果；累加处理电路121，用于将该乘积结果进行累加运算，得到当前记忆门的输入中间结果(W in*x t+b in)，将另一乘积结果累加，得到当前记忆门的输出中间结果(W hn*h t-1+b hn)；乘法处理电路120，用于将重置门的输出结果r t输入到第一向量乘法算子中，对重置门的输出结果r t与当前记忆门的输出中间结果执行点乘运算，得到第一点乘结果；
其中,第五乘法算子、第六乘法算子、第一向量乘法算子为部分算子中与当前记忆门对应的算子,W in、W hn、b in和b hn为权值和偏置中分别与当前记忆门对应的第一权值、第二权值、第一偏置和第二偏置。
可选的,在得到当前记忆门的输出结果n t时,加法处理电路112,用于将当前记忆门的输入中间结果和第一点乘结果输入到第三加法算子中,对当前记忆门的输入中间结果和点乘结果执行求和运算,得到第三求和结果;激活处理电路111,用于将第三求和结果输入到第三激活算子中,对第三求和结果执行tanh激活运算,得到当前记忆门的输出结果n t
第三加法算子、第三激活算子为另一部分算子中与当前记忆门对应的算子。
可选的,在确定输出层的输出结果时,主处理电路101,用于将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1发送给从处理电路102;
乘法处理电路120,用于将更新门的输出结果z t以及当前记忆门的输出结果n t输入到第二向量乘法算子,对更新门的输出结果z t以及当前记忆门的输出结果n t执行点乘,得到第二点乘结果,将接收到的更新门的输出结果z t以及输出数据h t-1输入到第三向量乘法算子,对更新门的输出结果z t以及输出数据h t-1执行点乘,得到第三点乘结果,将第二点乘结果和第三点乘结果发送给主处理电路101;加法处理电路112,用于将当前记忆门的输出结果n t以及第二点乘结果输入到第一减法算子中,对当前记忆门的输出结果n t以及点乘结果执行减法运算,得到第一差值结果,将第三点乘结果以及第一差值结果输入到第四加法算子,对第三点乘结果以及第一差值结果执行求和,得到输出结果h t
其中，第二向量乘法算子、第三向量乘法算子为部分算子中与输出层对应的算子，第一减法算子、第四加法算子为另一部分算子中与输出层对应的算子。
如图2-6所示,本申请还提供了一种GRU的运算方法,该GRU包括:输入层、隐层、重置门、更新门、当前记忆门和输出层,所述运算方法应用于计算装置,所述运算方法包括:
步骤S601:所述计算装置获取输入层t时刻输入的输入数据x t、前一个GRU的隐层输入的输出数据h t-1和权值。
步骤S602:所述计算装置从预先封装的函数库中调用预先构造的GRU算子。
步骤S603:所述计算装置将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t
可选的,所述将输入数据x t、输出数据h t-1、权值输入到所述预先构造的GRU算子中,得到输出结果h t具体包括:
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t
将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t
将输入数据x t、输出数据h t-1、权值以及重置门的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中，得到当前记忆门的输出结果n t；
将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t
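Combining the four groups of operators listed above, a single GRU step can be sketched as the following numpy function (illustrative only; it mirrors the operator names of the description but is not the claimed implementation, and the weight/bias shapes are assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev,
             W_ir, b_ir, W_hr, b_hr,   # reset gate
             W_iz, b_iz, W_hz, b_hz,   # update gate
             W_in, b_in, W_hn, b_hn):  # current memory gate
    # Reset gate: first/second multiplication operators, first addition operator, sigmoid activation.
    r_t = sigmoid(x_t @ W_ir.T + b_ir + h_prev @ W_hr.T + b_hr)
    # Update gate: third/fourth multiplication operators, second addition operator, sigmoid activation.
    z_t = sigmoid(x_t @ W_iz.T + b_iz + h_prev @ W_hz.T + b_hz)
    # Current memory gate: fifth/sixth multiplications, first vector multiplication, third addition, tanh.
    n_t = np.tanh(x_t @ W_in.T + b_in + r_t * (h_prev @ W_hn.T + b_hn))
    # Output layer: second/third vector multiplications, first subtraction, fourth addition.
    h_t = (n_t - z_t * n_t) + z_t * h_prev
    return h_t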
可选的,在从预先封装的函数库中调用预先构造的GRU算子之前,所述方法还包括:
所述计算装置获取偏置。
可选的,所述将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与重置门对应的算子中,得到重置门的输出结果r t具体包括:
获取所述GRU算子中与重置门对应的第一乘法算子、第二乘法算子、第一加法算子以及第一激活算子,所述第一激活算子的激活类型为sigmoid;
将输入数据x t、权值以及偏置输入到所述第一乘法算子中,计算(W ir*x t+b ir),得到第一运算结果,W ir和b ir为权值和偏置中分别与重置门对应的第一权值和第一偏置;
将输出数据h t-1、权值以及偏置输入到所述第二乘法算子中,计算(W hr*h t-1+b hr),得到第二运算结果,W hr和b hr为权值和偏置中分别与重置门对应的第二权值和第二偏置;
将所述第一运算结果和所述第二运算结果输入到所述第一加法算子中求和,得到第一求和结果;
将所述第一求和结果输入到所述第一激活算子中激活,得到重置门的输出结果r t
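In equation form, the reset-gate steps listed above compute:
r t = sigmoid((W ir*x t+b ir) + (W hr*h t-1+b hr))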
可选的,所述将输入数据x t、输出数据h t-1、权值输入到所述GRU算子中与更新门对应的算子中,得到更新门的输出结果z t具体包括:
获取所述GRU算子中与更新门对应的第三乘法算子、第四乘法算子、第二加法算子以及第二激活算子,所述第二激活算子的激活类型为sigmoid;
将输入数据x t、权值以及偏置输入到所述第三乘法算子中，计算(W iz*x t+b iz)，得到第三运算结果，其中，W iz和b iz为权值和偏置中分别与更新门对应的第一权值和第一偏置；
将输出数据h t-1、权值以及偏置输入到所述第四乘法算子中,计算(W hz*h t-1+b hz),得到第四运算结果,其中,W hz和b hz为权值和偏置中分别与更新门对应的第二权值和第二偏置;
将所述第三运算结果和所述第四运算结果输入到所述第二加法算子中,得到第二求和结果;
将所述第二求和结果输入到所述第二激活算子中激活,得到更新门的输出结果z t
可选的，所述将输入数据x t、输出数据h t-1、权值以及重置门的输出结果r t输入到所述GRU算子中与当前记忆门对应的算子中，得到当前记忆门的输出结果n t具体包括：
获取所述GRU算子中与当前记忆门对应的第五乘法算子、第六乘法算子、第一向量乘法算子、第三加法算子、第三激活算子,所述第三激活算子的激活类型为tanh;
将输入数据x t、权值以及偏置输入到所述第五乘法算子,计算(W in*x t+b in),得到第五运算结果,其中,W in和b in为权值和偏置中分别与当前记忆门对应的第一权值和第一偏置;
将输出数据h t-1、权值以及偏置输入到所述第六乘法算子，计算(W hn*h t-1+b hn)，得到第六运算结果，其中，W hn和b hn为权值和偏置中分别与当前记忆门对应的第二权值和第二偏置；
将所述第六运算结果以及重置门的输出结果r t输入到所述第一向量乘法算子,对重置门的输出数据r t与所述第六运算结果进行点乘,得到第一点乘结果;
将所述第一点乘结果与所述第五运算结果输入到所述第三加法算子中求和,得到第三求和结果;
将所述第三求和结果输入到所述第三激活算子激活,得到当前记忆门的输出结果n t
可选的,所述将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1输入到所述GRU算子中与输出层对应的算子中,得到输出结果h t具体包括:
获取所述GRU算子中与输出层对应的第二向量乘法算子、第一减法算子、第三向量乘法算子、第 四加法算子;
将更新门的输出结果z t以及当前记忆门的输出结果n t输入到所述第二向量乘法算子,进行点乘运算,得到第二点乘结果;
将所述当前记忆门的输出结果n t以及所述第二点乘结果输入到所述第一减法算子,执行减法运算,得到第一差值结果;
将更新门的输出结果z t和输出数据h t-1输入到所述第三向量乘法算子,进行点乘运算,得到第三点乘结果;
将所述第一差值结果和所述第三点乘结果输入到所述第四加法算子求和,得到输出结果h t
在一可能的示例中，所述计算装置具体包括：运算单元以及控制器单元；所述运算单元包括：一个主处理电路和至少一个从处理电路；所述方法具体包括：
所述控制器单元获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值;
所述控制器单元从预先封装的函数库中调用预先构造的GRU算子;
所述控制器单元将输入数据x t、输出数据h t-1、权值以及GRU算子发送给所述主处理电路;
所述主处理电路将输入数据x t拆分为多个输入数据块、将输出数据h t-1拆分为多个输出数据块，将多个输入数据块、多个输出数据块分发给从处理电路，将权值以及GRU算子中的部分算子广播给从处理电路；从处理电路将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与重置门对应的算子中，得到重置门的中间结果，将该中间结果发送给主处理电路，主处理电路将该中间结果输入到GRU算子中的另一部分算子中与重置门对应的算子中，得到重置门的输出结果r t；
从处理电路将接收到的输入数据块、输出数据h t-1、权值输入到部分算子中与更新门对应的算子中，得到更新门的中间结果，将该中间结果发送给主处理电路，所述主处理电路将该中间结果输入到另一部分算子中与更新门对应的算子中，得到更新门的输出结果z t；
所述主处理电路将重置门的输出结果r t分发给从处理电路;
从处理电路将接收到的输入数据块、输出数据h t-1、权值、输出结果r t输入到部分算子中与当前记忆门对应的算子中,得到当前记忆门的中间结果,将当前记忆门的中间结果发送给主处理电路,所述主处理电路将当前记忆门的中间结果输入到另一部分算子中与当前记忆门对应的算子中,得到当前记忆门的输出结果n t
所述主处理电路将更新门的输出结果z t、当前记忆门的输出结果n t、输出数据h t-1输入到另一部分算子与输出层对应的算子中,得到输出结果h t
可选的,在所述控制器单元获取输入层在t时刻的输入数据x t、前一个GRU的隐层输入的输出数据h t-1、权值时,所述方法还包括:所述控制器单元获取偏置,将偏置发送给所述主处理电路;所述主处理电路将偏置广播给从处理电路。
在上述可能的示例中,在t=0时,即x t为第零时刻的输入数据时,输入的输出数据h -1则为预先设置的一个初始化值,且在GRU为多层GRU时,输入的输出数据h t-1为一个初始化的向量,主处理电路在将输入数据x t拆分为多个输入数据块时,需将该输出数据h t-1拆分为多个输出数据块,并将该多个输出数据块分发到与每层GRU的对应的从处理电路中,以保证计算每层GRU在t=0的输出结果h 0时,接收到的输出数据h -1不同;当t>0时,计算本层的GRU在t时刻的输出时,在接收到上一个GRU的隐层输入的输出数据h t-1后,由于在得到每层GRU的输出结果h t,主处理电路会将该层的输出结果h t输入到整形算子和拆分算子中,得到最终输出结果,故本层GRU接收到的上一个GRU的隐层输入的输出数据h t-1本质上为已经拆分好的多个输出数据块,所以,主处理电路无需对输出数据h t-1进行数据的拆分操作,只需将接收到的输出数据h t-1分发到对应的从处理电路,即可执行本层GRU的运算过程。
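The initialization rule described above can be illustrated with a short sketch (numpy; step_fn stands for any single-GRU-step routine such as the one sketched earlier, and all names and shapes are assumptions made for this sketch):

import numpy as np

def run_gru_layer(inputs, step_fn, hidden_size):
    # inputs: list of x_t arrays (batch x input_size), one per time step t = 0, 1, ...
    batch = inputs[0].shape[0]
    # At t = 0 there is no previous output, so h_-1 is a preset initial value (zeros here).
    h_prev = np.zeros((batch, hidden_size))
    outputs = []
    for x_t in inputs:
        h_prev = step_fn(x_t, h_prev)   # the previous output becomes h_t-1 of the next step
        outputs.append(h_prev)
    return outputs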
可选的,如所述从处理电路的数量为多个,所述运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
所述树型模块转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
可选的,如所述从处理电路的数量为多个,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路;
所述分支处理电路转发所述主处理电路与所述多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
可选的，如所述从处理电路的数量为多个，所述多个从处理电路呈阵列分布；每个从处理电路与相邻的其他从处理电路连接，所述主处理电路连接所述多个从处理电路中的k个从处理电路，所述k个从处理电路为：第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路；
所述K个从处理电路，用于转发所述主处理电路以及多个从处理电路之间的输入数据块、输出数据h t-1、权值、偏置以及中间结果。
在一可能的示例中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到重置门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块以及权值和偏置输入到第一乘法算子中,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第二乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将所述乘积结果进行累加运算,得到重置门的输入中间结果(W ir*x t+b ir),将另一乘积结果累加,得到重置门的输出中间结果(W hr*h t-1+b hr);
其中,第一乘法算子、第二乘法算子为部分算子中与重置门对应的算子,W ir、W hr、b ir、和b hr为权值和偏置中分别与重置门对应的第一权值、第二权值、第一偏置和第二偏置。
在一可能的示例中,所述主处理电路包括激活处理电路和加法处理电路;所述得到重置门的输出结果r t具体包括:
所述加法处理电路将重置门的输入中间结果和输出中间结果输入到第一加法算子中,对输入中间结果和输出中间结果执行求和运算,得到第一求和结果;
所述激活处理电路将第一求和结果输入到第一激活算子中,对第一求和结果执行sigmoid激活运算,得到重置门的输出结果r t
第一加法算子、第一激活算子为另一部分算子中与重置门对应的算子。
在一可能的示例中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到更新门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块以及权值和偏置输入到第三乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1以及权值和偏置输入到第四乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将该乘积结果进行累加运算，得到更新门的输入中间结果(W iz*x t+b iz)，将另一乘积结果进行累加运算，得到更新门的输出中间结果(W hz*h t-1+b hz)；
其中，第三乘法算子、第四乘法算子为部分算子中与更新门对应的算子，W iz、W hz、b iz和b hz为权值和偏置中分别与更新门对应的第一权值、第二权值、第一偏置和第二偏置。
在一可能的示例中,所述主处理电路包括激活处理电路和加法处理电路;所述得到更新门的输出结果z t具体包括:
所述加法处理电路将更新门的输入中间结果和输出中间结果输入到第二加法算子中，对该输入中间结果和输出中间结果执行求和运算，得到第二求和结果；
所述激活处理电路将第二求和结果输入到第二激活算子中,对第二求和结果执行sigmoid激活运算,得到更新门的输出结果z t;
第二加法算子、第二激活算子为另一部分算子中与更新门对应的算子。
在一可能的示例中,所述从处理电路包括:乘法处理电路和累加处理电路;所述得到当前记忆门的输出中间结果具体包括:
所述乘法处理电路将接收到的输入数据块、权值和偏置输入到第五乘法算子,对接收到的输入数据块中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的元素值执行求和运算,得到乘积结果;将接收到的输出数据h t-1、权值和偏置输入到第六乘法算子,对接收到的输出数据h t-1中的元素值与权值中对应位置的元素值执行乘积运算,并将乘积结果与偏置中对应位置的另一元素值执行求和运算,得到另一乘积结果;
所述累加处理电路将该乘积结果进行累加运算，得到当前记忆门的输入中间结果(W in*x t+b in)，将另一乘积结果进行累加运算，得到当前记忆门的输出中间结果(W hn*h t-1+b hn)；
所述乘法处理电路将重置门的输出结果r t输入到第一向量乘法算子中,对重置门的输出结果r t与当前记忆门的输出中间结果执行点乘运算,得到第一点乘结果;
其中,第五乘法算子、第六乘法算子、第一向量乘法算子为部分算子中与当前记忆门对应的算子,W in、W hn、b in和b hn为权值和偏置中分别与当前记忆门对应的第一权值、第二权值、第一偏置和第二偏置。
在一可能的示例中,所述主处理电路包括激活处理电路和加法处理电路;所述得到当前记忆门的输出结果n t具体包括:
所述加法处理电路将当前记忆门的输入中间结果和第一点乘结果输入到第三加法算子中,对当前记忆门的输入中间结果和第一点乘结果执行求和运算,得到第三求和结果;
所述激活处理电路将第三求和结果输入到第三激活算子中,对第三求和结果执行tanh激活运算,得到当前记忆门的输出结果n t
第三加法算子、第三激活算子为另一部分算子中与当前记忆门对应的算子。
在一可能的示例中,所述主处理电路包括加法处理电路,所述从处理电路包括乘法处理电路;所述确定输出层的输出结果具体包括:
所述主处理电路将更新门的输出结果z t、当前记忆门的输出结果n t以及输出数据h t-1发送给从处理电路;
所述乘法处理电路将更新门的输出结果z t以及当前记忆门的输出结果n t输入到第二向量乘法算子,对更新门的输出结果z t以及当前记忆门的输出结果n t执行点乘,得到第二点乘结果,将接收到的更新门的输出结果z t以及输出数据h t-1输入到第三向量乘法算子,对更新门的输出结果z t以及输出数据h t-1执行点乘,得到第三点乘结果,将第二点乘结果和第三点乘结果发送给主处理电路;
所述加法处理电路将当前记忆门的输出结果n t以及第二点乘结果输入到第一减法算子中，对当前记忆门的输出结果n t以及第二点乘结果执行减法运算，得到第一差值结果，将第三点乘结果以及第一差值结果输入到第四加法算子，对第三点乘结果以及第一差值结果执行求和，得到输出结果h t；
其中,第二向量乘法算子、第三向量乘法算子为部分算子中与输出层对应的算子,第一减法算子、第四加法算子为另一部分算子中与输出层对应的算子。
在一可能的示例中,主处理电路包括转换处理电路;
所述转换处理电路将输出结果h t输入到另一部分算子中的整形算子和拆分算子,将输出结果h t的数据格式调整为预设格式,得到最终输出结果。
本申请还揭露了一个GRU装置,其包括一个或多个在本申请中提到的计算装置,用于从其他处理装置中获取待运算数据和控制信息,执行指定的GRU运算,执行结果通过I/O接口传递给外围设备。外围设备譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口,服务器。当包含一个以上计算装置时, 计算装置间可以通过特定的结构进行链接并传输数据,譬如,通过PCIE总线进行互联并传输数据,以支持更大规模的卷积神经网络训练的运算。此时,可以共享同一控制系统,也可以有各自独立的控制系统;可以共享内存,也可以每个加速器有各自的内存。此外,其互联方式可以是任意互联拓扑。
该GRU装置具有较高的兼容性,通过PCIE接口与各种类型的服务器相连接。
本申请还揭露了一个组合处理装置,其包括上述的GRU装置,通用互联接口,和其他处理装置。GRU运算装置与其他处理装置进行交互,共同完成用户指定的操作。图2-7为组合处理装置的示意图。
其他处理装置,包括中央处理器CPU、图形处理器GPU、神经网络处理器等通用/专用处理器中的一种或以上的处理器类型。其他处理装置所包括的处理器数量不做限制。其他处理装置作为GRU运算装置与外部数据和控制的接口,包括数据搬运,完成对本GRU运算装置的开启、停止等基本控制;其他处理装置也可以和GRU运算装置协作共同完成运算任务。
通用互联接口,用于在所述GRU装置与其他处理装置间传输数据和控制指令。该GRU装置从其他处理装置中获取所需的输入数据,写入GRU装置片上的存储装置;可以从其他处理装置中获取控制指令,写入GRU装置片上的控制缓存;也可以读取GRU装置的存储模块中的数据并传输给其他处理装置。
可选的，该结构如图2-8所示，还可以包括存储装置，存储装置分别与所述GRU装置和所述其他处理装置连接。存储装置用于保存所述GRU装置和所述其他处理装置的数据，尤其适用于所需运算的数据无法在本GRU装置或其他处理装置的内部存储中全部保存的情形。
该组合处理装置可以作为手机、机器人、无人机、视频监控设备等设备的SOC片上系统,有效降低控制部分的核心面积,提高处理速度,降低整体功耗。此情况时,该组合处理装置的通用互联接口与设备的某些部件相连接。某些部件譬如摄像头,显示器,鼠标,键盘,网卡,wifi接口。
在一些实施例里,还申请了一种芯片,其包括了上述GRU装置或组合处理装置。
在一些实施例里,申请了一种芯片封装结构,其包括了上述芯片。
在一些实施例里,申请了一种板卡,其包括了上述芯片封装结构。参阅图2-9,图2-9提供了一种板卡,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述芯片封装结构内的芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储装置可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
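The quoted 25600MB/s follows directly from the stated parameters (DDR4-3200 particles on a 64-bit data path per controller): 3200 MT/s × 64 bit ÷ 8 bit/Byte = 25600 MB/s.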
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述芯片封装结构内的芯片电连接。所述接口装置用于实现所述芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE3.0X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本申请并不限制上述其他的接口的具体表现形式,所述接口单元能够实现转接功能即可。另外,所述芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
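For reference, the 16000MB/s figure quoted for a PCIE 3.0 X16 link corresponds to the raw link rate before the 128b/130b encoding overhead: 8 GT/s per lane × 16 lanes ÷ 8 bit/Byte = 16000 MB/s.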
所述控制器件与所述芯片电连接。所述控制器件用于对所述芯片的状态进行监控。具体的，所述芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述芯片可以包括多个处理芯片、多个处理核或多个处理电路，可以带动多个负载。因此，所述芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述芯片中多个处理芯片、多个处理核和/或多个处理电路的工作状态的调控。
以上对本申请实施例进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (29)

  1. 一种计算装置,其特征在于,所述计算装置用于执行LSTM运算,所述LSTM包括:输入门、忘记门、输出门和更新状态门,所述计算装置包括:运算单元、控制器单元、存储单元;
    所述存储单元,用于存储LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;
    所述控制器单元,用于获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
    所述运算单元,用于依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
  2. 根据权利要求1所述的装置,其特征在于,所述运算单元包括:主处理电路以及从处理电路;
    所述控制器单元,具体用于根据LSTM算子构建多个拆分算子、多个排序算子、乘法算子、激活算子以及加法算子;
    所述主处理电路,具体用于依据排序算子将输入数据Xt、权值数据以及输入状态值进行重排序,所述权值数据包括:各个门的权值数据,然后依据拆分算法将各个门的权值数据以及乘法算子广播至从处理电路,将输入数据以及输入状态值拆分成多个输入数据块以及多个输入状态数据块,将多个输入数据块以及多个输入状态数据块分发给所述从处理电路;
    所述从处理电路,用于依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果,依据乘法算子将所述多个输入状态数据块与各个门的权值数据执行乘法运算得到各个门的状态中间结果,将各个门的中间结果以及各个门的状态中间结果发送至主处理电路;
    所述主处理电路,用于依据排序算子将每个门的中间结果排序得到各个门的排序结果,依据加法算子将各个门的排序结果执行偏置运算得到各个门的运算结果,依据排序算子将每个状态中间结果排序得到各个门的状态排序结果,依据加法算子将各个门的状态排序结果执行偏置运算得到各个门的状态运算结果;依据加法算子将各个门的运算结果以及各个门的状态运算结果对应相加后进行后续处理得到各个门的输出结果。
  3. 根据权利要求2所述的装置,其特征在于,
    所述主处理电路,具体用于依据乘法算子将输入状态值Ct-1与忘记门的输出结果ft相乘得到第一结果,依据乘法算子将更新状态门的输出结果gt与输入门的输出结果it相乘得到第二结果,将第一结果与第二结果相加得到输出状态值Ct。
  4. 根据权利要求3所述的装置,其特征在于,
    所述主处理电路,具体用于依据激活算子对输出状态值Ct执行激活运算得到激活结果,将输出门的输出结果Ot与激活结果相乘得到输出结果ht。
  5. 根据权利要求2所述的装置,其特征在于,所述后续处理具体包括:
    如为忘记门、输入门和输出门,所述后续处理为sigmoid运算;
    如为更新状态门,所述后续处理为激活运算tanh函数。
  6. 根据权利要求2所述的装置,其特征在于,
    所述主处理电路,还用于将输出数据ht作为下一时刻的输入结果,将输出状态值Ct作为下一时刻的输入状态值。
  7. 根据权利要求2-6任意一项所述的装置,其特征在于,如所述从处理电路的数量为多个,所述运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;
    所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
  8. 根据权利要求2-6任意一项所述的装置,其特征在于,如所述从处理电路的数量为多个,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路,
    所述分支处理电路,用于转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
  9. 根据权利要求2-6任意一项所述的装置，其特征在于，如所述从处理电路的数量为多个，所述多个从处理电路呈阵列分布；每个从处理电路与相邻的其他从处理电路连接，所述主处理电路连接所述多个从处理电路中的k个从处理电路，所述k个从处理电路为：第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路；
    所述K个从处理电路,用于转发所述主处理电路以及多个从处理电路之间的数据以及算子。
  10. 根据权利要求2-6任意一项所述的装置,其特征在于,所述主处理电路包括:转换处理电路;
    所述转换处理电路,用于对数据执行转换处理,具体为:将主处理电路接收的数据执行第一数据结构与第二数据结构之间的互换。
  11. 根据权利要求2-6所述的装置,其特征在于,所述从处理电路包括:乘法处理电路和累加处理电路;
    所述乘法处理电路,用于对接收到的输入数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的乘积结果;接收到的输入状态数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的另一乘积结果;
    所述累加处理电路,用于对该各个门的乘积结果执行累加运算得到各个门的中间结果,将该各个门的另一乘积结果执行累加运算得到各个门的状态中间结果。
  12. 根据权利要求7所述的装置,其特征在于,所述树型模块为n叉树结构,所述n为大于等于2的整数。
  13. 一种LSTM运算装置,其特征在于,所述LSTM运算装置包括一个或多个如权利要求1-12任一项所述的计算装置,用于从其他处理装置中获取待运算数据和控制信息,并执行指定的LSTM运算,将执行结果通过I/O接口传递给其他处理装置;
    当所述LSTM装置包含多个所述计算装置时,所述多个所述计算装置间可以通过特定的结构进行连接并传输数据;
    其中,多个所述计算装置通过快速外部设备互连总线PCIE总线进行互联并传输数据,以支持更大规模的LSTM的运算;多个所述计算装置共享同一控制系统或拥有各自的控制系统;多个所述计算装置共享内存或者拥有各自的内存;多个所述计算装置的互联方式是任意互联拓扑。
  14. 一种组合处理装置,其特征在于,所述组合处理装置包括如权利要求13所述的LSTM运算装置,通用互联接口和其他处理装置;
    所述LSTM运算装置与所述其他处理装置进行交互,共同完成用户指定的计算操作。
  15. 根据权利要求14所述的组合处理装置,其特征在于,还包括:存储装置,该存储装置分别与所述LSTM运算装置和所述其他处理装置连接,用于保存所述LSTM运算装置和所述其他处理装置的数据。
  16. 一种神经网络芯片,其特征在于,所述神经网络芯片包括如权利要求1所述的计算装置或如权利要求13所述的LSTM运算装置或如权利要求15所述的组合处理装置。
  17. 一种电子设备,其特征在于,所述电子设备包括如所述权利要求16所述的芯片。
  18. 一种板卡,其特征在于,所述板卡包括:存储器件、接口装置和控制器件以及如权利要求16所述的神经网络芯片;
    其中,所述神经网络芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
    所述存储器件,用于存储数据;
    所述接口装置,用于实现所述芯片与外部设备之间的数据传输;
    所述控制器件,用于对所述芯片的状态进行监控。
  19. 根据权利要求18所述的板卡,其特征在于,
    所述存储器件包括:多组存储单元,每一组所述存储单元与所述芯片通过总线连接,所述存储单元为:DDR SDRAM;
    所述芯片包括:DDR控制器,用于对每个所述存储单元的数据传输与数据存储的控制;
    所述接口装置为:标准PCIE接口。
  20. 一种LSTM运算方法,其特征在于,所述方法应用于计算装置,所述LSTM包括:输入门、忘记门、输出门和更新状态门,所述计算装置包括:运算单元、控制器单元、存储单元;所述存储单元存储:LSTM运算算子、输入数据Xt、权值数据、输出数据ht、输入状态值Ct-1、输入结果ht-1、输出状态值Ct;
    所述方法包括如下步骤:
    所述控制器单元获取输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子,将输入数据Xt、权值数据、输入状态值Ct-1、输入结果ht-1、以及LSTM运算算子发送至运算单元,
    所述运算单元依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果,依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht以及输出状态值Ct。
  21. 根据权利要求20所述的方法,其特征在于,所述运算单元包括:主处理电路以及从处理电路;所述运算单元依据输入数据Xt、权值数据、输入结果ht-1以及LSTM运算算子执行输入门的运算、忘记门的运算、输出门的运算以及更新状态门的运算得到各个门的输出结果具体包括:
    所述控制器单元根据LSTM算子构建多个拆分算子、多个排序算子、乘法算子、激活算子以及加法算子;
    所述主处理电路依据排序算子将输入数据Xt、权值数据以及输入状态值进行重排序,所述权值数据包括:各个门的权值数据,然后依据拆分算法将各个门的权值数据以及乘法算子广播至从处理电路,将输入数据以及输入状态值拆分成多个输入数据块以及多个输入状态数据块,将多个输入数据块以及多个输入状态数据块分发给所述从处理电路;
    所述从处理电路依据乘法算子将所述多个输入数据块与各个门的权值数据执行乘法运算得到各个门的中间结果,依据乘法算子将所述多个输入状态数据块与各个门的权值数据执行乘法运算得到各个门的状态中间结果,将各个门的中间结果以及各个门的状态中间结果发送至主处理电路;
    所述主处理电路依据排序算子将每个门的中间结果排序得到各个门的排序结果，依据加法算子将各个门的排序结果执行偏置运算得到各个门的运算结果，依据排序算子将每个状态中间结果排序得到各个门的状态排序结果，依据加法算子将各个门的状态排序结果执行偏置运算得到各个门的状态运算结果；依据加法算子将各个门的运算结果以及各个门的状态运算结果对应相加后进行后续处理得到各个门的输出结果。
  22. 根据权利要求21所述的方法,其特征在于,依据输入状态值Ct-1以及各个门的输出结果得到输出状态值Ct具体包括:
    所述主处理电路依据乘法算子将输入状态值Ct-1与忘记门的输出结果ft相乘得到第一结果,依据乘法算子将更新状态门的输出结果gt与输入门的输出结果it相乘得到第二结果,将第一结果与第二结果相加得到输出状态值Ct。
  23. 根据权利要求21所述的方法,其特征在于,所述依据输入状态值Ct-1以及各个门的输出结果得到输出数据ht具体包括:
    所述主处理电路依据激活算子对输出状态值Ct执行激活运算得到激活结果,将输出门的输出结果Ot与激活结果相乘得到输出结果ht。
  24. 根据权利要求21所述的方法,其特征在于,所述后续处理具体包括:
    如为忘记门、输入门和输出门,所述后续处理为sigmoid运算;
    如为更新状态门,所述后续处理为激活运算tanh函数。
  25. 根据权利要求21所述的方法,其特征在于,所述方法还包括:
    所述主处理电路将输出数据ht作为下一时刻的输入结果,将输出状态值Ct作为下一时刻的输入状态值。
  26. 根据权利要求20-25任意一项所述的方法,其特征在于,如所述从处理电路的数量为多个,所述运算单元包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;所述方法还包括:
    所述树型模块转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
  27. 根据权利要求20-25任意一项所述的方法,其特征在于,如所述从处理电路的数量为多个,所述运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路,所述方法还包括:
    所述分支处理电路转发所述主处理电路与所述多个从处理电路之间的数据以及算子。
  28. 根据权利要求20-25任意一项所述的方法，其特征在于，如所述从处理电路的数量为多个，所述多个从处理电路呈阵列分布；每个从处理电路与相邻的其他从处理电路连接，所述主处理电路连接所述多个从处理电路中的k个从处理电路，所述k个从处理电路为：第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路；所述方法还包括：
    所述K个从处理电路转发所述主处理电路以及多个从处理电路之间的数据以及算子。
  29. 根据权利要求20-25所述的方法,其特征在于,所述从处理电路包括:乘法处理电路和累加处理电路;所述方法具体包括:
    所述乘法处理电路对接收到的输入数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的乘积结果;接收到的输入状态数据块中的元素值与各个门的权值中对应位置的元素值执行乘积运算得到各个门的另一乘积结果;
    所述累加处理电路对该各个门的乘积结果执行累加运算得到各个门的中间结果,将该各个门的另一乘积结果执行累加运算得到各个门的状态中间结果。
PCT/CN2019/105932 2018-12-20 2019-09-16 计算装置及板卡 WO2020125092A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811560966.5A CN109711540B (zh) 2018-12-20 2018-12-20 一种计算装置及板卡
CN201811560966.5 2018-12-20
CN201811579542.3 2018-12-21
CN201811579542.3A CN109670581B (zh) 2018-12-21 2018-12-21 一种计算装置及板卡

Publications (1)

Publication Number Publication Date
WO2020125092A1 true WO2020125092A1 (zh) 2020-06-25

Family

ID=71100404

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105932 WO2020125092A1 (zh) 2018-12-20 2019-09-16 计算装置及板卡

Country Status (1)

Country Link
WO (1) WO2020125092A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106775599A (zh) * 2017-01-09 2017-05-31 南京工业大学 递归神经网络的多计算单元粗粒度可重构系统及方法
CN107341542A (zh) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 用于执行循环神经网络和lstm运算的装置和方法
US20180005676A1 (en) * 2016-06-30 2018-01-04 Samsung Electronics Co., Ltd. Memory cell unit and recurrent neural network including multiple memory cell units
CN108268939A (zh) * 2016-12-30 2018-07-10 上海寒武纪信息科技有限公司 用于执行lstm神经网络运算的装置和运算方法
CN108446761A (zh) * 2018-03-23 2018-08-24 中国科学院计算技术研究所 一种神经网络加速器及数据处理方法
CN108805273A (zh) * 2018-05-20 2018-11-13 复旦大学 一种lstm中门控单元加速运算的硬件实现电路
CN109670581A (zh) * 2018-12-21 2019-04-23 北京中科寒武纪科技有限公司 一种计算装置及板卡
CN109711540A (zh) * 2018-12-20 2019-05-03 北京中科寒武纪科技有限公司 一种计算装置及板卡


Similar Documents

Publication Publication Date Title
CN109522052B (zh) 一种计算装置及板卡
CN109543832B (zh) 一种计算装置及板卡
CN109104876B (zh) 一种运算装置及相关产品
CN111047022B (zh) 一种计算装置及相关产品
CN111488976B (zh) 神经网络计算装置、神经网络计算方法及相关产品
CN109670581B (zh) 一种计算装置及板卡
CN109711540B (zh) 一种计算装置及板卡
CN111488963B (zh) 神经网络计算装置和方法
CN110059797B (zh) 一种计算装置及相关产品
CN111767995B (zh) 运算方法、装置及相关产品
CN111047021B (zh) 一种计算装置及相关产品
CN110059809B (zh) 一种计算装置及相关产品
CN111368967B (zh) 一种神经网络计算装置和方法
CN111368986B (zh) 一种神经网络计算装置和方法
CN111382848B (zh) 一种计算装置及相关产品
CN111381882B (zh) 数据处理装置及相关产品
CN112766475B (zh) 处理部件及人工智能处理器
WO2020125092A1 (zh) 计算装置及板卡
CN111367567B (zh) 一种神经网络计算装置和方法
CN110472734A (zh) 一种计算装置及相关产品
CN111368987B (zh) 一种神经网络计算装置和方法
CN111368990B (zh) 一种神经网络计算装置和方法
CN111258641B (zh) 运算方法、装置及相关产品
CN111291871A (zh) 一种计算装置及相关产品
CN111260046A (zh) 运算方法、装置及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19899228

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19899228

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/01/2022)
