WO2020125092A1 - Computing device and board card - Google Patents
- Publication number
- WO2020125092A1 (PCT/CN2019/105932)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- gate
- result
- operator
- processing circuit
- output
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- This application relates to the field of neural networks, in particular to a computing device and a board.
- LSTM: long short-term memory network
- RNN: recurrent neural network
- LSTM is suitable for processing and predicting important events with very long intervals and delays in time series.
- the LSTM network shows better performance. It is well suited to learning from experience in order to classify, process, and predict time series in which important events are separated by time lags of unknown duration.
- LSTM networks are widely used.
- the existing LSTM network is implemented based on a general-purpose processor, and the existing processor has high energy consumption for performing LSTM operations.
- This application provides a calculation method and related products, which can increase the processing speed of LSTM and save power consumption.
- a computing device for performing an LSTM operation, where the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate, and the computing device includes: an arithmetic unit, a controller unit, and a storage unit;
- the storage unit is used to store the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, output state value Ct;
- the controller unit is used to obtain the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and to send the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator to the arithmetic unit;
- the arithmetic unit is used to perform the input gate operation, forget gate operation, output gate operation, and update status gate operation based on the input data Xt, the weight data, the input result ht-1, and the LSTM operator to obtain the output result of each gate, and to obtain the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
- the arithmetic unit includes: a master processing circuit and a slave processing circuit;
- the controller unit is specifically configured to construct a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator based on the LSTM operator;
- the master processing circuit is specifically used to reorder the input data Xt, the weight data, and the input state value according to a sorting operator, where the weight data includes the weight data of each gate; to broadcast the weight data of each gate and the multiplication operator to the slave processing circuit; to split the input data and the input state value into multiple input data blocks and multiple input state data blocks according to a split operator; and to distribute the multiple input data blocks and the multiple input state data blocks to the slave processing circuit;
- the slave processing circuit is configured to multiply the multiple input data blocks by the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, to multiply the multiple input state data blocks by the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and to send the intermediate result and the state intermediate result of each gate to the master processing circuit;
- the master processing circuit is used to sort the intermediate result of each gate according to the sorting operator to obtain the sorting result of each gate, and to perform an offset operation on the sorting result of each gate according to the addition operator to obtain the operation result of each gate; to sort the state intermediate result of each gate according to the sorting operator to obtain the state sorting result of each gate, and to perform an offset operation on the state sorting result of each gate according to the addition operator to obtain the state operation result of each gate; and to add the operation result and the state operation result of each gate correspondingly according to the addition operator, then apply subsequent processing to obtain the output result of each gate.
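The split, broadcast, multiply, and offset flow described above can be sketched in plain Python. This is a minimal illustration, not the patented circuit: the function names, the contiguous block layout, and the use of separate weight matrices for the input data and the input state value are assumptions made for the example.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def split_blocks(vec: Vector, num_blocks: int) -> List[Tuple[int, Vector]]:
    """Master: split a vector into contiguous (start_index, block) pairs."""
    size = -(-len(vec) // num_blocks)  # ceiling division
    return [(i, vec[i:i + size]) for i in range(0, len(vec), size)]


def slave_multiply_accumulate(weight: Matrix, start: int, block: Vector) -> Vector:
    """Slave: multiply a block by the matching weight columns, accumulate per row."""
    return [sum(row[start + j] * v for j, v in enumerate(block)) for row in weight]


def master_combine(partials: List[Vector], bias: Vector) -> Vector:
    """Master: sum the slaves' partial results, then apply the offset."""
    return [sum(column) + b for column, b in zip(zip(*partials), bias)]


def gate_preactivation(w_in: Matrix, w_state: Matrix, b_in: Vector, b_state: Vector,
                       x: Vector, state: Vector, num_slaves: int = 2) -> Vector:
    """One gate: input path plus state path, before the subsequent activation."""
    in_parts = [slave_multiply_accumulate(w_in, s, blk)
                for s, blk in split_blocks(x, num_slaves)]
    st_parts = [slave_multiply_accumulate(w_state, s, blk)
                for s, blk in split_blocks(state, num_slaves)]
    in_result = master_combine(in_parts, b_in)        # operation result of the gate
    st_result = master_combine(st_parts, b_state)     # state operation result
    return [a + b for a, b in zip(in_result, st_result)]
```

Because each slave only needs its own block and the broadcast weights, the per-block products are independent and can run in parallel; the master is left with cheap additions.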
- the master processing circuit is specifically configured to multiply the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain the first result, to multiply the output result gt of the update status gate by the output result it of the input gate according to the multiplication operator to obtain the second result, and to add the first result and the second result to obtain the output state value Ct.
- the master processing circuit is specifically configured to perform an activation operation on the output state value Ct according to the activation operator to obtain an activation result, and to multiply the output result Ot of the output gate by the activation result to obtain the output result ht.
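The state update described above can be written as a short element-wise sketch. The function name is hypothetical, and the use of tanh as the activation applied to Ct is an assumption (it is the conventional LSTM choice; the text here only says "activation operator").

```python
import math
from typing import List, Tuple

Vector = List[float]


def lstm_state_update(c_prev: Vector, f_t: Vector, i_t: Vector,
                      g_t: Vector, o_t: Vector) -> Tuple[Vector, Vector]:
    """Combine the gate outputs into the new state Ct and the output ht."""
    first = [c * f for c, f in zip(c_prev, f_t)]    # Ct-1 * ft
    second = [g * i for g, i in zip(g_t, i_t)]      # gt * it
    c_t = [a + b for a, b in zip(first, second)]    # Ct = first + second
    # Assumed activation: tanh applied to Ct, then scaled by the output gate.
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t
```

Feeding the returned `h_t` and `c_t` back in as the next step's input result and input state value gives the recurrence over time.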
- the subsequent processing specifically includes:
- the subsequent processing is a tanh activation operation.
- the master processing circuit is further configured to use the output data ht as the input result at the next moment and the output state value Ct as the input state value at the next moment.
- the arithmetic unit includes: a tree module, and the tree module includes: a root port and multiple branch ports; the root port of the tree module is connected to the master processing circuit, and each of the multiple branch ports of the tree module is connected to one of the plurality of slave processing circuits;
- the tree module is used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
- the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit,
- the branch processing circuit is used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
- the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the other adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
- the K slave processing circuits are used to forward data and operators between the master processing circuit and the plurality of slave processing circuits.
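As an illustration of which array positions are wired to the master processing circuit, the following sketch enumerates them for an m-by-n array. The zero-based (row, column) coordinates and the function name are assumptions for the example; corner circuits that belong to both a row and the first column are counted once.

```python
from typing import Set, Tuple


def master_connected_slaves(m: int, n: int) -> Set[Tuple[int, int]]:
    """Return the (row, col) positions of the slaves connected to the master."""
    first_row = {(0, col) for col in range(n)}
    last_row = {(m - 1, col) for col in range(n)}   # the m-th row
    first_col = {(row, 0) for row in range(m)}
    return first_row | last_row | first_col
```

Every other slave reaches the master only through these edge circuits, which is why they are described as forwarding data and operators.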
- the master processing circuit includes: a conversion processing circuit;
- the conversion processing circuit is configured to perform conversion processing on the data, specifically: converting the data received by the master processing circuit between a first data structure and a second data structure.
- the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit;
- the multiplication processing circuit is used to perform a product operation on each element value in the received input data block and the element value at the corresponding position in the weight of each gate to obtain the product result of each gate, and to perform a product operation on each element value in the received input state data block and the element value at the corresponding position in the weight of each gate to obtain another product result of each gate;
- the accumulation processing circuit is configured to perform an accumulation operation on the product result of each gate to obtain an intermediate result of each gate, and to perform an accumulation operation on the other product result of each gate to obtain a state intermediate result of each gate.
- the tree module is an n-ary tree structure, where n is an integer greater than or equal to 2.
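To see how the fan-out n affects the tree module, the sketch below counts how many levels of forwarding nodes a complete n-ary tree needs to reach a given number of slave processing circuits. The function name and the assumption of a complete, balanced tree are illustrative only.

```python
def forwarding_levels(num_slaves: int, n: int) -> int:
    """Levels of n-ary forwarding nodes needed to reach num_slaves leaves."""
    if n < 2:
        raise ValueError("n must be an integer greater than or equal to 2")
    if num_slaves < 1:
        raise ValueError("at least one slave processing circuit is required")
    levels, reachable = 1, n  # a single root node already fans out to n ports
    while reachable < num_slaves:
        reachable *= n
        levels += 1
    return levels
```

A larger n shortens the path between the root port and the slaves at the cost of wider nodes; for example, 16 slaves need four levels with n = 2 but two levels with n = 4.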
- an embodiment of the present application provides an LSTM computing device, which includes one or more computing devices provided in the first aspect and is used to obtain data to be calculated and control information from other processing devices, perform the specified LSTM operation, and pass the execution result to other processing devices through the I/O interface;
- the multiple computing devices may be connected and transmit data through a specific structure
- a plurality of the computing devices are interconnected and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale LSTM operations; the plurality of computing devices share the same control system or have their own control systems; the plurality of computing devices share memory or have their own memories; the interconnection mode of the plurality of computing devices is any interconnection topology.
- a combined processing device including the LSTM computing device of the second aspect, a universal interconnection interface, and other processing devices;
- the LSTM computing device interacts with the other processing device to jointly complete the calculation operation specified by the user.
- in a fourth aspect, a neural network chip is provided, which includes the computing device provided in the first aspect, the LSTM computing device provided in the second aspect, or the combined processing device provided in the third aspect.
- in a fifth aspect, an electronic device is provided, which includes the chip provided in the fourth aspect.
- a board card including: a storage device, an interface device and a control device, and a neural network chip provided in the fourth aspect;
- the neural network chip is respectively connected to the storage device, the control device and the interface device;
- the storage device is used to store data
- the interface device is used to realize data transmission between the chip and an external device
- the control device is used for monitoring the state of the chip.
- an embodiment of the present application further provides an LSTM operation method.
- the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate; the computing device includes: an arithmetic unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, and output state value Ct;
- the method includes the following steps:
- the controller unit acquires the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator, and sends the input data Xt, the weight data, the input state value Ct-1, the input result ht-1, and the LSTM operator to the arithmetic unit;
- the arithmetic unit performs the input gate operation, forget gate operation, output gate operation, and update status gate operation according to the input data Xt, the weight data, the input result ht-1, and the LSTM operator to obtain the output result of each gate, and obtains the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
- the electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a server, a cloud server, a video camera, a projector, a watch, headphones, mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
- the vehicle includes an airplane, a ship, and/or a road vehicle;
- the household appliance includes a TV, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood;
- the medical device includes a nuclear magnetic resonance instrument, an ultrasound scanner, and/or an electrocardiograph.
- an operation method of a gated recurrent unit (GRU), where the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer.
- the operation method is applied to a computing device,
- the calculation method includes:
- the computing device obtains the input data x_t input to the input layer at time t, the output data h_{t-1} input to the hidden layer from the previous GRU, and the weights;
- the computing device calls a pre-constructed GRU operator from a pre-packaged function library;
- the computing device inputs the input data x_t, the output data h_{t-1}, and the weights into the pre-constructed GRU operator to obtain an output result h_t.
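A minimal sketch of the "look up a pre-constructed operator, then call it" pattern follows. The registry `FUNCTION_LIBRARY`, the key `"gru"`, and the dictionary layout of the weights and offsets are hypothetical; the gate arithmetic follows the weight and offset names used in this document (W_ir, b_ir, and so on), with the vector "dot product" taken to be element-wise.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
Matrix = List[List[float]]


def affine(w: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W*v + b with plain lists."""
    return [sum(a * x for a, x in zip(row, v)) + bias for row, bias in zip(w, b)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_operator(x_t: Vector, h_prev: Vector,
                 w: Dict[str, Matrix], b: Dict[str, Vector]) -> Vector:
    """One GRU step built from reset gate, update gate, memory gate, output layer."""
    r_t = sigmoid([p + q for p, q in zip(affine(w["W_ir"], b["b_ir"], x_t),
                                         affine(w["W_hr"], b["b_hr"], h_prev))])
    z_t = sigmoid([p + q for p, q in zip(affine(w["W_iz"], b["b_iz"], x_t),
                                         affine(w["W_hz"], b["b_hz"], h_prev))])
    n_t = [math.tanh(p + r * q)
           for p, r, q in zip(affine(w["W_in"], b["b_in"], x_t), r_t,
                              affine(w["W_hn"], b["b_hn"], h_prev))]
    # Output layer: (n_t - z_t*n_t) + z_t*h_prev, all element-wise.
    return [(n - z * n) + z * h for n, z, h in zip(n_t, z_t, h_prev)]


# Hypothetical "pre-packaged function library": a name-to-operator registry.
FUNCTION_LIBRARY: Dict[str, Callable[..., Vector]] = {"gru": gru_operator}
```

Usage is a lookup followed by a call, for example `FUNCTION_LIBRARY["gru"](x_t, h_prev, weights, offsets)`, which keeps the caller independent of how the operator is assembled internally.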
- inputting the input data x_t, the output data h_{t-1}, and the weights into the pre-constructed GRU operator to obtain the output result h_t specifically includes:
- before calling the pre-constructed GRU operator from the pre-packaged function library, the method further includes:
- the computing device obtains the offset.
- inputting the input data x_t, the output data h_{t-1}, and the weights into the operator corresponding to the reset gate in the GRU operator to obtain the output result r_t of the reset gate specifically includes:
- the activation type of the first activation operator is sigmoid;
- the first summation result is input into the first activation operator for activation, and the output result r_t of the reset gate is obtained.
- inputting the input data x_t, the output data h_{t-1}, and the weights into the operator corresponding to the update gate in the GRU operator to obtain the output result z_t of the update gate specifically includes:
- the activation type of the second activation operator is sigmoid;
- the second summation result is input into the second activation operator for activation, and the output result z_t of the update gate is obtained.
- inputting the input data x_t, the output data h_{t-1}, the weights, and the output result r_t of the reset gate into the operator corresponding to the current memory gate in the GRU operator to obtain the output result n_t of the current memory gate specifically includes:
- the output data h_{t-1}, the weights, and the offset are input to the sixth multiplication operator, and (W_hn * h_{t-1} + b_hn) is calculated to obtain a sixth operation result, where W_hn and b_hn are, respectively, the second weight and the second offset corresponding to the current memory gate among the weights and the offsets;
- the sixth operation result and the output result r_t of the reset gate are input to the first vector multiplication operator, and the output result r_t of the reset gate is multiplied element-wise by the sixth operation result to obtain a first dot product result;
- the third summation result is input to the third activation operator for activation to obtain the output result n_t of the current memory gate.
- inputting the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} into the operator corresponding to the output layer in the GRU operator to obtain the output result h_t specifically includes:
- the output result z_t of the update gate and the output result n_t of the current memory gate are input to the second vector multiplication operator for dot multiplication to obtain a second dot product result;
- the first difference result and the third dot product result are input to the fourth addition operator for summation to obtain the output result h_t.
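Collecting the operators described in this document, one GRU step can be summarized as the following equations. This is a summary under two assumptions: the "dot product" performed by the vector multiplication operators is element-wise (written as \odot), and \sigma denotes the sigmoid function. The weight and offset symbols are the ones named in the text.

```latex
\begin{aligned}
r_t &= \sigma\left(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}\right) \\
z_t &= \sigma\left(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}\right) \\
n_t &= \tanh\left(W_{in} x_t + b_{in} + r_t \odot \left(W_{hn} h_{t-1} + b_{hn}\right)\right) \\
h_t &= \left(n_t - z_t \odot n_t\right) + z_t \odot h_{t-1}
\end{aligned}
```

The last line is the first difference result plus the third dot product result; it is algebraically the same as (1 - z_t) \odot n_t + z_t \odot h_{t-1}.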
- the computing device includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit; and the method specifically includes:
- the controller unit obtains the input data x_t input to the input layer at time t, the output data h_{t-1} input to the hidden layer from the previous GRU, and the weights;
- the controller unit calls a pre-constructed GRU operator from a pre-packaged function library;
- the controller unit sends the input data x_t, the output data h_{t-1}, the weights, and the GRU operator to the master processing circuit;
- the master processing circuit splits the input data x_t into multiple input data blocks, distributes the multiple input data blocks and the output data h_{t-1} to the slave processing circuit, and broadcasts the weights and one part of the operators in the GRU operator to the slave processing circuit;
- the slave processing circuit inputs the received input data block, the output data h_{t-1}, and the weights into the operator corresponding to the reset gate in that part of the operators to obtain the intermediate result of the reset gate, and sends the intermediate result to the master processing circuit; the master processing circuit inputs the intermediate result into the operator corresponding to the reset gate in the other part of the operators in the GRU operator to obtain the output result r_t of the reset gate;
- the master processing circuit distributes the output result r_t of the reset gate to the slave processing circuit;
- the slave processing circuit inputs the received input data block, the output data h_{t-1}, the weights, and the output result r_t into the operator corresponding to the current memory gate in that part of the operators to obtain the intermediate result of the current memory gate;
- the intermediate result of the current memory gate is sent to the master processing circuit, and the master processing circuit inputs the intermediate result of the current memory gate into the operator corresponding to the current memory gate in the other part of the operators to obtain the output result n_t of the current memory gate;
- the master processing circuit inputs the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} into the operator corresponding to the output layer in the other part of the operators to obtain the output result h_t.
- when the controller unit obtains the input data x_t input to the input layer at time t, the output data h_{t-1} input to the hidden layer from the previous GRU, and the weights, the method further includes: the controller unit obtains the offset and sends the offset to the master processing circuit; the master processing circuit broadcasts the offset to the slave processing circuit.
- the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and obtaining the intermediate result of the output of the reset gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weight, and the offset into the first multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain a product result; it inputs the received output data h_{t-1}, the weight, and the offset into the second multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (W_ir * x_t + b_ir) of the reset gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (W_hr * h_{t-1} + b_hr) of the reset gate;
- the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate in that part of the operators, and W_ir, W_hr, b_ir, and b_hr are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate among the weights and the offsets.
- the master processing circuit includes an activation processing circuit and an addition processing circuit; obtaining the output result r_t of the reset gate specifically includes:
- the addition processing circuit inputs the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a first summation result;
- the activation processing circuit inputs the first summation result into the first activation operator, performs a sigmoid activation operation on the first summation result, and obtains the output result r_t of the reset gate;
- the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
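The division of labor for the reset gate can be sketched as two small functions, one per circuit. The function names are hypothetical, and the slave function folds the multiply, accumulate, and offset steps into a single expression for brevity.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def slave_intermediate(weight: Matrix, vec: Vector, offset: Vector) -> Vector:
    """Slave: multiply, accumulate, and add the offset, giving W*v + b."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, offset)]


def master_reset_gate(input_mid: Vector, output_mid: Vector) -> Vector:
    """Master: first addition operator, then first activation operator (sigmoid)."""
    summed = [a + b for a, b in zip(input_mid, output_mid)]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]
```

The slave would be called twice, once with (W_ir, x_t, b_ir) and once with (W_hr, h_{t-1}, b_hr); the update gate follows the same pattern with its own weights and the second addition and activation operators.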
- the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and the obtaining the intermediate output result of the update gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weight, and the offset into the third multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain a product result; it inputs the received output data h_{t-1}, the weight, and the offset into the fourth multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (W_iz * x_t + b_iz) of the update gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (W_hz * h_{t-1} + b_hz) of the update gate;
- the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate in that part of the operators, and W_iz, W_hz, b_iz, and b_hz are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the update gate among the weights and the offsets.
- the master processing circuit includes an activation processing circuit and an addition processing circuit; obtaining the output result z_t of the update gate specifically includes:
- the addition processing circuit inputs the input intermediate result and the output intermediate result of the update gate into the second addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a second summation result;
- the activation processing circuit inputs the second summation result into the second activation operator, performs a sigmoid activation operation on the second summation result, and obtains the output result z_t of the update gate;
- the second addition operator and the second activation operator are operators corresponding to the update gate in another part of the operators.
- the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit; and obtaining the intermediate output result of the current memory gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weight, and the offset into the fifth multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain a product result; it inputs the received output data h_{t-1}, the weight, and the offset into the sixth multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weight, and sums the product result with the element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain an input intermediate result (W_in * x_t + b_in) of the current memory gate, and performs an accumulation operation on the other product result to obtain an output intermediate result (W_hn * h_{t-1} + b_hn) of the current memory gate;
- the multiplication processing circuit inputs the output result r_t of the reset gate into the first vector multiplication operator, and performs a dot product operation on the output result r_t of the reset gate and the output intermediate result of the current memory gate to obtain a first dot product result;
- the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate in that part of the operators, and W_in, W_hn, b_in, and b_hn are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the current memory gate among the weights and the offsets.
- the master processing circuit includes an activation processing circuit and an addition processing circuit; obtaining the output result n_t of the current memory gate specifically includes:
- the addition processing circuit inputs the input intermediate result of the current memory gate and the first dot product result into the third addition operator, and performs a sum operation on the input intermediate result of the current memory gate and the first dot product result to obtain a third summation result;
- the activation processing circuit inputs the third summation result into the third activation operator, performs a tanh activation operation on the third summation result, and obtains the output result n_t of the current memory gate;
- the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
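For the current memory gate, the distinguishing step is that the reset gate output scales only the output intermediate result before the sum. A small sketch, with hypothetical function names and the dot product taken as element-wise:

```python
import math
from typing import List

Vector = List[float]


def slave_first_dot_product(r_t: Vector, output_mid: Vector) -> Vector:
    """Slave: first vector multiplication operator, r_t times (W_hn*h + b_hn)."""
    return [r * o for r, o in zip(r_t, output_mid)]


def master_current_memory(input_mid: Vector, first_dot: Vector) -> Vector:
    """Master: third addition operator, then third activation operator (tanh)."""
    return [math.tanh(a + b) for a, b in zip(input_mid, first_dot)]
```

When r_t is close to zero the previous hidden state contributes little, so n_t is driven mainly by the input intermediate result (W_in * x_t + b_in).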
- the master processing circuit includes an addition processing circuit;
- the slave processing circuit includes a multiplication processing circuit;
- determining the output result of the output layer specifically includes:
- the master processing circuit sends the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} to the slave processing circuit;
- the multiplication processing circuit inputs the output result z_t of the update gate and the output result n_t of the current memory gate into the second vector multiplication operator and performs a dot product operation on them to obtain a second dot product result; inputs the received output result z_t of the update gate and the output data h_{t-1} into the third vector multiplication operator and performs a dot product operation on them to obtain a third dot product result; and sends the second dot product result and the third dot product result to the master processing circuit;
- the addition processing circuit inputs the output result n_t of the current memory gate and the second dot product result into the first subtraction operator, and subtracts the second dot product result from the output result n_t of the current memory gate to obtain a first difference result; it then inputs the third dot product result and the first difference result into the fourth addition operator, and performs a sum operation on the third dot product result and the first difference result to obtain the output result h_t;
- the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer in that part of the operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators.
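The output-layer decomposition above maps directly onto two functions. The names are hypothetical, and the dot product is again assumed to be element-wise.

```python
from typing import List, Tuple

Vector = List[float]


def slave_dot_products(z_t: Vector, n_t: Vector,
                       h_prev: Vector) -> Tuple[Vector, Vector]:
    """Slave: second (z_t*n_t) and third (z_t*h_prev) dot product results."""
    second = [z * n for z, n in zip(z_t, n_t)]
    third = [z * h for z, h in zip(z_t, h_prev)]
    return second, third


def master_output(n_t: Vector, second: Vector, third: Vector) -> Vector:
    """Master: first subtraction operator, then fourth addition operator."""
    first_difference = [n - s for n, s in zip(n_t, second)]
    return [t + d for t, d in zip(third, first_difference)]
```

With z_t = 0.25, n_t = 4, and h_{t-1} = 8, the slave returns 1 and 2, and the master computes (4 - 1) + 2 = 5, which matches 0.75 * 4 + 0.25 * 8.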
- the master processing circuit includes a conversion processing circuit;
- the conversion processing circuit inputs the output result h_t into the shaping operator and the split operator in the other part of the operators, and adjusts the data format of the output result h_t to a preset format to obtain the final output result.
- a computing device is provided, the computing device is used to perform GRU operations, the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer;
- the computing device is used to obtain the input data x_t input to the input layer at time t, the output data h_{t-1} input to the hidden layer from the previous GRU, and the weights;
- the computing device is used to call a pre-constructed GRU operator from a pre-packaged function library;
- the computing device is configured to input the input data x_t, the output data h_{t-1}, and the weights into the pre-constructed GRU operator to obtain an output result h_t.
- the computing device includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit;
- the controller unit is used to obtain the input data x_t of the input layer at time t, the output data h_{t-1} that the previous GRU supplies to the hidden layer, and the weights;
- the controller unit is used to call a pre-constructed GRU operator from a pre-packaged function library;
- the controller unit is configured to send the input data x_t, the output data h_{t-1}, the weights and the GRU operator to the main processing circuit;
- the main processing circuit is used to split the input data x_t into multiple input data blocks, split the output data h_{t-1} into multiple output data blocks, distribute the input data blocks and the output data blocks to the slave processing circuits, and broadcast the weights and part of the operators in the GRU operator to the slave processing circuits;
- the slave processing circuit is used to input the received input data block, output data block h_{t-1} and weights into the operator corresponding to the reset gate in the part of the operators broadcast to it, to obtain the intermediate result of the reset gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the reset gate in the other part of the operators of the GRU operator to obtain the output result r_t of the reset gate;
- the slave processing circuit is used to input the received input data block, output data block h_{t-1} and weights into the operator corresponding to the update gate in the part of the operators broadcast to it, to obtain the intermediate result of the update gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operator corresponding to the update gate in the other part of the operators to obtain the output result z_t of the update gate;
- the master processing circuit is used to distribute the output result r_t of the reset gate to the slave processing circuits;
- the slave processing circuit is used to input the received input data block, output data block h_{t-1}, weights and output result r_t into the operator corresponding to the current memory gate in the part of the operators broadcast to it, to obtain the intermediate result of the current memory gate, and to send the intermediate result of the current memory gate to the main processing circuit; the main processing circuit inputs the intermediate result of the current memory gate into the operator corresponding to the current memory gate in the other part of the operators to obtain the output result n_t of the current memory gate;
- the main processing circuit is used to input the output result z_t of the update gate, the output result n_t of the current memory gate and the output data h_{t-1} into the operators corresponding to the output layer in both parts of the operators, to obtain the output result h_t.
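- The gate sequence above can be sketched in plain Python as a single GRU time step. This is a minimal illustration rather than the claimed hardware: the function name `gru_step`, the helper names and the parameter dictionary `P` with keys such as `"W_ir"` are hypothetical, and the master/slave partitioning is ignored so that only the arithmetic is shown.

```python
import math

def matvec(W, v):
    # Multiply matrix W (list of rows) by vector v.
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def mul(a, b):
    # Element-wise product of two vectors.
    return [p * q for p, q in zip(a, b)]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def gru_step(x_t, h_prev, P):
    # P is a hypothetical dict holding the weights and offsets of each gate.
    def affine(w_key, b_key, v):
        return add(matvec(P[w_key], v), P[b_key])

    # Reset gate and update gate: sigmoid of the summed intermediate results.
    r_t = sigmoid(add(affine("W_ir", "b_ir", x_t), affine("W_hr", "b_hr", h_prev)))
    z_t = sigmoid(add(affine("W_iz", "b_iz", x_t), affine("W_hz", "b_hz", h_prev)))
    # Current memory gate: tanh(input part + r_t * output part).
    n_t = [math.tanh(a) for a in add(affine("W_in", "b_in", x_t),
                                     mul(r_t, affine("W_hn", "b_hn", h_prev)))]
    # Output layer: h_t = z_t * h_prev + (n_t - z_t * n_t).
    zn = mul(z_t, n_t)
    zh = mul(z_t, h_prev)
    h_t = [c + (n - d) for c, n, d in zip(zh, n_t, zn)]
    return r_t, z_t, n_t, h_t
```

- With all weights and offsets set to zero, both gates evaluate to 0.5 and the current memory gate to 0, so the step simply halves the previous hidden state.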
- in one example, when acquiring the input data x_t of the input layer at time t, the output data h_{t-1} that the previous GRU supplies to the hidden layer, and the weights, the controller unit is also used to obtain an offset and send the offset to the master processing circuit; the master processing circuit is also used to broadcast the offset to the slave processing circuits.
- the operation unit includes a tree module; the tree module includes a root port and multiple branch ports; the root port of the tree module is connected to the master processing circuit, and each of the branch ports of the tree module is connected to one of the slave processing circuits;
- the tree module is configured to forward the input data blocks, output data h_{t-1}, weights, offsets and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit;
- the branch processing circuit is configured to forward the input data blocks, output data h_{t-1}, weights, offsets and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
- the K slave processing circuits are used to forward the input data blocks, output data h_{t-1}, weights, offsets and intermediate results between the master processing circuit and the plurality of slave processing circuits.
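- As an illustration of the array arrangement, the following sketch enumerates which positions of an m-by-n array of slave processing circuits belong to the K directly connected circuits. The function name `directly_connected` and the 1-indexed (row, column) encoding are assumptions made for this example. Because the two corner positions of the first column are shared with the first and m-th rows, the set holds 2n + m - 2 positions when m is at least 2; that count is an observation derived here, not a figure stated in the text.

```python
def directly_connected(m, n):
    """Return the 1-indexed (row, col) positions of the K slave circuits."""
    k = set()
    for col in range(1, n + 1):
        k.add((1, col))  # the n circuits in the first row
        k.add((m, col))  # the n circuits in the m-th row
    for row in range(1, m + 1):
        k.add((row, 1))  # the m circuits in the first column
    return k
```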
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; when obtaining the intermediate results of the reset gate,
- the multiplication processing circuit is configured to input the received input data block, weights and offset into the first multiplication operator, multiply each element value in the received input data block by the element value at the corresponding position in the weights, and add the product to the element value at the corresponding position in the offset to obtain a product result; and to input the received output data h_{t-1}, weights and offset into the second multiplication operator, multiply each element value in the received output data h_{t-1} by the element value at the corresponding position in the weights, and add the product to another element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit is configured to accumulate the product result to obtain the input intermediate result (W_ir * x_t + b_ir) of the reset gate, and to accumulate the other product result to obtain the output intermediate result (W_hr * h_{t-1} + b_hr) of the reset gate;
- the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate in the part of the operators broadcast to the slave processing circuits, and W_ir, W_hr, b_ir and b_hr are, respectively, the first weight, the second weight, the first offset and the second offset corresponding to the reset gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit; when obtaining the output result r_t of the reset gate,
- the addition processing circuit is configured to input the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and to sum the input intermediate result and the output intermediate result to obtain a first summation result;
- the activation processing circuit is configured to input the first summation result into the first activation operator, and to perform a sigmoid activation operation on the first summation result to obtain the output result r_t of the reset gate;
- the first addition operator and the first activation operator are the operators corresponding to the reset gate in the other part of the operators.
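- The division of labour for the reset gate can be sketched as follows. The sketch assumes, purely for illustration, that each data block is one sample vector handled by one slave processing circuit; the function names `slave_mac`, `master_reset_gate` and `reset_gate` are hypothetical.

```python
import math

def slave_mac(block, W, b):
    # Slave side: element-wise products accumulated per weight row, plus offset.
    return [sum(w * v for w, v in zip(row, block)) + b_j for row, b_j in zip(W, b)]

def master_reset_gate(input_mid, output_mid):
    # Master side: first addition operator, then sigmoid activation operator.
    summed = [a + c for a, c in zip(input_mid, output_mid)]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]

def reset_gate(x_blocks, h_blocks, W_ir, b_ir, W_hr, b_hr):
    results = []
    for x_blk, h_blk in zip(x_blocks, h_blocks):   # one pair of blocks per slave
        input_mid = slave_mac(x_blk, W_ir, b_ir)   # W_ir * x_t + b_ir
        output_mid = slave_mac(h_blk, W_hr, b_hr)  # W_hr * h_{t-1} + b_hr
        results.append(master_reset_gate(input_mid, output_mid))
    return results
```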
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; when obtaining the intermediate results of the update gate,
- the multiplication processing circuit is configured to input the received input data block, weights and offset into the third multiplication operator, multiply each element value in the received input data block by the element value at the corresponding position in the weights, and add the product to the element value at the corresponding position in the offset to obtain a product result; and to input the received output data h_{t-1}, weights and offset into the fourth multiplication operator, multiply each element value in the received output data h_{t-1} by the element value at the corresponding position in the weights, and add the product to another element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit is configured to accumulate the product result to obtain the input intermediate result (W_iz * x_t + b_iz) of the update gate, and to accumulate the other product result to obtain the output intermediate result (W_hz * h_{t-1} + b_hz) of the update gate;
- the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate in the part of the operators broadcast to the slave processing circuits, and W_iz, W_hz, b_iz and b_hz are, respectively, the first weight, the second weight, the first offset and the second offset corresponding to the update gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit; when obtaining the output result z_t of the update gate,
- the addition processing circuit is configured to input the input intermediate result and the output intermediate result of the update gate into the second addition operator, and to sum the input intermediate result and the output intermediate result to obtain a second summation result;
- the activation processing circuit is configured to input the second summation result into the second activation operator, and to perform a sigmoid activation operation on the second summation result to obtain the output result z_t of the update gate;
- the second addition operator and the second activation operator are the operators corresponding to the update gate in the other part of the operators.
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; when obtaining the intermediate results of the current memory gate,
- the multiplication processing circuit is configured to input the received input data block, weights and offset into the fifth multiplication operator, multiply each element value in the received input data block by the element value at the corresponding position in the weights, and add the product to the element value at the corresponding position in the offset to obtain a product result; and to input the received output data h_{t-1}, weights and offset into the sixth multiplication operator, multiply each element value in the received output data h_{t-1} by the element value at the corresponding position in the weights, and add the product to another element value at the corresponding position in the offset to obtain another product result;
- the accumulation processing circuit is used to accumulate the product result to obtain the input intermediate result (W_in * x_t + b_in) of the current memory gate, and to accumulate the other product result to obtain the output intermediate result (W_hn * h_{t-1} + b_hn) of the current memory gate;
- the multiplication processing circuit is configured to input the output result r_t of the reset gate into the first vector multiplication operator, and to perform an element-wise (dot) multiplication of r_t and the output intermediate result of the current memory gate to obtain the first dot product result;
- the fifth multiplication operator, the sixth multiplication operator and the first vector multiplication operator are the operators corresponding to the current memory gate in the part of the operators broadcast to the slave processing circuits, and W_in, W_hn, b_in and b_hn are, respectively, the first weight, the second weight, the first offset and the second offset corresponding to the current memory gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit; when obtaining the output result n_t of the current memory gate,
- the addition processing circuit is configured to input the input intermediate result of the current memory gate and the first dot product result into the third addition operator, and to sum the input intermediate result and the first dot product result to obtain a third summation result;
- the activation processing circuit is configured to input the third summation result into the third activation operator, and to perform a tanh activation operation on the third summation result to obtain the output result n_t of the current memory gate;
- the third addition operator and the third activation operator are the operators corresponding to the current memory gate in the other part of the operators.
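- The current memory gate differs from the other two gates in that the reset gate output scales the hidden-state term before the sum. A minimal sketch, with the hypothetical function name `current_memory_gate` and with the two intermediate results assumed to be already accumulated:

```python
import math

def current_memory_gate(input_mid, output_mid, r_t):
    # First vector multiplication operator: r_t * (W_hn * h_{t-1} + b_hn).
    first_dot = [r * o for r, o in zip(r_t, output_mid)]
    # Third addition operator: (W_in * x_t + b_in) + first dot product result.
    third_sum = [i + d for i, d in zip(input_mid, first_dot)]
    # Third activation operator: tanh.
    return [math.tanh(s) for s in third_sum]
```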
- the master processing circuit includes an addition processing circuit;
- the slave processing circuit includes a multiplication processing circuit;
- the main processing circuit is used to send the output result z_t of the update gate, the output result n_t of the current memory gate and the output data h_{t-1} to the slave processing circuit;
- the multiplication processing circuit is used to input the output result z_t of the update gate and the output result n_t of the current memory gate into the second vector multiplication operator and perform a dot product operation on z_t and n_t to obtain the second dot product result; to input the output result z_t of the update gate and the output data h_{t-1} into the third vector multiplication operator and perform a dot product operation on z_t and h_{t-1} to obtain the third dot product result; and to send the second dot product result and the third dot product result to the main processing circuit;
- the addition processing circuit is configured to input the output result n_t of the current memory gate and the second dot product result into the first subtraction operator and subtract the second dot product result from n_t to obtain the first difference result; and to input the third dot product result and the first difference result into the fourth addition operator and sum them to obtain the output result h_t;
- the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer in the part of the operators broadcast to the slave processing circuits, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators.
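- The output layer steps amount to h_t = z_t * h_{t-1} + (n_t - z_t * n_t), which is algebraically the same as (1 - z_t) * n_t + z_t * h_{t-1}. The sketch below follows the operator order described above; the function name `output_layer` is hypothetical and all products are element-wise.

```python
def output_layer(z_t, n_t, h_prev):
    # Slave side: second and third vector multiplication operators.
    second_dot = [z * n for z, n in zip(z_t, n_t)]
    third_dot = [z * h for z, h in zip(z_t, h_prev)]
    # Master side: first subtraction operator, then fourth addition operator.
    first_diff = [n - d for n, d in zip(n_t, second_dot)]
    return [t + f for t, f in zip(third_dot, first_diff)]
```

- For example, with z_t = [0.25, 1.0], n_t = [4.0, 3.0] and h_{t-1} = [8.0, -2.0], the first element blends the two inputs while the second keeps the previous state unchanged.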
- the main processing circuit includes a conversion processing circuit;
- the conversion processing circuit is used to input the output result h_t into a shaping operator and a split operator in the other part of the operators, and to adjust the data format of the output result h_t to a preset format to obtain the final output result.
- a neural network chip is provided, wherein the neural network chip includes the computing device provided in the ninth aspect.
- an electronic device including the chip provided in the tenth aspect.
- FIG. 1-1 is a schematic diagram of an LSTM structure.
- FIG. 1-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 1-2a is a schematic structural diagram of an arithmetic unit provided by an embodiment of the present application.
- FIG. 1-3 is a schematic structural diagram of another computing device provided by this application.
- FIG. 1-3a is a schematic structural diagram of a main processing circuit provided by the present application.
- FIG. 1-4a is a schematic structural diagram of a sending end of a tree module provided by this application.
- FIG. 1-4b is a schematic structural diagram of a receiving end of a tree module provided by this application.
- FIG. 1-4c is a schematic diagram of the binary tree structure provided by this application.
- FIG. 1-5 is a structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 1-6 is a schematic flowchart of the LSTM calculation method provided by an embodiment of the present application.
- FIG. 1-7 is a structural diagram of a combined processing device provided by an embodiment of the present application.
- FIG. 1-8 is a structural diagram of another combined processing device provided by an embodiment of the present application.
- FIG. 1-9 is a schematic structural diagram of a board provided by an embodiment of the present application.
- FIG. 2-1 is a schematic diagram of a GRU structure.
- FIG. 2-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 2-2a is a schematic structural diagram of an arithmetic unit provided by an embodiment of the present application.
- FIG. 2-3 is a schematic structural diagram of another computing device provided by the present application.
- FIG. 2-3a is a schematic structural diagram of a main processing circuit provided by the present application.
- FIG. 2-3b is a schematic structural diagram of the slave processing circuit provided by the present application.
- FIG. 2-4a is a schematic structural diagram of a sending end of a tree module provided by this application.
- FIG. 2-4b is a schematic structural diagram of a receiving end of a tree module provided by this application.
- FIG. 2-4c is a schematic diagram of the binary tree structure provided by this application.
- FIG. 2-5 is a structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 2-6 is a schematic flowchart of a GRU calculation method provided by an embodiment of the present application.
- FIG. 2-7 is a structural diagram of a combined processing device provided by an embodiment of the present application.
- FIG. 2-8 is a structural diagram of another combined processing device provided by an embodiment of the present application.
- FIG. 2-9 is a schematic structural diagram of a board provided by an embodiment of the present application.
- FIG. 1-1 is a schematic diagram of an LSTM. As shown in FIG. 1-1, the LSTM includes: an input gate, a forget gate, an update status unit, and an output gate.
- the corresponding calculation formulas are as follows:
- xt is the input data at time t
- ht-1 represents the output data at time t-1
- Wf, Wi, Wg, and Wo represent the weight vectors corresponding to the forget gate, input gate, update status unit, and output gate, respectively.
- bf, bi, bc and bo represent the offsets corresponding to the forget gate, input gate, update state unit and output gate, respectively;
- ft represents the output of the forget gate, which is multiplied by the state unit value at time t-1 to selectively forget the past state unit value;
- it represents the output of the input gate, which is multiplied by the candidate state value obtained at time t to selectively add the candidate state value at time t to the update state unit;
- gt represents the candidate state value calculated at time t;
- ct represents the new state value obtained by selectively forgetting the state value at time t-1 and selectively adding the candidate state value at time t; ct is used when calculating the final output and is passed on to the next moment;
- ot represents the selection condition for the part of the state unit at time t that needs to be output as the result;
- ht represents the output at time t, which is also passed on to the next moment (that is, time t+1);
- ⊙ denotes the element-wise product of vectors; σ is the sigmoid function, calculated as σ(x) = 1 / (1 + e^(-x)).
- this application puts Wf, Wi, Wg and Wo into a matrix W, and bf, bi, bc and bo into a matrix b.
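- For reference, the standard LSTM formulation that matches the symbol definitions above can be written as follows. This is a sketch under an assumption: the concatenated form [h_{t-1}, x_t] is one common convention, and an implementation may equally apply separate weights to x_t and h_{t-1}. The state update and output lines agree with the multiplication and addition steps described for the main processing circuit.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
g_t &= \tanh\left(W_g\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}
\end{aligned}
```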
- FIG. 1-2 shows a computing device provided by this application.
- a computing device is provided for performing LSTM operations.
- the computing device includes: a controller unit 11, an arithmetic unit 12 and a storage unit 10, wherein the controller unit 11 is connected to the arithmetic unit 12 and to the storage unit 10, and the arithmetic unit 12 includes a master processing circuit 101 and a slave processing circuit 102 (there may be one or more slave processing circuits; multiple slave processing circuits are preferred);
- the above-mentioned main processing circuit itself includes a storage (for example, a memory or a register), which can store some data of the main processing circuit; the slave processing circuit may optionally carry a storage as well.
- LSTM includes: input gate, forget gate, output gate and update status gate;
- the storage unit 10 is used to store the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, output state value Ct;
- the controller unit 11 is used to obtain the input data Xt, weight data, input state value Ct-1, input result ht-1 and LSTM operator, and to send the input data Xt, weight data, input state value Ct-1, input result ht-1 and LSTM operator to the arithmetic unit;
- the arithmetic unit 12 is used to perform the input gate operation, forget gate operation, output gate operation and update status gate operation based on the input data Xt, weight data, input result ht-1 and LSTM operator to obtain the output result of each gate, and to obtain the output data ht and the output state value Ct according to the input state value Ct-1 and the output result of each gate.
- the aforementioned controller unit is specifically configured to construct a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator based on the LSTM operator;
- the main processing circuit is specifically used to reorder the input data Xt, the weight data and the input state value according to a sorting operator, where the weight data includes the weight data of each gate; and then, according to a split operator, to broadcast the weight data of each gate and the multiplication operator to the slave processing circuits, split the input data and the input state value into multiple input data blocks and multiple input state data blocks, and distribute the multiple input data blocks and multiple input state data blocks to the slave processing circuits;
- the slave processing circuit is configured to multiply the multiple input data blocks by the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate, to multiply the multiple input state data blocks by the weight data of each gate according to the multiplication operator to obtain the state intermediate result of each gate, and to send the intermediate result and the state intermediate result of each gate to the main processing circuit;
- the gates are relatively independent of one another, and so are their calculation results; that is, each gate has its own weight data, for example Wf, Wi, Wg and Wo are the weight data of the four gates, respectively.
- the foregoing multiplying the multiple input data blocks with the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate may specifically include:
- the state intermediate result of each gate is obtained in a similar way to the intermediate result of each gate, and the details are not repeated here.
- the main processing circuit is used to sort the intermediate results of each gate according to the sorting operator to obtain the sorting result of each gate, and to perform an offset operation on the sorting result of each gate according to the addition operator to obtain the operation result of each gate; to sort the state intermediate results of each gate according to the sorting operator to obtain the state sorting result of each gate, and to perform an offset operation on the state sorting result of each gate according to the addition operator to obtain the state operation result of each gate; and to add the operation result and the state operation result of each gate correspondingly according to the addition operator and then perform subsequent processing to obtain the output result of each gate.
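- The master-side steps for one time step can be sketched as follows. Several points are assumptions for illustration only: the function name `gate_outputs`, the single-letter gate keys, the use of separate offsets for the two operation results, and the choice of sigmoid for the forget, input and output gates with tanh for the update status gate (the standard LSTM convention). Sorting is omitted because the results are assumed to arrive already ordered.

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def tanh(v):
    return [math.tanh(a) for a in v]

# Assumed subsequent processing per gate (standard LSTM convention).
SUBSEQUENT = {"f": sigmoid, "i": sigmoid, "o": sigmoid, "g": tanh}

def gate_outputs(mid, state_mid, offset, state_offset):
    """Each argument is a hypothetical dict mapping a gate key to a vector."""
    out = {}
    for gate in ("f", "i", "g", "o"):
        # Offset operation on the intermediate result -> operation result.
        op = [a + b for a, b in zip(mid[gate], offset[gate])]
        # Offset operation on the state intermediate result -> state operation result.
        state_op = [a + b for a, b in zip(state_mid[gate], state_offset[gate])]
        # Corresponding addition, then the subsequent activation.
        summed = [a + b for a, b in zip(op, state_op)]
        out[gate] = SUBSEQUENT[gate](summed)
    return out
```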
- the technical solution provided in this application arranges the operation unit in a master-slave structure.
- the input data at the current moment and the output data of the forget gate are split and processed in parallel, so that the master processing circuit and the slave processing circuits can operate in parallel on the computation-heavy parts, thereby increasing the operation speed, saving operation time and reducing power consumption.
- the main processing circuit is specifically configured to multiply the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain the first result, to multiply the output result gt of the update status gate by the output result it of the input gate according to the multiplication operator to obtain the second result, and to add the first result and the second result to obtain the output state value Ct.
- the main processing circuit is specifically configured to perform an activation operation on the output state value Ct according to the activation operator to obtain an activation result, and multiply the output result Ot of the output gate and the activation result to obtain the output result ht.
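- The state and output update performed by the main processing circuit can be sketched as below. The function name `lstm_state_update` is hypothetical, all products are element-wise, and tanh is assumed as the activation applied to the output state value.

```python
import math

def lstm_state_update(c_prev, f_t, i_t, g_t, o_t):
    first = [c * f for c, f in zip(c_prev, f_t)]     # first result: Ct-1 * ft
    second = [g * i for g, i in zip(g_t, i_t)]       # second result: gt * it
    c_t = [a + b for a, b in zip(first, second)]     # output state value Ct
    activated = [math.tanh(c) for c in c_t]          # activation of Ct (tanh assumed)
    h_t = [o * a for o, a in zip(o_t, activated)]    # output result ht = Ot * activation
    return c_t, h_t
```

- The returned pair is what the next time step receives as its input state value and input result.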
- the subsequent processing specifically includes:
- the subsequent processing is the activation operation tanh function.
- the main processing circuit is further configured to use the output data ht as the input result at the next moment and the output state value Ct as the input state value at the next moment.
- the above LSTM may contain multiple hidden layers, the number of hidden layers h being an integer greater than or equal to 2; the h-th hidden layer may be any intermediate hidden layer of the LSTM. For multiple LSTM operations, the forward operation is implemented as follows:
- the forward operation at the previous time t-1 is performed and the output result at time t-1 is obtained;
- the operator at the current time t uses the output result at time t-1 as the input data of the forget gate at time t;
- the forget gate uses the sigmoid function to determine the pass rate of the output result at time t-1, which yields the output result of the forget gate at time t; this output result is fed to the neurons as one part and operated on with the weights, and the input data of the input layer at time t is fed to the neurons as the other part;
- the two parts of the input neurons are each multiplied by the weights to obtain two operation results, and the two operation results are added to obtain the output result at time t.
- the above computing device may further include a direct memory access unit 50; the storage unit 10 may include one or any combination of a register and a cache; specifically, the cache is used to store the calculation operator, and the register is used to store the input data and scalars; the cache is a high-speed temporary cache.
- the direct memory access unit 50 is used to read or store data from the storage unit 10.
- the controller unit includes: an operator storage unit 110, an operator processing unit 111, and a storage queue unit 113;
- the operator storage unit 110 is configured to store the calculation operator associated with the LSTM operation
- the operator processing unit 111 is configured to analyze the calculation operator to obtain multiple operation operators
- the storage queue unit 113 is used to store an operator queue.
- the operator queue includes: a plurality of operation operators or a plurality of calculation operators to be executed in the order of the queue.
- the controller unit may further include:
- the dependency processing unit 108, configured to determine, when there are multiple operation operators, whether the first operation operator has an association relationship with the zeroth operation operator that precedes it; if the first operation operator has an association relationship with the zeroth operation operator, the first operation operator is cached in the operator storage unit and, after execution of the zeroth operation operator is completed, the first operation operator is extracted from the operator storage unit and transmitted to the arithmetic unit;
- determining whether the first operation operator has an association relationship with the zeroth operation operator that precedes it includes:
- extracting the first storage address interval of the data required by the first operation operator and the zeroth storage address interval of the matrix required by the zeroth operation operator; if the first storage address interval and the zeroth storage address interval overlap, it is determined that the first operation operator has an association relationship with the zeroth operation operator; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first operation operator and the zeroth operation operator have no association relationship.
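- The address-interval test used by the dependency processing unit can be sketched as follows. The representation of an interval as a half-open (start, end) pair and the function names `has_association` and `issue_order` are assumptions made for this example.

```python
def has_association(first_interval, zeroth_interval):
    """True when the two half-open address intervals (start, end) overlap."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end

def issue_order(zeroth, first):
    """Each operator is a hypothetical (name, interval) pair.

    Returns the events in the order the controller would produce them.
    """
    events = [f"issue {zeroth[0]}"]
    if has_association(first[1], zeroth[1]):
        # Associated: cache the first operator until the zeroth one completes.
        events += [f"cache {first[0]}", f"complete {zeroth[0]}", f"issue {first[0]}"]
    else:
        # Not associated: the first operator need not wait.
        events.append(f"issue {first[0]}")
    return events
```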
- the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
- multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. It should be noted that the K slave processing circuits shown in FIG. 1-3 include only the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column; that is, the K slave processing circuits are the slave processing circuits that are directly connected to the master processing circuit among the multiple slave processing circuits.
- the K slave processing circuits are used to forward data (the data may be input data blocks, input state data blocks, intermediate results, state intermediate results, etc.) and operators between the master processing circuit and the multiple slave processing circuits.
- the main processing circuit may further include: one or any combination of conversion processing circuit 110, activation processing circuit 111, and addition processing circuit 112;
- the conversion processing circuit 110 is used for data conversion processing; specifically, it converts the data received by the main processing circuit (including but not limited to: input data Xt, weight data (the weights of each gate), input state value Ct-1 and input result ht-1) between a first data structure and a second data structure (for example, conversion between continuous data and discrete data, or conversion between floating-point data and fixed-point data).
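- As one illustration of a conversion between floating-point and fixed-point data, the sketch below uses a signed 16-bit format with 8 fractional bits. The format, the saturation behaviour and the function names `float_to_fixed` and `fixed_to_float` are assumptions; the text does not specify the formats the conversion processing circuit supports.

```python
def float_to_fixed(value, frac_bits=8, total_bits=16):
    """Quantize a float to a signed fixed-point integer, saturating on overflow."""
    scaled = int(round(value * (1 << frac_bits)))
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, scaled))

def fixed_to_float(raw, frac_bits=8):
    """Recover the float value represented by a fixed-point integer."""
    return raw / (1 << frac_bits)
```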
- the activation processing circuit 111 is used to execute the activation operation of the data in the main processing circuit
- the addition processing circuit 112 is used to perform an addition operation or an accumulation operation.
- the operation operator is a matrix multiplying matrix operator, an accumulation operator, an activation operator, and other calculation operators.
- the operation unit includes a tree module 40; the tree module includes a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and each of the branch ports of the tree module is connected to one of the multiple slave processing circuits;
- the tree module has sending and receiving functions: as shown in FIG. 1-4a, the tree module performs the sending function, and as shown in FIG. 1-4b, the tree module performs the receiving function.
- the tree module is used to forward data between the master processing circuit and the plurality of slave processing circuits (the data may be input data blocks, input state data blocks, intermediate results, state intermediate results, etc.).
- the tree module is an optional structure of the computing device; it may include at least one layer of nodes, each node being a wiring structure with a forwarding function, and the node itself need not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
- the tree module may be an n-ary tree structure, for example the binary tree structure shown in FIG. 1-4c, or of course a ternary tree structure, where n may be an integer greater than or equal to 2.
- the specific implementation of the present application does not limit the specific value of the above-mentioned n.
- the above-mentioned number of layers may also be 2.
- the slave processing circuit may be connected to nodes of layers other than the penultimate layer, for example to the nodes of the last layer shown in FIG. 1-4c.
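- The forwarding role of the tree module can be sketched as follows. The function names `tree_layers` and `broadcast` are hypothetical, and the convention that a tree with a single slave needs zero layers of nodes is an assumption chosen to match the remark that the tree module is then not required.

```python
def tree_layers(num_slaves, arity=2):
    """Smallest number of node layers whose last layer offers enough branch ports."""
    layers, ports = 0, 1
    while ports < num_slaves:
        ports *= arity
        layers += 1
    return layers

def broadcast(payload, slave_ids, arity=2):
    """Forward payload unchanged from the root to every slave.

    Nodes only forward; they perform no computation on the payload.
    """
    def descend(ids):
        if len(ids) <= 1:
            return {i: payload for i in ids}
        step = -(-len(ids) // arity)  # ceiling division: slaves per child node
        received = {}
        for start in range(0, len(ids), step):
            received.update(descend(ids[start:start + step]))
        return received
    return descend(list(slave_ids))
```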
- the operation unit may carry a separate buffer; as shown in FIG. 1-2a, it may include a neuron buffer unit 63, which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
- the operation unit may further include: a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
- the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIG. 1-5.
- the branch processing circuit 103 may include a memory. As shown in FIG. 1-5, the size of the memory of the branch processing circuit 103 may be between 2 and 2.5 times the maximum data capacity to be stored by a single slave processing circuit.
- with this arrangement, the slave processing circuits need no memory of their own. For one branch processing circuit, only 2.5*R needs to be provided (R being the capacity required by a single slave processing circuit); without a branch processing circuit, 4*R would need to be provided, and the utilization of those registers would still be low. This structure can therefore effectively reduce the total memory capacity and the cost.
- the branch processing circuit is used to forward the data between the master processing circuit and the plurality of slave processing circuits (the data may be an input data block, an input state data block, an intermediate result, an intermediate state result, etc.).
- each intermediate result carries a position (w, i), where w is the value identifying the input data block and i is the column value of the weight column element calculated with that input data block. The main processing circuit determines that the position of the intermediate result in the operation result of the corresponding gate is (w, i). For example, if the intermediate result calculated from input data block 1 and the first column of weights is labeled (1, 1), the main processing circuit arranges that intermediate result in the first row and first column of the operation result of the corresponding gate.
- the present application also provides an LSTM operation method.
- the method is applied to a computing device.
- the LSTM includes: an input gate, a forget gate, an output gate, and an update status gate.
- the computing device includes an arithmetic unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data Xt, weight data, output data ht, input state value Ct-1, input result ht-1, and output state value Ct; the method includes the following steps:
- Step S601: the controller unit obtains the input data Xt, weight data, input state value Ct-1, input result ht-1, and LSTM operator, and sends the input data Xt, weight data, input state value Ct-1, input result ht-1, and LSTM operator to the arithmetic unit.
- Step S602: the arithmetic unit performs the input gate operation, forget gate operation, output gate operation, and update state gate operation based on the input data Xt, weight data, input result ht-1, and LSTM operator to obtain the output result of each gate, and obtains the output data ht and the output state value Ct based on the input state value Ct-1 and the output result of each gate.
- the arithmetic unit includes: a master processing circuit and a slave processing circuit; the arithmetic unit performing the input gate operation, forget gate operation, output gate operation, and update state gate operation according to the input data Xt, weight data, input result ht-1 and LSTM operator to obtain the output result of each gate specifically includes:
- the controller unit constructs a plurality of split operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator according to the LSTM operator;
- the main processing circuit reorders the input data Xt, weight data and input state value according to a sorting operator, where the weight data includes the weight data of each gate; then, according to a split operator, it broadcasts the weight data of each gate and the multiplication operator to the slave processing circuit, splits the input data and the input state value into multiple input data blocks and multiple input state data blocks, and distributes the multiple input data blocks and multiple input state data blocks to the slave processing circuit;
- the slave processing circuit multiplies the input data blocks by the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate, multiplies the input state data blocks by the weight data of each gate according to the multiplication operator to obtain the intermediate state result of each gate, and sends the intermediate result and the intermediate state result of each gate to the main processing circuit;
- the main processing circuit sorts the intermediate result of each gate according to the sorting operator to obtain the sorting result of each gate, and performs an offset operation on the sorting result of each gate according to the addition operator to obtain the operation result of each gate; it sorts the intermediate state result of each gate according to the sorting operator to obtain the state sorting result of each gate, and performs an offset operation on the state sorting result of each gate according to the addition operator to obtain the state operation result of each gate; it then adds the operation result of each gate to the corresponding state operation result according to the addition operator and applies subsequent processing to obtain the output result of each gate.
- obtaining the output state value Ct according to the input state value Ct-1 and the output result of each gate specifically includes:
- the main processing circuit multiplies the input state value Ct-1 by the output result ft of the forget gate according to the multiplication operator to obtain the first result, multiplies the output result gt of the update state gate by the output result it of the input gate according to the multiplication operator to obtain the second result, and adds the first result and the second result to obtain the output state value Ct.
- obtaining the output data ht according to the input state value Ct-1 and the output result of each gate specifically includes:
- the main processing circuit performs an activation operation on the output state value Ct according to an activation operator to obtain an activation result, and multiplies the output result Ot of the output gate by the activation result to obtain the output data ht.
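The state update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `lstm_state_update` is hypothetical, all products are taken element-wise, and the activation is assumed to be tanh (the text only says "activation operation").

```python
import math

def lstm_state_update(c_prev, f_t, i_t, g_t, o_t, activation=math.tanh):
    """Combine the gate outputs into the output state Ct and the output data ht."""
    # first result: Ct-1 multiplied by the forget gate output ft
    first = [c * f for c, f in zip(c_prev, f_t)]
    # second result: update state gate output gt multiplied by input gate output it
    second = [g * i for g, i in zip(g_t, i_t)]
    # output state value Ct is the sum of the two results
    c_t = [a + b for a, b in zip(first, second)]
    # ht: output gate output Ot times the activated Ct (tanh is an assumption)
    h_t = [o * activation(c) for o, c in zip(o_t, c_t)]
    return c_t, h_t

c_t, h_t = lstm_state_update(
    c_prev=[1.0, -2.0], f_t=[0.5, 0.0], i_t=[1.0, 0.5], g_t=[0.25, 1.0], o_t=[1.0, 0.0]
)
print(c_t)  # [0.75, 0.5]
```

In the device, these element-wise steps run on the main processing circuit after the gate outputs have been assembled.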
- This application also discloses an LSTM device, which includes one or more of the computing devices mentioned in this application, and is used to obtain data to be calculated and control information from other processing devices, perform a specified LSTM operation, and pass the execution result to peripheral devices through an I/O interface.
- Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, wifi interfaces, and servers.
- the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support the operation of larger-scale convolutional neural network training.
- the interconnection method can be any interconnection topology.
- the LSTM device has high compatibility and can be connected to various types of servers through the PCIE interface.
- the present application also discloses a combined processing device, which includes the above-mentioned LSTM device, universal interconnection interface, and other processing devices.
- the LSTM computing device interacts with other processing devices to complete the operation specified by the user.
- Figures 1-7 are schematic diagrams of combined processing devices.
- Other processing devices include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processor (GPU), or a neural network processor.
- the number of processors included in other processing devices is not limited.
- Other processing devices serve as the interface between the LSTM computing device and external data and control, including data handling, and complete the basic control of starting and stopping the LSTM computing device; other processing devices can also cooperate with the LSTM computing device to complete the computing task.
- the universal interconnection interface is used to transfer data and control operators between the LSTM device and other processing devices.
- the LSTM device obtains the required input data from other processing devices and writes it to the on-chip storage device of the LSTM device; it can obtain control operators from other processing devices and write them to the on-chip control buffer of the LSTM device; it can also read the data in the storage module of the LSTM device and transfer it to other processing devices.
- the structure may further include a storage device, which is respectively connected to the LSTM device and the other processing device.
- the storage device is used to store data of the LSTM device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of the LSTM device or the other processing devices.
- the combined processing device can be used as an SOC on-chip system for mobile phones, robots, drones, video surveillance equipment, etc., effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
- the universal interconnection interface of the combined processing device is connected to certain components of the equipment, such as a camera, monitor, mouse, keyboard, network card, or wifi interface.
- a chip is also claimed, which includes the above-mentioned LSTM device or combined processing device.
- a chip packaging structure is claimed, which includes the above chip.
- a board card is claimed, which includes the above chip packaging structure.
- FIG. 1-9 provides a board card.
- the board card may also include other supporting components.
- the supporting components include but are not limited to: a storage device 390, an interface device 391, and a control device 392;
- the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
- the storage device may include multiple groups of storage units 393. Each group of storage units is connected to the chip by a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
- DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
- the storage device may include 4 groups of storage units. Each group of storage units may include multiple DDR4 chips. In one embodiment, the chip may include four 72-bit DDR4 controllers. Of the 72 bits of each DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
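The 25600 MB/s figure follows from the transfer rate and the data width. A minimal worked calculation, assuming that "DDR4-3200" denotes 3200 million transfers per second and that only the 64 data bits count toward bandwidth (the function name is illustrative):

```python
def ddr_bandwidth_mb_s(mega_transfers_per_s, data_bits):
    """Theoretical peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * data_bits // 8

# DDR4-3200 with a 64-bit data path; the 8 ECC bits carry no payload
per_group = ddr_bandwidth_mb_s(3200, 64)
print(per_group)  # 25600
```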
- each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transfer data twice in one clock cycle.
- a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
- the interface device is electrically connected to the chip in the chip packaging structure.
- the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
- the interface device may also be other interfaces.
- the present application does not limit the specific form of the other interfaces described above, as long as the interface unit can implement the transfer function.
- the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
- the control device is electrically connected to the chip.
- the control device is used to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a microcontroller (Micro Controller Unit, MCU).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the chip may be in different working states, such as multi-load and light-load states.
- the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
- an electronic device is claimed, which includes the above-mentioned board card.
- Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
- the vehicles include airplanes, ships, and/or automobiles;
- the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and
- the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and/or electrocardiograph.
- FIG. 2-1 is a schematic diagram of a GRU provided by an embodiment of the present application.
- the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate and an output layer. The input layer is connected to the reset gate, the update gate and the current memory gate.
- the hidden layer of the previous GRU unit is connected to the reset door, update door, current memory door and output layer of the current GRU unit.
- GRU is a variant of LSTM (Long Short-Term Memory network).
- the output result $z_t$ of the update gate in FIG. 2-1 is used to control the degree to which the state information of the previous moment is brought into the current state.
- the output result of the reset gate r t is used
- FIG. 2-2 is a computing device provided by an embodiment of the present application.
- the computing device is used to perform GRU operations.
- the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate and an output layer;
- the computing device is used to obtain the input data $x_t$ of the input layer at time t, the output data $h_{t-1}$ input from the hidden layer of the previous GRU, and the weights;
- the computing device is used to call a pre-constructed GRU operator from a pre-packaged function library;
- the computing device is configured to input the input data $x_t$, the output data $h_{t-1}$, and the weights into the pre-constructed GRU operator to obtain an output result $h_t$.
- the technical solution provided by this application pre-compiles the GRU calculation process into corresponding operators, so that the GRU operation is performed on the MLU without the CPU having to perform instruction decoding and data memory access, which increases the GRU operation speed and improves operation efficiency.
- the computing device when inputting the input data x t , the output data h t-1 and the weight value into the pre-constructed GRU operator, and obtaining the output result h t is specifically used for:
- the input output data h -1 is a preset initial value
- the GRU is a multi-layer GRU
- the input The output data h t-1 is an initialized vector.
- the main processing circuit inputs the output result $h_t$ of each layer into the shaping operator and the split operator to obtain the final output result, so the output data $h_{t-1}$ that the current GRU layer receives from the hidden layer of the previous GRU has, in essence, already been split into a plurality of output data blocks. The main processing circuit therefore does not need to split the output data $h_{t-1}$; it only needs to distribute the received output data $h_{t-1}$ to the corresponding slave processing circuits, and the calculation of the current GRU layer can then proceed.
- the operator is a mapping from one function space to another function space.
- the machine learning processor MLU is applied to machine learning operations, where machine learning operations include neural network operations, k-means operations, support vector machine operations, etc.
- the machine learning processor MLU may specifically include NPU (Neural-Network Processing Unit, Neural network processor unit), DSP (Digital Signal Process, digital signal processing), one or a combination of field programmable gate array (Field-Programmable Gate Array, FPGA) chips.
- each operation process of the GRU is compiled into its corresponding operator in advance, and the multiple compiled operators are pre-packaged in the function library.
- the corresponding GRU operator is retrieved from the pre-packaged function library through the function interface, the input data is input into the retrieved GRU operator, the operation process corresponding to the GRU operator is executed, and the output result is obtained.
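- the operation of the GRU is as follows:
- $r_t = \mathrm{sigmoid}(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr})$;
- $z_t = \mathrm{sigmoid}(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz})$;
- $n_t = \tanh(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn}))$;
- $h_t = (1 - z_t) \odot n_t + z_t \odot h_{t-1}$.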
- $x_t$ is the input data at time t
- $h_{t-1}$ is the output data input from the hidden layer of the previous GRU
- $r_t$ is the output of the reset gate
- $z_t$ is the output of the update gate
- $n_t$ is the output of the current memory gate
- $h_t$ represents the output at time t
- $W_r$, $W_z$ and $W_n$ represent the weights corresponding to the reset gate, update gate, and current memory gate, respectively
- $b_r$, $b_z$ and $b_n$ represent the offsets corresponding to the reset gate, update gate, and current memory gate, respectively
- $W_{ir}$, $W_{hr}$, $b_{ir}$ and $b_{hr}$ are the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate, respectively
- $W_{iz}$, $W_{hz}$, $b_{iz}$ and $b_{hz}$ are the first weight, the second weight, the first offset, and the second offset corresponding to the update gate, respectively
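As a concrete reading of the GRU operation and symbols above, the following is a minimal pure-Python sketch of one GRU step. It is an illustration, not the device's operator implementation: the helper names (`affine`, `gru_step`), the dictionary layout of the parameters, and the all-zero example weights are assumptions chosen so the result is easy to check by hand.

```python
import math

def affine(W, b, v):
    """Multiplication operator W*v + b for a row-major matrix W and vector v."""
    return [sum(w * e for w, e in zip(row, v)) + bias for row, bias in zip(W, b)]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step; p maps names such as 'W_ir' or 'b_hn' to weights and offsets."""
    xr, hr = affine(p["W_ir"], p["b_ir"], x_t), affine(p["W_hr"], p["b_hr"], h_prev)
    xz, hz = affine(p["W_iz"], p["b_iz"], x_t), affine(p["W_hz"], p["b_hz"], h_prev)
    xn, hn = affine(p["W_in"], p["b_in"], x_t), affine(p["W_hn"], p["b_hn"], h_prev)
    r_t = [sigmoid(a + b) for a, b in zip(xr, hr)]                  # reset gate
    z_t = [sigmoid(a + b) for a, b in zip(xz, hz)]                  # update gate
    n_t = [math.tanh(a + r * b) for a, r, b in zip(xn, r_t, hn)]    # current memory gate
    return [(1.0 - z) * n + z * h for z, n, h in zip(z_t, n_t, h_prev)]

# Hypothetical parameters: input size 3, hidden size 2, everything zero
params = {}
for g in "rzn":
    params["W_i" + g] = [[0.0] * 3 for _ in range(2)]
    params["W_h" + g] = [[0.0] * 2 for _ in range(2)]
    params["b_i" + g] = [0.0, 0.0]
    params["b_h" + g] = [0.0, 0.0]

h_t = gru_step([1.0, 2.0, 3.0], [1.0, -2.0], params)
print(h_t)  # [0.5, -1.0]
```

With all weights and offsets zero, both gates output 0.5 and the current memory gate outputs 0, so the step simply halves the previous hidden state.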
- because each step of the GRU operation is implemented by a constructed operator in this application, if the weights and offsets were spliced into vectors, the spliced weights and offsets would have to be split again each time an operator is called, in order to obtain the weights and offsets that operator needs.
- this splicing and splitting is wasted work and slows the operation. Therefore, when acquiring the weights and offsets, this application splits them in advance into the weight blocks and offset blocks corresponding to the reset gate, the update gate and the current memory gate, and attaches to each weight block and offset block identification information indicating its gate and whether it corresponds to the input data $x_t$ or the output data $h_{t-1}$.
- when the output of a gate is calculated, the weights and offsets corresponding to that gate are looked up by the identification information and applied directly to the corresponding input data and output data. This ensures that the GRU operation is performed on the MLU, while increasing the GRU operation speed and improving operation efficiency.
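One way to picture the pre-splitting with identification information is a lookup table keyed by gate and data source. The sketch below is hypothetical: the stacking order (reset, update, current memory), the key format `(gate, source)` and the function name `presplit` are assumptions made for illustration; the text does not specify how the identification information is encoded.

```python
GATES = ("r", "z", "n")  # reset, update, current memory (assumed stacking order)

def presplit(stacked_rows, hidden, source):
    """Split gate-stacked rows once, tagging each block with (gate, source)."""
    return {
        (gate, source): stacked_rows[k * hidden:(k + 1) * hidden]
        for k, gate in enumerate(GATES)
    }

# Hypothetical stacked input-side weights: 3 gates, hidden size 2, input size 1
W_i = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
table = presplit(W_i, hidden=2, source="i")

# At gate-evaluation time the block is found by its tag; nothing is re-split
W_iz = table[("z", "i")]
print(W_iz)  # [[3.0], [4.0]]
```

The split happens once when the weights are acquired, so each operator call costs only a dictionary lookup.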
- constructing the operators corresponding to the reset gate specifically includes constructing: a first multiplication operator, which computes ($W_{ir} x_t + b_{ir}$); a second multiplication operator, which computes ($W_{hr} h_{t-1} + b_{hr}$); a first addition operator, used to sum the output results of the first multiplication operator and the second multiplication operator; and a first activation operator, of activation type sigmoid, used to activate the output result of the first addition operator to obtain the output $r_t$ of the reset gate.
- constructing the operators corresponding to the update gate specifically includes constructing: a third multiplication operator, which computes ($W_{iz} x_t + b_{iz}$); a fourth multiplication operator, which computes ($W_{hz} h_{t-1} + b_{hz}$); a second addition operator, used to sum the output results of the third multiplication operator and the fourth multiplication operator; and a second activation operator, of activation type sigmoid, used to activate the output result of the second addition operator to obtain the output $z_t$ of the update gate.
- constructing the operators corresponding to the current memory gate specifically includes constructing: a fifth multiplication operator, which computes ($W_{in} x_t + b_{in}$); a sixth multiplication operator, which computes ($W_{hn} h_{t-1} + b_{hn}$); a first vector multiplication operator, which performs element-wise (dot) multiplication of the output result of the sixth multiplication operator with $r_t$ to compute $r_t \odot (W_{hn} h_{t-1} + b_{hn})$; a third addition operator, used to sum the output results of the fifth multiplication operator and the first vector multiplication operator; and a third activation operator, of activation type tanh, used to activate the output result of the third addition operator to obtain the output result $n_t$ of the current memory gate.
- constructing the operators corresponding to the output layer specifically includes constructing: a second vector multiplication operator, which performs element-wise multiplication of $z_t$ and $n_t$ to compute $z_t \odot n_t$; a first subtraction operator, used to subtract $z_t \odot n_t$ from $n_t$; a third vector multiplication operator, which computes $z_t \odot h_{t-1}$; and a fourth addition operator, used to sum the output results of the first subtraction operator and the third vector multiplication operator to obtain the output result $h_t$.
- before calling the pre-constructed GRU operator from the pre-packaged function library, the computing device is also used to obtain the offsets.
- the computing device is specifically configured to: acquire the first multiplication operator, the second multiplication operator, the first addition operator and the first activation operator corresponding to the reset gate in the GRU operator, the activation type of the first activation operator being sigmoid; input the input data $x_t$, the weights and the offsets into the first multiplication operator and calculate ($W_{ir} x_t + b_{ir}$) to obtain the first operation result, where $W_{ir}$ and $b_{ir}$ are the first weight and the first offset corresponding to the reset gate among the weights and offsets, respectively; input the output data $h_{t-1}$, the weights and the offsets into the second multiplication operator and calculate ($W_{hr} h_{t-1} + b_{hr}$) to obtain the second operation result, where $W_{hr}$ and $b_{hr}$ are the second weight and the second offset corresponding to the reset gate among the weights and offsets, respectively; input the first operation result and the second operation result into the first addition operator for summation to obtain the first summation result; and input the first summation result into the first activation operator for activation to obtain the output result $r_t$ of the reset gate.
- the computing device is specifically configured to: obtain the third multiplication operator, the fourth multiplication operator, the second addition operator and the second activation operator corresponding to the update gate in the GRU operator, the activation type of the second activation operator being sigmoid; input the input data $x_t$, the weights and the offsets into the third multiplication operator and calculate ($W_{iz} x_t + b_{iz}$) to obtain the third operation result, where $W_{iz}$ and $b_{iz}$ are the first weight and the first offset corresponding to the update gate among the weights and offsets, respectively; input the output data $h_{t-1}$, the weights and the offsets into the fourth multiplication operator and calculate ($W_{hz} h_{t-1} + b_{hz}$) to obtain the fourth operation result, where $W_{hz}$ and $b_{hz}$ are the second weight and the second offset corresponding to the update gate among the weights and offsets, respectively; input the third operation result and the fourth operation result into the second addition operator for summation to obtain the second summation result; and input the second summation result into the second activation operator for activation to obtain the output result $z_t$ of the update gate.
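- the computing device is specifically configured to: obtain the fifth multiplication operator, the sixth multiplication operator, the first vector multiplication operator, the third addition operator and the third activation operator corresponding to the current memory gate in the GRU operator, the activation type of the third activation operator being tanh; input the input data $x_t$, the weights and the offsets into the fifth multiplication operator and calculate ($W_{in} x_t + b_{in}$) to obtain the fifth operation result, where $W_{in}$ and $b_{in}$ are the first weight and the first offset corresponding to the current memory gate among the weights and offsets, respectively; input the output data $h_{t-1}$, the weights and the offsets into the sixth multiplication operator and calculate ($W_{hn} h_{t-1} + b_{hn}$) to obtain the sixth operation result, where $W_{hn}$ and $b_{hn}$ are the second weight and the second offset corresponding to the current memory gate among the weights and offsets, respectively; input the sixth operation result and the output result $r_t$ of the reset gate into the first vector multiplication operator for element-wise multiplication to obtain the first dot-multiplication result; input the fifth operation result and the first dot-multiplication result into the third addition operator for summation to obtain the third summation result; and input the third summation result into the third activation operator for activation to obtain the output result $n_t$ of the current memory gate.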
- the computing device is specifically configured to: obtain the second vector multiplication operator, the first subtraction operator, the third vector multiplication operator, and the fourth addition operator corresponding to the output layer in the GRU operator; input the output result $z_t$ of the update gate and the output result $n_t$ of the current memory gate into the second vector multiplication operator for element-wise multiplication to obtain the second dot-multiplication result; input the output result $n_t$ of the current memory gate and the second dot-multiplication result into the first subtraction operator and perform a subtraction to obtain the first difference result; input the output result $z_t$ of the update gate and the output data $h_{t-1}$ into the third vector multiplication operator for element-wise multiplication to obtain the third dot-multiplication result; and input the first difference result and the third dot-multiplication result into the fourth addition operator for summation to obtain the output result $h_t$.
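The operator chain of the output layer reproduces the closed-form expression for $h_t$. Written out step by step (the underbrace labels simply name the intermediate results from the text, and $\odot$ denotes element-wise multiplication):

```latex
\begin{aligned}
h_t &= \underbrace{\bigl(n_t - \underbrace{z_t \odot n_t}_{\text{second dot-multiplication result}}\bigr)}_{\text{first difference result}}
      + \underbrace{z_t \odot h_{t-1}}_{\text{third dot-multiplication result}} \\
    &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}.
\end{aligned}
```

Factoring $n_t$ out of the first difference result gives the $(1 - z_t) \odot n_t$ term, so the four operators together compute the standard GRU output without ever forming the vector $1 - z_t$ explicitly.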
- the foregoing computing device specifically includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and at least one slave processing circuit;
- the controller unit is used to obtain the input data x t of the input layer at time t , the output data h t-1 of the hidden layer input of the previous GRU, and the weight value;
- the controller unit is used to call a pre-constructed GRU operator from a pre-packaged function library
- the controller unit is configured to send input data x t , output data h t-1 , weight and GRU operator to the main processing circuit;
- the main processing circuit is used for splitting the input data $x_t$ into multiple input data blocks, splitting the output data $h_{t-1}$ into multiple output data blocks, distributing the multiple input data blocks and output data blocks to the slave processing circuit, and broadcasting the weights and a part of the operators in the GRU operator to the slave processing circuit;
- the slave processing circuit is used to input the received input data block, output data block and weights into the operators corresponding to the reset gate among that part of the operators, to obtain the intermediate result corresponding to the reset gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operators corresponding to the reset gate among the other part of the operators in the GRU operator to obtain the output result $r_t$ of the reset gate;
- the slave processing circuit is used to input the received input data block, output data block and weights into the operators corresponding to the update gate among that part of the operators, to obtain the intermediate result of the update gate, and to send the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operators corresponding to the update gate among the other part of the operators to obtain the output result $z_t$ of the update gate;
- the master processing circuit is used to distribute the output result r t of the reset gate to the slave processing circuit;
- the slave processing circuit is used to input the received input data block, output data block, weights, and output result $r_t$ into the operators corresponding to the current memory gate among that part of the operators, to obtain the intermediate result of the current memory gate, and to send the intermediate result of the current memory gate to the main processing circuit; the main processing circuit inputs the intermediate result of the current memory gate into the operators corresponding to the current memory gate among the other part of the operators to obtain the output result $n_t$ of the current memory gate;
- the main processing circuit is used to input the output result $z_t$ of the update gate, the output result $n_t$ of the current memory gate, and the output data $h_{t-1}$ into the operators corresponding to the output layer among the other part of the operators, to obtain the output result $h_t$.
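The division of work between the master and slave processing circuits can be sketched for the reset gate as follows. This is a simulation under assumptions, not the hardware behavior: the data is assumed to be split along a batch dimension, each loop iteration stands in for one slave processing circuit, and the function names are hypothetical.

```python
import math

def split(items, parts):
    """Master side: split a non-empty list into at most `parts` contiguous blocks."""
    size = -(-len(items) // parts)  # ceiling division
    return [items[k:k + size] for k in range(0, len(items), size)]

def slave_multiply(block, W, b):
    """Slave side: multiplication operator W*v + b for every vector v in the block."""
    return [[sum(w * e for w, e in zip(row, v)) + bias for row, bias in zip(W, b)]
            for v in block]

def reset_gate(x_batch, h_batch, W_ir, b_ir, W_hr, b_hr, n_slaves=2):
    r = []
    for x_blk, h_blk in zip(split(x_batch, n_slaves), split(h_batch, n_slaves)):
        # intermediate results computed by one (simulated) slave processing circuit
        from_x = slave_multiply(x_blk, W_ir, b_ir)
        from_h = slave_multiply(h_blk, W_hr, b_hr)
        # master side: addition operator followed by the sigmoid activation operator
        for a, c in zip(from_x, from_h):
            r.append([1.0 / (1.0 + math.exp(-(u + v))) for u, v in zip(a, c)])
    return r

x_batch = [[1.0], [2.0], [3.0]]
h_batch = [[0.0], [0.0], [0.0]]
r = reset_gate(x_batch, h_batch, W_ir=[[1.0]], b_ir=[0.0], W_hr=[[1.0]], b_hr=[0.0])
```

Here the weights are shared by every block (the "broadcast"), while the data blocks differ per slave (the "distribution"); the update gate and current memory gate follow the same pattern with their own operators.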
- the foregoing computing device may further include: a storage unit 10 and a direct memory access unit 50.
- the storage unit 10 may include one or any combination of a register and a cache. Specifically, the cache is used to store the calculation instructions; the register is used to store the input data and scalars; the cache is a high-speed scratchpad cache.
- the direct memory access unit 50 is used to read data from or store data to the storage unit 10.
- the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
- the instruction storage unit 110 is used to store GRU operators associated with GRU operations
- the instruction processing unit 111 is configured to parse the GRU operator to obtain multiple GRU operators;
- the storage queue unit 113 is used to store an instruction queue including multiple GRU operators to be executed in the order of the queue.
- the above register may be an off-chip memory. Of course, in actual application, it may also be an on-chip memory for storing data.
- the data may specifically be multi-dimensional (more than 2 dimensions) data.
- the controller unit may further include:
- the dependency processing unit 108, which is configured to determine, when there are multiple GRU operators, whether there is an association relationship between the first GRU operator and the zeroth GRU operator that precedes it; if the first GRU operator is associated with the zeroth GRU operator, the first GRU operator is cached in the instruction storage unit, and after the execution of the zeroth GRU operator is completed, the first GRU operator is extracted from the instruction storage unit and transmitted to the arithmetic unit;
- determining whether the first GRU operator is associated with the zeroth GRU operator that precedes it includes:
- extracting, according to the first GRU operator, the first storage address interval of the data (such as a matrix) required by the first GRU operator, and extracting, according to the zeroth GRU operator, the zeroth storage address interval of the matrix required by the zeroth GRU operator; if the first storage address interval overlaps with the zeroth storage address interval, it is determined that the first GRU operator and the zeroth GRU operator have an association relationship; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first GRU operator and the zeroth GRU operator do not have an association relationship.
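The association test above reduces to an interval-overlap check. A minimal sketch, assuming each storage address interval is a closed range given as a `(start, end)` pair; the function names and example addresses are illustrative only:

```python
def intervals_overlap(first, zeroth):
    """True if two closed address intervals (start, end) share at least one address."""
    return first[0] <= zeroth[1] and zeroth[0] <= first[1]

def has_association(first_op_interval, zeroth_op_interval):
    """The first operator must wait for the zeroth one when their intervals overlap."""
    return intervals_overlap(first_op_interval, zeroth_op_interval)

# Overlapping ranges: the first operator is held back until the zeroth completes
dependent = has_association((0x100, 0x1FF), (0x180, 0x27F))
# Disjoint ranges: the first operator can be issued right away
independent = has_association((0x100, 0x1FF), (0x200, 0x2FF))
```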
- the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
- the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K of the multiple slave processing circuits, namely the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column. It should be noted that the K slave processing circuits shown in FIG. 2-3 include only the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column; that is, the K slave processing circuits are the slave processing circuits that are directly connected to the master processing circuit among the plurality of slave processing circuits.
- K slave processing circuits are used for forwarding of input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
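As a small illustration of which slave processing circuits are wired directly to the master in an m-by-n array, the sketch below enumerates the first row, the m-th row, and the first column. The function name and the 1-based `(row, column)` coordinates are assumptions for the example; corner circuits that belong to both a row and the first column are counted once here.

```python
def directly_connected(m: int, n: int) -> list:
    """Return sorted 1-based (row, col) positions of slaves wired to the master."""
    positions = set()
    for col in range(1, n + 1):
        positions.add((1, col))  # first row
        positions.add((m, col))  # m-th row
    for row in range(1, m + 1):
        positions.add((row, 1))  # first column
    return sorted(positions)
```

All other slave processing circuits receive their data through these circuits, which only forward it.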
- the main processing circuit 101 may further include one or any combination of a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
- the conversion processing circuit 110 is used to perform conversion processing on the data. Specifically, before the GRU operation is performed, the conversion processing circuit 110 obtains the shaping operator and the split operator received by the main processing circuit 101 and adjusts the input data x_t, output data h_{t-1}, weights, and offsets received by the main processing circuit 101 to the preset four-dimensional tensor format, that is, it performs an exchange between the first data structure and the second data structure (for example, a conversion between continuous data and discrete data); when the output result h_t is obtained, the output result h_t is input to the shaping operator and the split operator in the other part of the operators, and is adjusted to the preset format (that is, the four-dimensional tensor format) to obtain the final output result.
- the activation processing circuit 111 is used to execute the activation operation on the data in the main processing circuit;
- the addition processing circuit 112 is used to perform an addition operation or an accumulation operation.
- the slave processing circuit 102 may further include: one or any combination of the multiplication processing circuit 120 and the accumulation processing circuit 121;
- the multiplication processing circuit 120 is used to perform multiplication operations on the data in the slave processing circuit, such as vector-vector dot multiplication, matrix-matrix dot multiplication, matrix-matrix convolution, matrix-vector convolution, and so on;
- the accumulation processing circuit 121 is used to perform an accumulation operation.
- the calculation instructions to be executed in the GRU operator include matrix-multiply-matrix instructions, accumulation instructions, activation instructions, and other calculation instructions.
- the operation unit includes a tree module 40, and the tree module includes a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and the multiple branch ports of the tree module are respectively connected to one slave processing circuit of the multiple slave processing circuits;
- the above tree module has both sending and receiving functions: Figure 2-4a shows the tree module performing the sending function, and Figure 2-4b shows the tree module performing the receiving function.
- the tree module is configured to forward the input data block, output data h t-1 , weight, offset, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- the tree module is an optional structure of the computing device. It may include at least one layer of nodes; each node is a wiring structure with a forwarding function, and the node itself need not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
- the tree module may have an n-ary tree structure, for example, the binary tree structure shown in FIG. 2-4c, or, of course, a ternary tree structure, where n may be an integer greater than or equal to 2.
- the specific implementation of the present application does not limit the specific value of n, and the number of layers may also be 2.
- the slave processing circuit may be connected to nodes of layers other than the penultimate layer, for example, to the nodes of the last layer shown in FIG. 2-4c.
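To make the forwarding-only n-ary tree concrete, the following sketch estimates how many node layers are needed to reach a given number of slave processing circuits and then forwards one item unchanged to each of them. The function names and the assumption that every node fans out to exactly n children are illustrative and not taken from the application.

```python
def layers_needed(num_slaves: int, n: int) -> int:
    """Smallest number of node layers whose fan-out n reaches all slaves."""
    if n < 2:
        raise ValueError("n must be an integer greater than or equal to 2")
    layers, ports = 0, 1
    while ports < num_slaves:
        ports *= n
        layers += 1
    return layers


def broadcast(data, num_slaves: int, n: int) -> list:
    """Forward data from the root port to every slave; nodes do no computation."""
    frontier = [data]
    for _layer in range(layers_needed(num_slaves, n)):
        # Each node copies what it received to its n children.
        frontier = [item for item in frontier for _child in range(n)]
    return frontier[:num_slaves]
```

Under these assumptions a single slave needs zero layers, which matches the remark that the tree module is then not required.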
- the arithmetic unit 12 may carry a separate buffer, as shown in FIG. 2-2a, and may include a neuron buffer unit 63, which buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
- the operation unit may further include: a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
- the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIGS. 2-5, where,
- the branch processing circuit 103 may include a memory. As shown in FIG. 2-5, the size of the memory of the branch processing circuit 103 may be between 2 and 2.5 times the maximum data capacity that a single slave processing circuit needs to store. With this arrangement, the slave processing circuits do not need their own memory. For one branch processing circuit, only 2.5*R needs to be provided (R being the capacity value required by a single slave processing circuit), whereas without a branch processing circuit 4*R would need to be provided and the register utilization would still be low; this structure can therefore effectively reduce the total memory capacity and reduce the cost.
- the branch processing circuit is configured to forward input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- w the input data block
- the main processing circuit determines that the position of the intermediate result in the hidden-layer output result is (w, i). For example, for the input intermediate result (1,1) calculated from input data block (1,1) and the first column of weights, the main processing circuit arranges the input intermediate result (1,1) in the first row and first column of the hidden-layer output result.
- the multiplication processing circuit 120 is used to input the received input data block, the weights, and the offsets into the first multiplication operator, perform a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and perform a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; and to input the received output data h_{t-1}, the weights, and the offsets into the second multiplication operator, perform a product operation on each element value in the received output data h_{t-1} and the element value at the corresponding position in the weights, and perform a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result of the reset gate (W_{ir}*x_t + b_{ir}), and to accumulate the other product result to obtain the output intermediate result of the reset gate (W_{hr}*h_{t-1} + b_{hr});
- the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate in the part of the operators, and W_{ir}, W_{hr}, b_{ir}, and b_{hr} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate among the weights and offsets.
- the addition processing circuit 112 is used to input the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and perform a sum operation on the input intermediate result and the output intermediate result to obtain the first summation result;
- the activation processing circuit 111 is used to input the first summation result into the first activation operator and perform a sigmoid activation operation on the first summation result to obtain the output result r_t of the reset gate;
- the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
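The division of labour for the reset gate can be sketched in plain Python as follows. This is a minimal illustration, not the device's implementation: the function names are hypothetical, vectors and matrices are nested lists, and the offset is added once per output element after accumulation, which is one possible reading of the product-then-accumulate description.

```python
import math


def slave_affine(weight, vec, bias):
    """Slave side: element-wise products, accumulation per row, then the offset."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def master_gate(input_mid, output_mid):
    """Master side: add the two intermediate results, then apply sigmoid."""
    return [1.0 / (1.0 + math.exp(-(a + b)))
            for a, b in zip(input_mid, output_mid)]


def reset_gate(x_t, h_prev, W_ir, b_ir, W_hr, b_hr):
    input_mid = slave_affine(W_ir, x_t, b_ir)      # W_ir * x_t + b_ir
    output_mid = slave_affine(W_hr, h_prev, b_hr)  # W_hr * h_{t-1} + b_hr
    return master_gate(input_mid, output_mid)      # r_t
```

The update gate follows the same pattern with W_{iz}, b_{iz}, W_{hz}, and b_{hz} in place of the reset-gate parameters.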
- the multiplication processing circuit 120 is used to input the received input data block, the weights, and the offsets into the third multiplication operator, perform a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and perform a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; and to input the received output data h_{t-1}, the weights, and the offsets into the fourth multiplication operator, perform a product operation on each element value in the received output data h_{t-1} and the element value at the corresponding position in the weights, and perform a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result of the update gate (W_{iz}*x_t + b_{iz}), and to accumulate the other product result to obtain the output intermediate result of the update gate (W_{hz}*h_{t-1} + b_{hz});
- the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate in the part of the operators, and W_{iz}, W_{hz}, b_{iz}, and b_{hz} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the update gate among the weights and offsets.
- the addition processing circuit 112 is configured to input the input intermediate result and the output intermediate result of the update gate into the second addition operator, and perform a sum operation on the input intermediate result and the output intermediate result to obtain the second summation result;
- the activation processing circuit 111 is used to input the second summation result into the second activation operator and perform a sigmoid activation operation on the second summation result to obtain the output result z_t of the update gate;
- the second addition operator and the second activation operator are the operators corresponding to the update gate in another part of the operators.
- the multiplication processing circuit 120 is used to input the received input data block, the weights, and the offsets into the fifth multiplication operator, perform a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and perform a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; and to input the received output data h_{t-1}, the weights, and the offsets into the sixth multiplication operator, perform a product operation on each element value in the received output data h_{t-1} and the element value at the corresponding position in the weights, and perform a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit 121 is used to accumulate the product result to obtain the input intermediate result of the current memory gate (W_{in}*x_t + b_{in}), and to accumulate the other product result to obtain the output intermediate result of the current memory gate (W_{hn}*h_{t-1} + b_{hn});
- the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate in the part of the operators, and W_{in}, W_{hn}, b_{in}, and b_{hn} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the current memory gate among the weights and offsets.
- the addition processing circuit 112 is used to input the input intermediate result of the current memory gate and the first dot-multiplication result into the third addition operator, and perform a sum operation on the input intermediate result of the current memory gate and the first dot-multiplication result to obtain the third summation result;
- the activation processing circuit 111 is used to input the third summation result into the third activation operator and perform a tanh activation operation on the third summation result to obtain the output result n_t of the current memory gate;
- the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
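The current memory gate differs from the other two gates in that the reset gate output r_t scales the hidden-state term before the addition. A minimal sketch, assuming nested-list vectors and hypothetical function names, and reading "dot multiplication" as the element-wise product:

```python
import math


def affine(weight, vec, bias):
    """Multiply, accumulate per row, and add the offset."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def current_memory_gate(x_t, h_prev, r_t, W_in, b_in, W_hn, b_hn):
    input_mid = affine(W_in, x_t, b_in)      # W_in * x_t + b_in
    output_mid = affine(W_hn, h_prev, b_hn)  # W_hn * h_{t-1} + b_hn
    # First vector multiplication operator: element-wise product with r_t.
    first_dot = [r * o for r, o in zip(r_t, output_mid)]
    # Third addition operator followed by the tanh activation operator.
    third_sum = [i + d for i, d in zip(input_mid, first_dot)]
    return [math.tanh(s) for s in third_sum]
```

When r_t is zero the hidden-state contribution vanishes and n_t depends only on the current input.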
- the multiplication processing circuit 120 is used to input the output result z_t of the update gate and the output result n_t of the current memory gate into the second vector multiplication operator and perform a dot multiplication on z_t and n_t to obtain the second dot-multiplication result; to input the received output result z_t of the update gate and the output data h_{t-1} into the third vector multiplication operator and perform a dot multiplication on z_t and h_{t-1} to obtain the third dot-multiplication result; and to send the second dot-multiplication result and the third dot-multiplication result to the main processing circuit 101; the addition processing circuit 112 is used to input the output result n_t of the current memory gate and the second dot-multiplication result into the first subtraction operator and perform a subtraction operation on n_t and the second dot-multiplication result to obtain the first difference result, and to input the third dot-multiplication result and the first difference result into the fourth addition operator and perform a sum operation on the third dot-multiplication result and the first difference result to obtain the output result h_t;
- the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer in the part of the operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators.
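The output-layer operator chain (two element-wise products, one subtraction, one addition) can be sketched as below. The function name is hypothetical and the vectors are plain Python lists; the order of operations follows the text.

```python
def output_layer(z_t, n_t, h_prev):
    # Second vector multiplication operator: z_t * n_t (element-wise).
    second_dot = [z * n for z, n in zip(z_t, n_t)]
    # Third vector multiplication operator: z_t * h_{t-1} (element-wise).
    third_dot = [z * h for z, h in zip(z_t, h_prev)]
    # First subtraction operator: n_t - z_t * n_t.
    first_diff = [n - d for n, d in zip(n_t, second_dot)]
    # Fourth addition operator: (n_t - z_t * n_t) + z_t * h_{t-1}.
    return [d + t for d, t in zip(first_diff, third_dot)]
```

For example, `output_layer([0.25], [4.0], [8.0])` evaluates 4.0 - 1.0 + 2.0 and returns `[5.0]`; with z_t equal to 1 the previous hidden state is passed through unchanged.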
- the present application also provides a GRU calculation method.
- the GRU includes: an input layer, a hidden layer, a reset gate, an update gate, a current memory gate, and an output layer.
- the calculation method is applied to a computing device, and the calculation method includes:
- Step S601: The computing device obtains the input data x_t of the input layer at time t, the output data h_{t-1} input from the hidden layer of the previous GRU, and the weights.
- Step S602 The computing device calls a pre-constructed GRU operator from a pre-packaged function library.
- Step S603 The computing device inputs the input data x t , the output data h t-1 , and the weight value into the pre-constructed GRU operator to obtain an output result h t .
- inputting the input data x_t, output data h_{t-1}, and weights into the pre-constructed GRU operator to obtain the output result h_t specifically includes:
- before calling the pre-constructed GRU operator from the pre-packaged function library, the method further includes:
- the computing device obtains the offset.
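Steps S601 to S603 amount to looking up a ready-made operator and applying it to the inputs. The sketch below illustrates that flow with a dictionary standing in for the pre-packaged function library; `FUNCTION_LIBRARY`, `gru_operator`, `run_gru_step`, and the dictionary keys are all hypothetical names chosen for the example, and the operator body is a plain-Python reference rather than the device's operator.

```python
import math


def _affine(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def _sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def gru_operator(x_t, h_prev, weights, offsets):
    """Reference GRU step; weights/offsets are dicts keyed 'ir','hr','iz','hz','in','hn'."""
    def mid(key, vec):
        return _affine(weights[key], vec, offsets[key])
    r_t = _sigmoid([a + b for a, b in zip(mid("ir", x_t), mid("hr", h_prev))])
    z_t = _sigmoid([a + b for a, b in zip(mid("iz", x_t), mid("hz", h_prev))])
    n_t = [math.tanh(a + r * b)
           for a, r, b in zip(mid("in", x_t), r_t, mid("hn", h_prev))]
    # h_t = n_t - z_t * n_t + z_t * h_{t-1}
    return [n - z * n + z * h for z, n, h in zip(z_t, n_t, h_prev)]


# Hypothetical pre-packaged function library mapping names to operators.
FUNCTION_LIBRARY = {"gru": gru_operator}


def run_gru_step(x_t, h_prev, weights, offsets):
    op = FUNCTION_LIBRARY["gru"]               # S602: call the pre-built operator
    return op(x_t, h_prev, weights, offsets)   # S603: obtain h_t
```

With all weights and offsets set to zero, both gates evaluate to 0.5 and n_t to 0, so the step simply halves the previous hidden state.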
- inputting the input data x_t, output data h_{t-1}, and weights into the operators corresponding to the reset gate in the GRU operator to obtain the output result r_t of the reset gate specifically includes:
- the activation type of the first activation operator is sigmoid
- the first summation result is input into the first activation operator for activation, and the output result r t of the reset gate is obtained.
- inputting the input data x_t, output data h_{t-1}, and weights into the operators corresponding to the update gate in the GRU operator to obtain the output result z_t of the update gate specifically includes:
- the activation type of the second activation operator is sigmoid
- the second summation result is input into the second activation operator for activation, and the output result z t of the update gate is obtained.
- inputting the input data x_t, the output data h_{t-1}, the weights, and the output result r_t of the reset gate into the operators corresponding to the current memory gate in the GRU operator to obtain the output result n_t of the current memory gate specifically includes:
- the output data h t-1 , the weight and the offset are input to the sixth multiplication operator, and the sum (W hn *h t-1 +b hn ) is calculated to obtain a sixth operation result, where W hn and b hn is the second weight and the second offset corresponding to the current memory gate respectively among the weights and the offset;
- the sixth operation result and the output result r_t of the reset gate are input to the first vector multiplication operator, and a dot multiplication is performed on the output result r_t of the reset gate and the sixth operation result to obtain the first dot-multiplication result;
- the third summation result is input to the third activation operator for activation to obtain the current memory gate output result n t .
- inputting the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} into the operators corresponding to the output layer in the GRU operator to obtain the output result h_t specifically includes:
- the output result z_t of the update gate and the output result n_t of the current memory gate are input to the second vector multiplication operator for dot multiplication to obtain the second dot-multiplication result;
- the first difference result and the third dot-multiplication result are input to the fourth addition operator for summation to obtain the output result h_t.
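Collecting the intermediate results named above, the operator chain can be restated compactly as follows. This restatement assumes that "dot multiplication" denotes the element-wise product (written ⊙) and that σ denotes the sigmoid activation; the last line keeps the subtract-then-add order given in the text and shows its equivalent factored form.

```latex
\begin{aligned}
r_t &= \sigma\big((W_{ir} x_t + b_{ir}) + (W_{hr} h_{t-1} + b_{hr})\big) \\
z_t &= \sigma\big((W_{iz} x_t + b_{iz}) + (W_{hz} h_{t-1} + b_{hz})\big) \\
n_t &= \tanh\big((W_{in} x_t + b_{in}) + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \\
h_t &= (n_t - z_t \odot n_t) + z_t \odot h_{t-1} = (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
```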
- the computing device specifically includes: an arithmetic unit and a controller unit; the arithmetic unit includes: a master processing circuit and a slave processing circuit; and the method specifically includes:
- the controller unit obtains the input data x t of the input layer at time t , the output data h t-1 of the hidden layer input of the previous GRU, and the weight value;
- the controller unit calls a pre-constructed GRU operator from a pre-packaged function library
- the controller unit sends input data x t , output data h t-1 , weight and GRU operator to the main processing circuit;
- the main processing circuit splits the input data x_t into multiple input data blocks, splits the output data h_{t-1} into multiple output data blocks, distributes the multiple input data blocks and the multiple output data blocks to the slave processing circuits, and broadcasts the weights and a part of the operators in the GRU operator to the slave processing circuits; each slave processing circuit inputs the received input data block, output data h_{t-1}, and weights into the operators corresponding to the reset gate in that part of the operators to obtain the intermediate result of the reset gate, and sends the intermediate result to the main processing circuit; the main processing circuit inputs the intermediate result into the operators corresponding to the reset gate in the other part of the operators in the GRU operator to obtain the output result r_t of the reset gate;
- the main processing circuit distributes the output result r_t of the reset gate to the slave processing circuits;
- each slave processing circuit inputs the received input data block, output data h_{t-1}, weights, and output result r_t into the operators corresponding to the current memory gate in the part of the operators to obtain the intermediate result of the current memory gate;
- the intermediate result of the current memory gate is sent to the main processing circuit, and the main processing circuit inputs the intermediate result of the current memory gate to the operator corresponding to the current memory gate in another part of the operators to obtain the output result of the current memory gate n t ;
- the main processing circuit inputs the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} into the operators corresponding to the output layer in the other part of the operators to obtain the output result h_t.
- the method further includes: the controller unit obtains the offset and sends the offset to the master processing circuit; the master processing circuit broadcasts the offset to the slave processing circuits.
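The split, distribute, broadcast, and gather pattern between the master and the slave processing circuits can be sketched as follows. The application does not state along which dimension the data is split; this example assumes a batch of input vectors split into contiguous blocks, and all function names are hypothetical.

```python
def split_blocks(rows, num_slaves):
    """Master: split a batch (list of vectors) into contiguous blocks, one per slave."""
    if not rows:
        return []
    size = -(-len(rows) // num_slaves)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def slave_compute(block, weight, bias):
    """Slave: multiply, accumulate, and add the offset for every vector in its block."""
    return [[sum(w * v for w, v in zip(row, vec)) + b
             for row, b in zip(weight, bias)]
            for vec in block]


def master_run(batch, weight, bias, num_slaves):
    blocks = split_blocks(batch, num_slaves)            # distribute data blocks
    # The same weight and offset are broadcast to every slave.
    partial = [slave_compute(block, weight, bias) for block in blocks]
    # Gather the intermediate results back in their original order.
    return [row for block_result in partial for row in block_result]
```

Because every slave receives the same weights and offsets, only the data blocks differ between slaves, and the master can reassemble the intermediate results by simple concatenation.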
- the input output data h -1 is a preset initialization value
- the GRU is a multi-layer GRU
- the input output data h t-1 is an initialized vector.
- the operation unit includes a tree module, and the tree module includes a root port and multiple branch ports; the root port of the tree module is connected to the main processing circuit, and the multiple branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits;
- the tree module forwards input data blocks, output data h t-1 , weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- the arithmetic unit further includes one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit;
- the branch processing circuit forwards the input data block, output data h t-1 , weight, offset, and intermediate result between the master processing circuit and the plurality of slave processing circuits.
- the multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the plurality of slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the m-th row, and the m slave processing circuits in the first column;
- the K slave processing circuits forward the input data blocks, output data h_{t-1}, weights, offsets, and intermediate results between the master processing circuit and the plurality of slave processing circuits.
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; obtaining the intermediate result of the reset gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weights, and the offsets into the first multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and performs a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; it inputs the received output data h_{t-1}, the weights, and the offsets into the second multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weights, and performs a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain the input intermediate result of the reset gate (W_{ir}*x_t + b_{ir}), and performs an accumulation operation on the other product result to obtain the output intermediate result of the reset gate (W_{hr}*h_{t-1} + b_{hr});
- the first multiplication operator and the second multiplication operator are the operators corresponding to the reset gate in the part of the operators, and W_{ir}, W_{hr}, b_{ir}, and b_{hr} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the reset gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit;
- the obtaining the output result r t of the reset gate specifically includes:
- the addition processing circuit inputs the input intermediate result and the output intermediate result of the reset gate into the first addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a first summation result;
- the activation processing circuit inputs the first summation result into the first activation operator, performs a sigmoid activation operation on the first summation result, and obtains an output result r t of the reset gate;
- the first addition operator and the first activation operator are the operators corresponding to the reset gate in another part of the operators.
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; obtaining the intermediate result of the update gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weights, and the offsets into the third multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and performs a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; it inputs the received output data h_{t-1}, the weights, and the offsets into the fourth multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weights, and performs a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain the input intermediate result of the update gate (W_{iz}*x_t + b_{iz}), and performs an accumulation operation on the other product result to obtain the output intermediate result of the update gate (W_{hz}*h_{t-1} + b_{hz});
- the third multiplication operator and the fourth multiplication operator are the operators corresponding to the update gate in the part of the operators, and W_{iz}, W_{hz}, b_{iz}, and b_{hz} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the update gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit;
- the obtained output result z t of the update gate specifically includes:
- the addition processing circuit inputs the input intermediate result and the output intermediate result of the update gate into the second addition operator, and performs a sum operation on the input intermediate result and the output intermediate result to obtain a second summation result;
- the activation processing circuit inputs the second summation result into the second activation operator, performs a sigmoid activation operation on the second summation result, and obtains the output result z t of the update gate ;
- the second addition operator and the second activation operator are operators corresponding to the update gate in another part of the operators.
- the slave processing circuit includes a multiplication processing circuit and an accumulation processing circuit; obtaining the intermediate result of the current memory gate specifically includes:
- the multiplication processing circuit inputs the received input data block, the weights, and the offsets into the fifth multiplication operator, performs a product operation on each element value in the received input data block and the element value at the corresponding position in the weights, and performs a sum operation on the product result and the element value at the corresponding position in the offsets to obtain a product result; it inputs the received output data h_{t-1}, the weights, and the offsets into the sixth multiplication operator, performs a product operation on each element value in the output data h_{t-1} and the element value at the corresponding position in the weights, and performs a sum operation on the product result and another element value at the corresponding position in the offsets to obtain another product result;
- the accumulation processing circuit performs an accumulation operation on the product result to obtain the input intermediate result of the current memory gate (W_{in}*x_t + b_{in}), and performs an accumulation operation on the other product result to obtain the output intermediate result of the current memory gate (W_{hn}*h_{t-1} + b_{hn});
- the multiplication processing circuit inputs the output result r_t of the reset gate into the first vector multiplication operator, and performs a dot multiplication on the output result r_t of the reset gate and the output intermediate result of the current memory gate to obtain the first dot-multiplication result;
- the fifth multiplication operator, the sixth multiplication operator, and the first vector multiplication operator are the operators corresponding to the current memory gate in the part of the operators, and W_{in}, W_{hn}, b_{in}, and b_{hn} are, respectively, the first weight, the second weight, the first offset, and the second offset corresponding to the current memory gate among the weights and offsets.
- the main processing circuit includes an activation processing circuit and an addition processing circuit;
- the obtaining the output result n t of the current memory gate specifically includes:
- the addition processing circuit inputs the input intermediate result of the current memory gate and the first dot-multiplication result into the third addition operator, and performs a sum operation on the input intermediate result of the current memory gate and the first dot-multiplication result to obtain the third summation result;
- the activation processing circuit inputs the third summation result into the third activation operator, performs a tanh activation operation on the third summation result, and obtains the output result n t of the current memory gate;
- the third addition operator and the third activation operator are the operators corresponding to the current memory gate in another part of the operators.
- the master processing circuit includes an addition processing circuit
- the slave processing circuit includes a multiplication processing circuit
- determining the output result of the output layer specifically includes:
- the main processing circuit sends the output result z_t of the update gate, the output result n_t of the current memory gate, and the output data h_{t-1} to the slave processing circuit;
- the multiplication processing circuit inputs the output result z_t of the update gate and the output result n_t of the current memory gate into the second vector multiplication operator and performs a dot multiplication on z_t and n_t to obtain the second dot-multiplication result; it inputs the received output result z_t of the update gate and the output data h_{t-1} into the third vector multiplication operator and performs a dot multiplication on z_t and h_{t-1} to obtain the third dot-multiplication result; and it sends the second dot-multiplication result and the third dot-multiplication result to the main processing circuit;
- the addition processing circuit inputs the output result n_t of the current memory gate and the second dot-multiplication result into the first subtraction operator and performs a subtraction operation on n_t and the second dot-multiplication result to obtain the first difference result; it inputs the third dot-multiplication result and the first difference result into the fourth addition operator and performs a sum operation on the third dot-multiplication result and the first difference result to obtain the output result h_t;
- the second vector multiplication operator and the third vector multiplication operator are the operators corresponding to the output layer in the part of the operators, and the first subtraction operator and the fourth addition operator are the operators corresponding to the output layer in the other part of the operators.
- the main processing circuit includes a conversion processing circuit
- the conversion processing circuit inputs the output result h t to the shaping operator and the split operator in another part of the operators, and adjusts the data format of the output result h t to a preset format to obtain the final output result.
- This application also discloses a GRU device, which includes one or more of the computing devices mentioned in this application and is used to obtain data to be calculated and control information from other processing devices, perform specified GRU operations, and pass the execution result to peripheral devices through an I/O interface.
- Peripheral devices include, for example, cameras, monitors, mice, keyboards, network cards, WiFi interfaces, and servers.
- When more than one computing device is included, the computing devices can be linked and transmit data through a specific structure, for example interconnected through a PCIE bus, to support larger-scale convolutional neural network training operations.
- the interconnection method can be any interconnection topology.
- the GRU device has high compatibility, and is connected with various types of servers through a PCIE interface.
- the present application also discloses a combined processing device, which includes the above-mentioned GRU device, general interconnection interface, and other processing devices.
- the GRU computing device interacts with other processing devices to complete the operation specified by the user.
- Figures 2-7 are schematic diagrams of combined processing devices.
- Other processing devices include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), and a neural network processor.
- the number of processors included in other processing devices is not limited.
- Other processing devices serve as the interface between the GRU computing device and external data and control, including data handling, and complete the basic control of starting and stopping the GRU computing device; other processing devices can also cooperate with the GRU computing device to complete the computing task.
- the general interconnection interface is used to transfer data and control instructions between the GRU device and other processing devices.
- the GRU device obtains the required input data from other processing devices and writes it to the on-chip storage device of the GRU device; it can obtain control instructions from other processing devices and write them to the on-chip control buffer of the GRU device; it can also read the data in the storage module of the GRU device and transmit it to other processing devices.
- the structure may further include a storage device, which is respectively connected to the GRU device and the other processing device.
- the storage device is used to store data of the GRU device and the other processing devices, and is particularly suitable for data that cannot be held entirely in the internal storage of the GRU device or the other processing devices.
- the combined processing device can be used as a system on chip (SoC) for mobile phones, robots, drones, video surveillance equipment, and similar devices, effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
- In this case, the general interconnection interface of the combined processing device is connected to certain components of the device, such as a camera, monitor, mouse, keyboard, network card, or WiFi interface.
- a chip is also claimed, which includes the above-mentioned GRU device or combined processing device.
- a chip packaging structure is also claimed, which includes the above chip.
- a board card is also claimed, which includes the above chip packaging structure.
- FIG. 2-9 provides a board card.
- the board card may also include other supporting components.
- the supporting components include but are not limited to: a storage device 390, an interface device 391, and a control device 392;
- the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
- the storage device may include multiple groups of storage units 393. Each group of storage units is connected to the chip by a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
- DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
- the storage device may include four groups of the storage units. Each group of storage units may include multiple DDR4 chips (particles). In one embodiment, the chip may include four 72-bit DDR4 controllers. In each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC check. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
- each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transfer data twice in one clock cycle.
- a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
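The 25600 MB/s figure quoted for DDR4-3200 follows from the 64-bit data path and the double data rate. The short Python sketch below reproduces that arithmetic; the function name `ddr_peak_bandwidth_mb_s` is hypothetical, and the 1600 MHz I/O clock is the standard DDR4-3200 clock rather than a value stated in this document.

```python
def ddr_peak_bandwidth_mb_s(transfers_mt_s: int, data_bits: int) -> int:
    """Theoretical peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return transfers_mt_s * data_bits // 8


IO_CLOCK_MHZ = 1600            # DDR4-3200 I/O clock (assumption, not from the text)
TRANSFERS = 2 * IO_CLOCK_MHZ   # two transfers per clock cycle (rising and falling edge)
DATA_BITS = 72 - 8             # 72-bit controller minus 8 ECC bits leaves 64 data bits

per_controller_mb_s = ddr_peak_bandwidth_mb_s(TRANSFERS, DATA_BITS)
print(TRANSFERS, per_controller_mb_s)  # 3200 25600
```

The ECC bits are excluded because they carry check information, not payload data.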
- the interface device is electrically connected to the chip in the chip packaging structure.
- the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
- the interface device may also be other interfaces.
- the present application does not limit the specific form of the other interfaces described above, as long as the interface unit can implement the transfer function.
- the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
- the control device is electrically connected to the chip.
- the control device is used to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a microcontroller (Micro Controller Unit, MCU).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the chip may be in different working states such as heavy load and light load.
- the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Neurology (AREA)
- Design And Manufacture Of Integrated Circuits (AREA)
Claims (29)
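- A computing device, characterized in that the computing device is configured to perform LSTM operations, the LSTM comprising an input gate, a forget gate, an output gate, and an update state gate; the computing device comprises an operation unit, a controller unit, and a storage unit; the storage unit is configured to store LSTM operators, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t; the controller unit is configured to obtain the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operators, and to send the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operators to the operation unit; the operation unit is configured to perform the input gate operation, the forget gate operation, the output gate operation, and the update state gate operation according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operators to obtain the output result of each gate, and to obtain the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
- The device according to claim 1, characterized in that the operation unit comprises a main processing circuit and a slave processing circuit; the controller unit is specifically configured to construct, according to the LSTM operators, a plurality of split operators, a plurality of sort operators, a multiplication operator, an activation operator, and an addition operator; the main processing circuit is specifically configured to reorder the input data X_t, the weight data, and the input state value according to the sort operators, the weight data comprising the weight data of each gate, then to broadcast the weight data of each gate and the multiplication operator to the slave processing circuit according to the split algorithm, to split the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks, and to distribute the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuit; the slave processing circuit is configured to multiply, according to the multiplication operator, the plurality of input data blocks by the weight data of each gate to obtain an intermediate result of each gate, to multiply, according to the multiplication operator, the plurality of input state data blocks by the weight data of each gate to obtain a state intermediate result of each gate, and to send the intermediate result of each gate and the state intermediate result of each gate to the main processing circuit; the main processing circuit is configured to sort the intermediate results of each gate according to the sort operators to obtain a sorted result of each gate, to perform a bias operation on the sorted result of each gate according to the addition operator to obtain an operation result of each gate, to sort each state intermediate result according to the sort operators to obtain a state sorted result of each gate, and to perform a bias operation on the state sorted result of each gate according to the addition operator to obtain a state operation result of each gate; and to add, according to the addition operator, the operation result of each gate and the state operation result of each gate correspondingly and then perform subsequent processing to obtain the output result of each gate.
- The device according to claim 2, characterized in that the main processing circuit is specifically configured to multiply, according to the multiplication operator, the input state value C_{t-1} by the output result f_t of the forget gate to obtain a first result, to multiply, according to the multiplication operator, the output result g_t of the update state gate by the output result i_t of the input gate to obtain a second result, and to add the first result and the second result to obtain the output state value C_t.
- The device according to claim 3, characterized in that the main processing circuit is specifically configured to perform an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and to multiply the output result O_t of the output gate by the activation result to obtain the output result h_t.
- The device according to claim 2, characterized in that the subsequent processing specifically comprises: for the forget gate, the input gate, and the output gate, the subsequent processing is a sigmoid operation; for the update state gate, the subsequent processing is a tanh activation operation.
- The device according to claim 2, characterized in that the main processing circuit is further configured to use the output data h_t as the input result at the next time step and the output state value C_t as the input state value at the next time step.
- The device according to any one of claims 2-6, characterized in that, if there are a plurality of slave processing circuits, the operation unit comprises a tree module, the tree module comprising one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and each of the plurality of branch ports of the tree module is connected to one of the plurality of slave processing circuits; the tree module is configured to forward data and operators between the main processing circuit and the plurality of slave processing circuits.
- The device according to any one of claims 2-6, characterized in that, if there are a plurality of slave processing circuits, the operation unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit; the branch processing circuit is configured to forward data and operators between the main processing circuit and the plurality of slave processing circuits.
- The device according to any one of claims 2-6, characterized in that, if there are a plurality of slave processing circuits, the plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the other adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits among the plurality of slave processing circuits, the k basic circuits being: the n slave processing circuits in row 1, the n slave processing circuits in row m, and the m slave processing circuits in column 1; the k slave processing circuits are configured to forward data and operators between the main processing circuit and the plurality of slave processing circuits.
- The device according to any one of claims 2-6, characterized in that the main processing circuit comprises a conversion processing circuit; the conversion processing circuit is configured to perform conversion processing on data, specifically: converting the data received by the main processing circuit between a first data structure and a second data structure.
- The device according to claims 2-6, characterized in that the slave processing circuit comprises a multiplication processing circuit and an accumulation processing circuit; the multiplication processing circuit is configured to perform a product operation on the element values in the received input data block and the element values at the corresponding positions in the weights of each gate to obtain a product result of each gate, and to perform a product operation on the element values in the received input state data block and the element values at the corresponding positions in the weights of each gate to obtain another product result of each gate; the accumulation processing circuit is configured to accumulate the product results of each gate to obtain the intermediate result of each gate, and to accumulate the other product results of each gate to obtain the state intermediate result of each gate.
- The device according to claim 7, characterized in that the tree module has an n-ary tree structure, n being an integer greater than or equal to 2.
- An LSTM operation device, characterized in that the LSTM operation device comprises one or more computing devices according to any one of claims 1-12, configured to obtain data to be operated on and control information from other processing devices, to perform a specified LSTM operation, and to pass the execution result to other processing devices through an I/O interface; when the LSTM device comprises a plurality of the computing devices, the plurality of computing devices can be connected and transmit data through a specific structure; wherein the plurality of computing devices are interconnected and transmit data through a Peripheral Component Interconnect Express (PCIE) bus to support larger-scale LSTM operations; the plurality of computing devices share one control system or have their own control systems; the plurality of computing devices share memory or have their own memories; and the interconnection of the plurality of computing devices is any interconnection topology.
- A combined processing device, characterized in that the combined processing device comprises the LSTM operation device according to claim 13, a general interconnection interface, and other processing devices; the LSTM operation device interacts with the other processing devices to jointly complete a computing operation specified by a user.
- The combined processing device according to claim 14, characterized by further comprising a storage device, the storage device being connected to the LSTM operation device and the other processing devices respectively and configured to store data of the LSTM operation device and the other processing devices.
- A neural network chip, characterized in that the neural network chip comprises the computing device according to claim 1, or the LSTM operation device according to claim 13, or the combined processing device according to claim 15.
- An electronic device, characterized in that the electronic device comprises the chip according to claim 16.
- A board card, characterized in that the board card comprises a storage device, an interface device, a control device, and the neural network chip according to claim 16; wherein the neural network chip is connected to the storage device, the control device, and the interface device respectively; the storage device is configured to store data; the interface device is configured to implement data transmission between the chip and an external device; the control device is configured to monitor the state of the chip.
- The board card according to claim 18, characterized in that the storage device comprises a plurality of groups of storage units, each group of storage units being connected to the chip through a bus, the storage units being DDR SDRAM; the chip comprises a DDR controller configured to control the data transmission and data storage of each storage unit; the interface device is a standard PCIE interface.
- An LSTM operation method, characterized in that the method is applied to a computing device, the LSTM comprising an input gate, a forget gate, an output gate, and an update state gate; the computing device comprises an operation unit, a controller unit, and a storage unit; the storage unit stores LSTM operators, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t; the method comprises the following steps: the controller unit obtains the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operators, and sends the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operators to the operation unit; the operation unit performs the input gate operation, the forget gate operation, the output gate operation, and the update state gate operation according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operators to obtain the output result of each gate, and obtains the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
- The method according to claim 20, characterized in that the operation unit comprises a main processing circuit and a slave processing circuit; the operation unit performing the input gate operation, the forget gate operation, the output gate operation, and the update state gate operation according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operators to obtain the output result of each gate specifically comprises: the controller unit constructs, according to the LSTM operators, a plurality of split operators, a plurality of sort operators, a multiplication operator, an activation operator, and an addition operator; the main processing circuit reorders the input data X_t, the weight data, and the input state value according to the sort operators, the weight data comprising the weight data of each gate, then broadcasts the weight data of each gate and the multiplication operator to the slave processing circuit according to the split algorithm, splits the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks, and distributes the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuit; the slave processing circuit multiplies, according to the multiplication operator, the plurality of input data blocks by the weight data of each gate to obtain an intermediate result of each gate, multiplies, according to the multiplication operator, the plurality of input state data blocks by the weight data of each gate to obtain a state intermediate result of each gate, and sends the intermediate result of each gate and the state intermediate result of each gate to the main processing circuit; the main processing circuit sorts the intermediate results of each gate according to the sort operators to obtain a sorted result of each gate, performs a bias operation on the sorted result of each gate according to the addition operator to obtain an operation result of each gate, sorts each state intermediate result according to the sort operators to obtain a state sorted result of each gate, and performs a bias operation on the state sorted result of each gate according to the addition operator to obtain a state operation result of each gate; and adds, according to the addition operator, the operation result of each gate and the state operation result of each gate correspondingly and then performs subsequent processing to obtain the output result of each gate.
- The method according to claim 21, characterized in that obtaining the output state value C_t according to the input state value C_{t-1} and the output result of each gate specifically comprises: the main processing circuit multiplies, according to the multiplication operator, the input state value C_{t-1} by the output result f_t of the forget gate to obtain a first result, multiplies, according to the multiplication operator, the output result g_t of the update state gate by the output result i_t of the input gate to obtain a second result, and adds the first result and the second result to obtain the output state value C_t.
- The method according to claim 21, characterized in that obtaining the output data h_t according to the input state value C_{t-1} and the output result of each gate specifically comprises: the main processing circuit performs an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and multiplies the output result O_t of the output gate by the activation result to obtain the output result h_t.
- The method according to claim 21, characterized in that the subsequent processing specifically comprises: for the forget gate, the input gate, and the output gate, the subsequent processing is a sigmoid operation; for the update state gate, the subsequent processing is a tanh activation operation.
- The method according to claim 21, characterized in that the method further comprises: the main processing circuit uses the output data h_t as the input result at the next time step and the output state value C_t as the input state value at the next time step.
- The method according to any one of claims 20-25, characterized in that, if there are a plurality of slave processing circuits, the operation unit comprises a tree module, the tree module comprising one root port and a plurality of branch ports; the root port of the tree module is connected to the main processing circuit, and each of the plurality of branch ports of the tree module is connected to one of the plurality of slave processing circuits; the method further comprises: the tree module forwards data and operators between the main processing circuit and the plurality of slave processing circuits.
- The method according to any one of claims 20-25, characterized in that, if there are a plurality of slave processing circuits, the operation unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit; the method further comprises: the branch processing circuit forwards data and operators between the main processing circuit and the plurality of slave processing circuits.
- The method according to any one of claims 20-25, characterized in that, if there are a plurality of slave processing circuits, the plurality of slave processing circuits are arranged in an array; each slave processing circuit is connected to the other adjacent slave processing circuits, and the main processing circuit is connected to k slave processing circuits among the plurality of slave processing circuits, the k basic circuits being: the n slave processing circuits in row 1, the n slave processing circuits in row m, and the m slave processing circuits in column 1; the method further comprises: the k slave processing circuits forward data and operators between the main processing circuit and the plurality of slave processing circuits.
- The method according to claims 20-25, characterized in that the slave processing circuit comprises a multiplication processing circuit and an accumulation processing circuit; the method specifically comprises: the multiplication processing circuit performs a product operation on the element values in the received input data block and the element values at the corresponding positions in the weights of each gate to obtain a product result of each gate, and performs a product operation on the element values in the received input state data block and the element values at the corresponding positions in the weights of each gate to obtain another product result of each gate; the accumulation processing circuit accumulates the product results of each gate to obtain the intermediate result of each gate, and accumulates the other product results of each gate to obtain the state intermediate result of each gate.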
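The LSTM step recited in the claims above (sigmoid for the forget, input, and output gates, tanh for the update state gate, then C_t = f_t · C_{t-1} + g_t · i_t and h_t = O_t · tanh(C_t)) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the claimed circuit: the names `gate`, `lstm_step`, and the `params` layout are hypothetical, the recurrent term uses h_{t-1} as in claim 1, and the two separate bias operations of claim 2 are folded into a single bias vector for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w, v):
    """Multiply-accumulate: each output element is a sum of element-wise products."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def gate(w_x, w_h, b, x_t, h_prev, activation):
    """One gate: W_x * x_t + W_h * h_{t-1} + b, followed by its activation."""
    pre = [a + c + bias
           for a, c, bias in zip(matvec(w_x, x_t), matvec(w_h, h_prev), b)]
    return [activation(p) for p in pre]


def lstm_step(params, x_t, h_prev, c_prev):
    """params maps gate name -> (w_x, w_h, b); this layout is an assumption."""
    f_t = gate(*params["f"], x_t, h_prev, sigmoid)    # forget gate
    i_t = gate(*params["i"], x_t, h_prev, sigmoid)    # input gate
    o_t = gate(*params["o"], x_t, h_prev, sigmoid)    # output gate
    g_t = gate(*params["g"], x_t, h_prev, math.tanh)  # update state gate
    c_t = [f * c + g * i for f, c, g, i in zip(f_t, c_prev, g_t, i_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# Toy check with all-zero weights: every sigmoid gate is 0.5 and g_t is 0.
zero = ([[0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("f", "i", "o", "g")}
h_t, c_t = lstm_step(params, x_t=[3.0], h_prev=[1.0], c_prev=[2.0])
print(c_t)  # [1.0]
```

Feeding `h_t` and `c_t` back in as `h_prev` and `c_prev` gives the next time step, as described in the claims on reusing the output data and output state value.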
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811560966.5A CN109711540B (zh) | 2018-12-20 | 2018-12-20 | A computing device and board card |
CN201811560966.5 | 2018-12-20 | ||
CN201811579542.3 | 2018-12-21 | ||
CN201811579542.3A CN109670581B (zh) | 2018-12-21 | 2018-12-21 | A computing device and board card |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020125092A1 (zh) | 2020-06-25 |
Family
ID=71100404
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/105932 WO2020125092A1 (zh) | 2018-12-20 | 2019-09-16 | Computing device and board card |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020125092A1 (zh) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
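CN106775599A (zh) * | 2017-01-09 | 2017-05-31 | Nanjing Tech University | Multi-computing-unit coarse-grained reconfigurable system and method for recurrent neural networks |
CN107341542A (zh) * | 2016-04-29 | 2017-11-10 | Beijing Zhongke Cambricon Technology Co., Ltd. | Apparatus and method for performing recurrent neural network and LSTM operations |
US20180005676A1 (en) * | 2016-06-30 | 2018-01-04 | Samsung Electronics Co., Ltd. | Memory cell unit and recurrent neural network including multiple memory cell units |
CN108268939A (zh) * | 2016-12-30 | 2018-07-10 | Shanghai Cambricon Information Technology Co., Ltd. | Apparatus and operation method for performing LSTM neural network operations |
CN108446761A (zh) * | 2018-03-23 | 2018-08-24 | Institute of Computing Technology, Chinese Academy of Sciences | A neural network accelerator and data processing method |
CN108805273A (zh) * | 2018-05-20 | 2018-11-13 | Fudan University | A hardware implementation circuit for accelerating gating unit operations in LSTM |
CN109670581A (zh) * | 2018-12-21 | 2019-04-23 | Beijing Zhongke Cambricon Technology Co., Ltd. | A computing device and board card |
CN109711540A (zh) * | 2018-12-20 | 2019-05-03 | Beijing Zhongke Cambricon Technology Co., Ltd. | A computing device and board card |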
- 2019-09-16 WO PCT/CN2019/105932 patent/WO2020125092A1/zh active Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
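CN109522052B (zh) | A computing device and board card | |
CN109543832B (zh) | A computing device and board card | |
CN109104876B (zh) | An operation device and related products | |
CN111047022B (zh) | A computing device and related products | |
CN111488976B (zh) | Neural network computing device, neural network computing method and related products | |
CN109670581B (zh) | A computing device and board card | |
CN109711540B (zh) | A computing device and board card | |
CN111488963B (zh) | Neural network computing device and method | |
CN110059797B (zh) | A computing device and related products | |
CN111767995B (zh) | Operation method, device and related products | |
CN111047021B (zh) | A computing device and related products | |
CN110059809B (zh) | A computing device and related products | |
CN111368967B (zh) | A neural network computing device and method | |
CN111368986B (zh) | A neural network computing device and method | |
CN111382848B (zh) | A computing device and related products | |
CN111381882B (zh) | Data processing device and related products | |
CN112766475B (zh) | Processing component and artificial intelligence processor | |
WO2020125092A1 (zh) | Computing device and board card | |
CN111367567B (zh) | A neural network computing device and method | |
CN110472734A (zh) | A computing device and related products | |
CN111368987B (zh) | A neural network computing device and method | |
CN111368990B (zh) | A neural network computing device and method | |
CN111258641B (zh) | Operation method, device and related products | |
CN111291871A (zh) | A computing device and related products | |
CN111260046A (zh) | Operation method, device and related products |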
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19899228 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19899228 Country of ref document: EP Kind code of ref document: A1 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19899228 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 21/01/2022) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19899228 Country of ref document: EP Kind code of ref document: A1 |