CN109670581B - Computing device and board card


Info

Publication number
CN109670581B
Authority
CN
China
Prior art keywords
gate
input
result
data
processing circuit
Prior art date
Legal status
Active
Application number
CN201811579542.3A
Other languages
Chinese (zh)
Other versions
CN109670581A (en)
Inventor
Name withheld at the inventor's request
Current Assignee
Cambricon Technologies Corp Ltd
Original Assignee
Cambricon Technologies Corp Ltd
Priority date
Filing date
Publication date
Application filed by Cambricon Technologies Corp Ltd filed Critical Cambricon Technologies Corp Ltd
Priority to CN201811579542.3A priority Critical patent/CN109670581B/en
Publication of CN109670581A publication Critical patent/CN109670581A/en
Priority to PCT/CN2019/105932 priority patent/WO2020125092A1/en
Application granted granted Critical
Publication of CN109670581B publication Critical patent/CN109670581B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a computing device and a board card. The computing device is used for performing LSTM operations. The board card includes: a storage device, an interface device, a control device, and a neural network chip, wherein the neural network chip includes the computing device; the storage device is used for storing data; the interface device is used for implementing data transmission between the chip and external equipment; and the control device is used for monitoring the state of the chip. The computing device provided by the application has the advantage of low power consumption.

Description

Computing device and board card
Technical Field
The application relates to the technical field of information processing, in particular to a computing device and a board card.
Background
A long short-term memory (LSTM) network is a type of recurrent neural network (RNN) that, owing to the unique structural design of the network itself, is well suited to processing and predicting important events separated by very long and uncertain intervals in a time series. LSTM networks exhibit better performance than traditional recurrent neural networks and are well suited to learning from experience to classify, process, and predict time series in which significant events are separated by intervals of unknown length. LSTM networks are currently used in many fields, such as speech recognition, video description, machine translation, and automatic music synthesis.
Existing LSTM networks are implemented on general-purpose processors, and executing LSTM operations on such processors consumes a large amount of energy.
Disclosure of Invention
The embodiments of the present application provide a computing device and related products, which can increase the processing speed of LSTM operations and save power consumption.
In a first aspect, a computing device is provided for performing LSTM operations, the LSTM comprising: an input gate, a forget gate, an output gate, and an update state gate, the computing device comprising: an operation unit, a controller unit, and a storage unit;
the storage unit is used for storing the LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t;
the controller unit is used for acquiring the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator, and for sending the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator to the operation unit;
the operation unit is used for executing the operation of the input gate, the operation of the forget gate, the operation of the output gate, and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operator to obtain an output result for each gate, and for obtaining the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
Optionally, the operation unit includes: a master processing circuit and a slave processing circuit;
the controller unit is specifically configured to construct a plurality of splitting operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator according to the LSTM operator;
the main processing circuit is specifically configured to reorder the input data X_t, the weight data, and the input state value according to the sorting operators, the weight data comprising the weight data of each gate; to broadcast the weight data of each gate to the slave processing circuit according to the splitting operators; to split the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks; and to distribute the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuit;
the slave processing circuit is used for performing multiplication operations on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result for each gate, performing multiplication operations on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result for each gate, and transmitting the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit;
the main processing circuit is used for sorting the intermediate results of each gate according to the sorting operators to obtain a sorting result for each gate, performing an offset (bias) operation on the sorting result of each gate according to the addition operator to obtain an operation result for each gate, sorting the state intermediate results of each gate according to the sorting operators to obtain a state sorting result for each gate, and performing an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result for each gate; and for adding the operation result of each gate to the corresponding state operation result of that gate according to the addition operator and then performing subsequent processing to obtain the output result of each gate.
Optionally, the main processing circuit is specifically configured to multiply the input state value C_{t-1} by the output result f_t of the forget gate according to the multiplication operator to obtain a first result, to multiply the output result g_t of the update state gate by the output result i_t of the input gate according to the multiplication operator to obtain a second result, and to add the first result and the second result to obtain the output state value C_t.
Optionally, the main processing circuit is specifically configured to perform an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and to multiply the output result O_t of the output gate by the activation result to obtain the output result h_t.
Optionally, the subsequent processing specifically includes:
for the input gate, the forget gate, and the output gate, the subsequent processing is a sigmoid operation;
for the update state gate, the subsequent processing is a tanh activation operation.
Optionally, the main processing circuit is further configured to use the output data h_t as the input result of the next time and the output state value C_t as the input state value of the next time.
Optionally, when the number of slave processing circuits is plural, the operation unit includes: a tree module, the tree module comprising: a root port and a plurality of branch ports, wherein the root port of the tree module is connected to the main processing circuit, and the branch ports of the tree module are respectively connected to one of the plurality of slave processing circuits;
the tree module is used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
Optionally, if the number of the slave processing circuits is plural, the operation unit further includes one or more branch processing circuits, each branch processing circuit is connected to at least one slave processing circuit,
the branch processing circuit is used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
Optionally, if the number of slave processing circuits is plural, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits, the k slave processing circuits being: the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1;
the k slave processing circuits are used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
Optionally, the main processing circuit includes: a conversion processing circuit;
the conversion processing circuit is used for performing conversion processing on data, specifically: converting the data received by the main processing circuit between a first data structure and a second data structure.
Optionally, the slave processing circuit includes: a multiplication processing circuit and an accumulation processing circuit;
the multiplication processing circuit is used for multiplying the element values in a received input data block by the element values at the corresponding positions in the weight of each gate to obtain a product result for each gate, and for multiplying the element values in a received input state data block by the element values at the corresponding positions in the weight of each gate to obtain another product result for each gate;
the accumulation processing circuit is used for performing an accumulation operation on the product result of each gate to obtain the intermediate result of that gate, and for performing an accumulation operation on the other product result of each gate to obtain the state intermediate result of that gate.
Optionally, the tree module has an n-ary tree structure, where n is an integer greater than or equal to 2.
In a second aspect, an embodiment of the present application provides an LSTM operation device, which includes one or more of the computing devices provided in the first aspect and is configured to obtain data to be operated on and control information from other processing devices, perform the specified LSTM operation, and transmit the execution result to the other processing devices through an I/O interface;
when the LSTM operation device includes a plurality of the computing devices, the computing devices may be connected to one another and transmit data through a specific structure;
for example, the computing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale LSTM operations; the computing devices may share the same control system or have their own control systems; the computing devices may share memory or have their own memories; and the interconnection topology of the computing devices may be arbitrary.
In a third aspect, a combination processing device is provided, where the combination processing device includes the LSTM computation device of the second aspect, a universal interconnect interface, and other processing devices;
and the LSTM operation device interacts with the other processing devices to jointly complete the calculation operation designated by the user.
In a fourth aspect, a neural network chip is provided, where the neural network chip includes the computing device provided in the first aspect or the LSTM computing device provided in the second aspect or the combination processing device provided in the third aspect.
In a fifth aspect, there is provided an electronic device comprising a chip as provided in the fourth aspect.
In a sixth aspect, there is provided a board card comprising: a storage device, an interface device, a control device, and the neural network chip provided in the fourth aspect;
the neural network chip is respectively connected with the storage device, the control device and the interface device;
the storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip.
In a seventh aspect, an embodiment of the present application further provides an LSTM operation method, the LSTM comprising: an input gate, a forget gate, an output gate, and an update state gate; the method is applied to a computing device comprising: an operation unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t.
The method comprises the following steps:
the controller unit acquires the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator, and sends the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator to the operation unit;
the operation unit executes the operation of the input gate, the operation of the forget gate, the operation of the output gate, and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operator to obtain the output result of each gate, and obtains the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
In some embodiments, the electronic device comprises a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
In some embodiments, the vehicle comprises an aircraft, a ship, and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an LSTM structure.
Fig. 2 is a schematic structural diagram of a computing device according to an embodiment of the present application.
Fig. 2a is a schematic structural diagram of an arithmetic unit according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of another computing device provided herein.
Fig. 3a is a schematic diagram of a main processing circuit provided in the present application.
Fig. 4a is a schematic structural diagram of a transmitting end of a tree module provided in the present application.
Fig. 4b is a schematic structural diagram of a receiving end of a tree module provided in the present application.
Fig. 4c is a schematic diagram of a binary tree structure provided herein.
FIG. 5 is a block diagram of a computing device provided in one embodiment of the present application.
Fig. 6 is a flowchart of an LSTM operation method according to an embodiment of the present application.
Fig. 7 is a block diagram of a combination processing apparatus according to an embodiment of the present application.
Fig. 8 is a block diagram of another combination processing apparatus according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a board according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an LSTM. As shown in FIG. 1, the LSTM comprises: an input gate, a forget gate, an update state unit, and an output gate, whose corresponding calculation formulas are as follows:
f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
g_t = tanh(W_g [h_{t-1}, x_t] + b_g)
C_t = C_{t-1} ⊙ f_t + g_t ⊙ i_t
O_t = σ(W_o [h_{t-1}, x_t] + b_o)
h_t = O_t ⊙ tanh(C_t)
where X_t is the input data at time t, h_{t-1} is the output data at time t-1, W_f, W_i, W_g, and W_o respectively denote the weight matrices corresponding to the forget gate, the input gate, the update state unit, and the output gate, and b_f, b_i, b_g, and b_o respectively denote the biases corresponding to the forget gate, the input gate, the update state unit, and the output gate; f_t denotes the output of the forget gate, which is element-wise multiplied with the state unit value at time t-1 to select the state unit values to be forgotten; i_t denotes the output of the input gate, which is element-wise multiplied with the candidate state value obtained at time t to selectively add the candidate state value at time t to the updated state unit; g_t denotes the candidate state value computed at time t; C_t denotes the new state value obtained by selectively forgetting the state value at time t-1 and selectively adding the candidate state value at time t; C_t is used when computing the final output at the current time and is also transmitted to the next time; O_t denotes the selection condition for the part of the update state unit that needs to be output as the result at time t; h_t denotes the output at time t, which is also transmitted to the next time (i.e., time t+1); ⊙ denotes the element-wise product of vectors; σ is the sigmoid function, computed as σ(x) = 1 / (1 + e^(-x)); and the activation function tanh is computed as tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). In the concrete calculation, the present application splices W_f, W_i, W_g, and W_o into one matrix W, and combines b_f, b_i, b_g, and b_o into one vector b.
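For reference, the formulas above can be traced in the following minimal Python (NumPy) sketch of one LSTM time step using the spliced matrix W and combined bias b mentioned above; the function name and the assumed [f, i, g, o] ordering of the gate blocks inside W are illustrative choices, not something specified by this application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step.
    W: spliced weight matrix of shape (4*H, H + D), gate blocks assumed ordered [f, i, g, o].
    b: combined bias vector of shape (4*H,)."""
    z = W @ np.concatenate([h_prev, x_t]) + b   # W·[h_{t-1}, x_t] + b for all four gates at once
    H = h_prev.shape[0]
    f_t = sigmoid(z[0*H:1*H])                   # forget gate
    i_t = sigmoid(z[1*H:2*H])                   # input gate
    g_t = np.tanh(z[2*H:3*H])                   # update state (candidate) value
    o_t = sigmoid(z[3*H:4*H])                   # output gate
    c_t = c_prev * f_t + g_t * i_t              # C_t = C_{t-1} ⊙ f_t + g_t ⊙ i_t
    h_t = o_t * np.tanh(c_t)                    # h_t = O_t ⊙ tanh(C_t)
    return h_t, c_t

# Example usage: hidden size 4, input size 3.
H, D = 4, 3
rng = np.random.default_rng(0)
W, b = rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
h_t, c_t = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
```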
Referring to fig. 2, fig. 2 is a schematic diagram of a computing device provided in the present application. As shown in fig. 2, a computing device for performing LSTM operations is provided, the computing device comprising: a controller unit 11, an operation unit 12, and a storage unit 10, wherein the controller unit 11 is connected to the operation unit 12 and the storage unit 10, and the operation unit 12 includes: a master processing circuit 101 and a slave processing circuit 102 (there may be one or more slave processing circuits, preferably a plurality of slave processing circuits);
It should be noted that the main processing circuit itself includes a memory (e.g., an internal memory or a register) that can store some data of the main processing circuit, and the slave processing circuit may optionally carry its own memory.
The LSTM includes: an input gate, a forget gate, an output gate, and an update state gate;
the storage unit 10 is used for storing the LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t;
the controller unit 11 is used for acquiring the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator, and for sending the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator to the operation unit;
the operation unit 12 is used for executing the operation of the input gate, the operation of the forget gate, the operation of the output gate, and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operator to obtain the output result of each gate, and for obtaining the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
Optionally, the controller unit is specifically configured to construct a plurality of splitting operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator according to the LSTM operator;
the main processing circuit is specifically configured to reorder the input data X_t, the weight data, and the input state value according to the sorting operators, the weight data comprising the weight data of each gate; to broadcast the weight data of each gate to the slave processing circuit according to the splitting operators; to split the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks; and to distribute the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuit;
the slave processing circuit is used for performing multiplication operations on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result for each gate, performing multiplication operations on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result for each gate, and transmitting the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit.
It should be noted that the operations of the gates are relatively independent and their calculation results are relatively independent; that is, each gate has its own weight data, e.g., W_f, W_i, W_g, and W_o respectively represent the weight data of the four gates.
Performing multiplication operations on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain the intermediate result of each gate may specifically include:
performing multiplication operations on the plurality of input data blocks and the input gate weight data to obtain the intermediate result of the input gate, performing multiplication operations on the plurality of input data blocks and the output gate weight data to obtain the intermediate result of the output gate, performing multiplication operations on the plurality of input data blocks and the forget gate weight data to obtain the intermediate result of the forget gate, and performing multiplication operations on the plurality of input data blocks and the update state gate weight data to obtain the intermediate result of the update state gate. The state intermediate results of the gates are obtained similarly and are not described again here.
The main processing circuit is used for sorting the intermediate results of each gate according to the sorting operators to obtain a sorting result for each gate, performing an offset (bias) operation on the sorting result of each gate according to the addition operator to obtain an operation result for each gate, sorting the state intermediate results of each gate according to the sorting operators to obtain a state sorting result for each gate, and performing an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result for each gate; and for adding the operation result of each gate to the corresponding state operation result of that gate according to the addition operator and then performing subsequent processing to obtain the output result of each gate.
According to the above technical scheme, the operation unit is organized in a master-slave structure. For the forward operation of the LSTM, the input data of the current time and the output data passed through the forget gate are split and processed in parallel, so that the computation-intensive part can be executed in parallel by the main processing circuit and the slave processing circuits; this increases the operation speed, saves operation time, and thereby reduces power consumption.
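The master-slave flow described above can be pictured with the following Python sketch, a simplified software analogue rather than the hardware itself; the number of slaves, the row-wise block split, and the helper names are illustrative assumptions. The master splits the input into blocks and broadcasts the per-gate weights, each slave multiplies its block by those weights, and the master reassembles the intermediate results, applies the bias (offset operation), and performs the subsequent activation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def master_slave_gate(x, W_gate, b_gate, num_slaves=4, activation=sigmoid):
    # Master: split the input rows into blocks and "distribute" them to the slaves.
    blocks = np.array_split(x, num_slaves, axis=0)        # input data blocks
    # Slaves: each multiplies its block by the broadcast gate weight data.
    partials = [blk @ W_gate for blk in blocks]           # per-gate intermediate results
    # Master: reassemble (sort) the intermediate results, add the bias (offset
    # operation), then apply the gate's subsequent processing (sigmoid or tanh).
    z = np.concatenate(partials, axis=0) + b_gate
    return activation(z)

# Example: forget-gate output for a batch of 8 inputs of width 16, output width 32.
x = np.random.randn(8, 16)
W_f, b_f = np.random.randn(16, 32), np.zeros(32)
f_t = master_slave_gate(x, W_f, b_f)
```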
Optionally, the main processing circuit is specifically configured to multiply the input state value C_{t-1} by the output result f_t of the forget gate according to the multiplication operator to obtain a first result, to multiply the output result g_t of the update state gate by the output result i_t of the input gate according to the multiplication operator to obtain a second result, and to add the first result and the second result to obtain the output state value C_t.
Optionally, the main processing circuit is specifically configured to perform an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and to multiply the output result O_t of the output gate by the activation result to obtain the output result h_t.
Optionally, the subsequent processing specifically includes:
for the input gate, the forget gate, and the output gate, the subsequent processing is a sigmoid operation;
for the update state gate, the subsequent processing is a tanh activation operation.
Optionally, the main processing circuit is further configured to use the output data h_t as the input result of the next time and the output state value C_t as the input state value of the next time.
The LSTM may include a plurality of hidden layers, the number of hidden layers being an integer greater than or equal to 2, and the operation of any intermediate hidden layer of the LSTM may be performed over a plurality of time steps. In the forward operation, when the execution of the previous time t-1 is completed, the output result of time t-1 is obtained. At the current time t, the output result of time t-1 is used as input data of the forget gate, and the forget gate determines the pass rate of the output result of time t-1 through the sigmoid function; the result passed through the forget gate at time t forms one part of the input neurons, which is multiplied by the corresponding weights. The input data of the input layer at time t forms the other part of the input neurons, which is likewise multiplied by its weights. The two products are added to obtain the output result at time t, and the output result at time t is in turn used as input data of the forget gate at the next time t+1, so that the pass rate of the previous time can again be selectively determined.
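To make this chaining over time concrete, the following Python sketch (an illustrative, self-contained software reference that repeats the cell math of FIG. 1 inside a loop; the gate ordering inside the spliced matrix W is again an assumption) shows how h_t and C_t of one time step become the input result and input state value of the next:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(xs, W, b, h0, c0):
    """Run the LSTM over a sequence: the output data h_t and the output state
    value C_t of step t are fed in as the input result and the input state
    value of step t+1 (gate blocks in W assumed ordered [f, i, g, o])."""
    h, c = h0, c0
    H = h0.shape[0]
    outputs = []
    for x_t in xs:
        z = W @ np.concatenate([h, x_t]) + b
        f_t, i_t = sigmoid(z[:H]), sigmoid(z[H:2*H])
        g_t, o_t = np.tanh(z[2*H:3*H]), sigmoid(z[3*H:4*H])
        c = c * f_t + g_t * i_t        # new state value C_t
        h = o_t * np.tanh(c)           # output h_t, passed on to the next time step
        outputs.append(h)
    return outputs, h, c
```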
Optionally, the computing device may further include a direct memory access unit 50, and the storage unit 10 may include one of, or any combination of, a register and a cache. Specifically, the cache is used for storing the computation operator, the register is used for storing the input data and scalars, and the cache is a scratch pad cache. The direct memory access unit 50 is used for reading data from or storing data into the storage unit 10.
Optionally, the controller unit includes: an operator storage unit 110, an operator processing unit 111, and a store queue unit 113;
an operator storage unit 110, configured to store the computation operator associated with the LSTM operation;
the operator processing unit 111 is configured to parse the computation operator to obtain a plurality of computation operators;
a store queue unit 113 for storing an operator queue, the operator queue comprising: a plurality of operators or computation operators to be executed in the order of the queue.
Optionally, the controller unit may further include:
a dependency relationship processing unit 108, configured to determine, when there are a plurality of operators, whether a first operator has an association relationship with a zeroth operator that precedes the first operator; if the first operator has an association relationship with the zeroth operator, the first operator is cached in the operator storage unit, and after the zeroth operator has been executed, the first operator is extracted from the operator storage unit and transmitted to the operation unit.
Determining whether the first operator has an association relationship with the zeroth operator that precedes it comprises:
extracting a first storage address interval of the data (for example, a matrix) required by the first operator according to the first operator, and extracting a zeroth storage address interval of the matrix required by the zeroth operator according to the zeroth operator; if the first storage address interval and the zeroth storage address interval have an overlapping area, determining that the first operator has an association relationship with the zeroth operator; and if the first storage address interval and the zeroth storage address interval have no overlapping area, determining that the first operator has no association relationship with the zeroth operator.
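A minimal Python sketch of this address-interval overlap test (the interval representation and the function name are assumptions made for illustration):

```python
def has_dependency(first_interval, zeroth_interval):
    """Each interval is (start_address, end_address), end exclusive (assumed).
    The first operator depends on the zeroth iff their storage intervals overlap."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start < zeroth_end and zeroth_start < first_end

# Example: the first operator needs [0x100, 0x200) and the zeroth needs [0x180, 0x280);
# the intervals overlap, so the first operator must wait until the zeroth has executed.
print(has_dependency((0x100, 0x200), (0x180, 0x280)))  # True
```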
In an alternative embodiment, the operation unit 12 may include one master processing circuit 101 and a plurality of slave processing circuits 102, as shown in fig. 3. In one embodiment, as shown in fig. 3, the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to k slave processing circuits of the plurality of slave processing circuits. As shown in fig. 3, the k slave processing circuits include only the n slave processing circuits of row 1, the n slave processing circuits of row m, and the m slave processing circuits of column 1; that is, the k slave processing circuits are the slave processing circuits, among the plurality of slave processing circuits, that are directly connected to the master processing circuit.
The k slave processing circuits are used for forwarding data (which may be input data blocks, input state data blocks, intermediate results, state intermediate results, etc.) and operators between the master processing circuit and the plurality of slave processing circuits.
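As a small illustration of which array positions those k slave processing circuits occupy, the following Python sketch (the 1-indexed grid coordinates and the function name are illustrative assumptions) enumerates row 1, row m, and column 1 of an m x n array:

```python
def directly_connected_slaves(m, n):
    """Grid coordinates (row, col), 1-indexed, of the k slave processing circuits
    that are directly connected to the master: row 1, row m, and column 1."""
    k_slaves = set()
    for col in range(1, n + 1):
        k_slaves.add((1, col))   # the n slave processing circuits of row 1
        k_slaves.add((m, col))   # the n slave processing circuits of row m
    for row in range(1, m + 1):
        k_slaves.add((row, 1))   # the m slave processing circuits of column 1
    return sorted(k_slaves)

# For a 4 x 4 array, k = 10 distinct directly connected slave processing circuits.
print(len(directly_connected_slaves(4, 4)))  # 10
```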
Optionally, as shown in fig. 3a, the main processing circuit may further include: one of, or any combination of, a conversion processing circuit 110, an activation processing circuit 111, and an addition processing circuit 112;
the conversion processing circuit 110 is configured to perform conversion processing on data, specifically, to convert the data received by the main processing circuit (including but not limited to the input data X_t, the weight data, i.e. the weight of each gate, the input state value C_{t-1}, and the input result h_{t-1}) between a first data structure and a second data structure (e.g., a conversion between continuous data and discrete data, or between floating-point data and fixed-point data);
An activation processing circuit 111 for performing an activation operation of data in the main processing circuit;
the addition processing circuit 112 is used for executing addition operation or accumulation operation.
In another embodiment, the operator is a matrix multiplication operator, an accumulation operator, an activation operator, or the like.
In an alternative embodiment, as shown in fig. 4a, the operation unit comprises: a tree module 40, the tree module comprising: a root port 401 and a plurality of branch ports 404, wherein the root port of the tree module is connected to the main processing circuit and each branch port of the tree module is connected to one of the plurality of slave processing circuits;
The tree module has transmitting and receiving functions; fig. 4a shows the transmitting function and fig. 4b shows the receiving function.
The tree module is configured to forward data (the data may be an input data block, an input status data block, an intermediate result, a status intermediate result, etc.) between the master processing circuit and the plurality of slave processing circuits.
Alternatively, the tree module is an optional component of the computing device and may include at least one layer of nodes; a node is a line structure with a forwarding function and may not itself have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
Alternatively, the tree module may have an n-ary tree structure, for example the binary tree structure shown in fig. 4c, or a ternary tree structure, where n may be an integer greater than or equal to 2. The specific embodiment of the present application does not limit the specific value of n. The number of layers may be 2, and the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example to the nodes of the last layer as shown in fig. 4c.
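As an illustration of how a binary tree module can gather the partial results of the slave processing circuits back to the main processing circuit, here is a small Python sketch; it is purely illustrative, and the pairwise addition is an assumed example of the kind of merging such a forwarding structure enables:

```python
def tree_gather(partial_results):
    """Combine slave partial results level by level, the way a binary tree
    with the main processing circuit at the root would forward and merge them."""
    level = list(partial_results)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            next_level.append(sum(level[i:i + 2]))  # merge two child results into their parent
        level = next_level
    return level[0]

# Example: gather the partial sums produced by 8 slave processing circuits.
print(tree_gather([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```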
Alternatively, the operation unit may carry a separate cache; as shown in fig. 2a, it may include: a neuron caching unit 63, which caches the input neuron vector data and the output neuron value data of the slave processing circuit.
As shown in fig. 2a, the operation unit may further include: a weight caching unit 64 for caching the weight data required by the slave processing circuit during computation.
In an alternative embodiment, the operation unit 12 may include a branch processing circuit 103 as shown in fig. 5; its specific connection structure is shown in fig. 5, wherein
the branch processing circuit 103 may include a memory, as shown in fig. 5, and the size of this memory may be between 2 and 2.5 times the maximum data capacity that a single slave processing circuit needs to store. With this arrangement the slave processing circuits do not need to be provided with their own memories: one branch processing circuit only needs a capacity of about 2.5 times that required by a single slave processing circuit, whereas without the branch processing circuit about 4 times that capacity would have to be provided for the slave processing circuits it serves, with low register utilization. This structure therefore effectively reduces the total memory capacity and reduces cost.
The branch processing circuitry is configured to forward data (which may be input data blocks, input state data blocks, intermediate results, state intermediate results, etc.) between the master processing circuitry and the plurality of slave processing circuitry.
The splitting of the input data is described below by way of example (the splitting of the input state data may be performed in the same way, since the two have the same data type). Assuming the data type is a matrix, if the value of H of a matrix H_W is smaller than a set threshold, for example 100, the matrix H_W may be split in the H direction into H vectors (each vector being one row of the matrix H_W), each vector being one input data block, and the position of the first element of each input data block is marked, i.e., an input data block is denoted input data block_{h,w}, where h and w are respectively the index of the first element of the input data block in the H direction and in the W direction; for example, for the first input data block, h = 1 and w = 1. After receiving an input data block_{h,w}, the slave processing circuit multiplies the elements of the input data block_{h,w} with the corresponding elements of each column of the weight and accumulates the products to obtain an intermediate result_{w,i}, where w is the w value of the input data block and i is the index of the weight column with which the input data block is computed; the main processing circuit determines that the position of this intermediate result in the operation result of the corresponding gate is (w, i). For example, input data block_{1,1} computed with the first column of the weight yields intermediate result_{1,1}, and the main processing circuit arranges intermediate result_{1,1} in the first row and first column of the operation result of the corresponding gate.
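A rough Python sketch of this row-wise splitting and multiply-accumulate placement follows; it is a software illustration only, the row index of each block is assumed to determine the row of the result, and the helper names are made up:

```python
import numpy as np

def split_and_compute(x, W_gate):
    """Split matrix x (H rows, W columns) into H row-vector input data blocks,
    multiply-accumulate each block with every column of the gate weight, and
    place the intermediate result for (block row, column i) at position
    (row, i) of the gate's operation result (the row-based placement is assumed)."""
    H, _ = x.shape
    result = np.zeros((H, W_gate.shape[1]))
    blocks = [x[h, :] for h in range(H)]              # one input data block per row
    for h, block in enumerate(blocks):                # work done by the slave processing circuits
        for i in range(W_gate.shape[1]):
            result[h, i] = np.dot(block, W_gate[:, i])  # multiply and accumulate
    return result                                     # equals x @ W_gate

x = np.random.randn(3, 5)
W = np.random.randn(5, 4)
assert np.allclose(split_and_compute(x, W), x @ W)
```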
The present application further provides an LSTM operation method applied to a computing device. The LSTM comprises: an input gate, a forget gate, an output gate, and an update state gate; the computing device comprises: an operation unit, a controller unit, and a storage unit; the storage unit stores: the LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1}, and an output state value C_t. The method comprises the following steps:
Step S601: the controller unit acquires the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator, and sends the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1}, and the LSTM operator to the operation unit;
Step S602: the operation unit executes the operation of the input gate, the operation of the forget gate, the operation of the output gate, and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operator to obtain the output result of each gate, and obtains the output data h_t and the output state value C_t according to the input state value C_{t-1} and the output result of each gate.
Optionally, the operation unit includes: a master processing circuit and a slave processing circuit; the executing, by the operation unit, of the operation of the input gate, the operation of the forget gate, the operation of the output gate, and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1}, and the LSTM operator to obtain the output result of each gate specifically includes:
the controller unit constructs a plurality of splitting operators, a plurality of sorting operators, a multiplication operator, an activation operator, and an addition operator according to the LSTM operator;
the main processing circuit reorders the input data X_t, the weight data, and the input state value according to the sorting operators, the weight data comprising the weight data of each gate; broadcasts the weight data of each gate to the slave processing circuit according to the splitting operators; splits the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks; and distributes the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuit;
the slave processing circuit performs multiplication operations on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result for each gate, performs multiplication operations on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result for each gate, and sends the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit;
the main processing circuit sorts the intermediate results of each gate according to the sorting operators to obtain a sorting result for each gate, performs an offset (bias) operation on the sorting result of each gate according to the addition operator to obtain an operation result for each gate, sorts the state intermediate results of each gate according to the sorting operators to obtain a state sorting result for each gate, and performs an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result for each gate; and adds the operation result of each gate to the corresponding state operation result of that gate according to the addition operator and then performs the subsequent processing to obtain the output result of each gate.
Optionally, obtaining the output state value C_t according to the input state value C_{t-1} and the output result of each gate specifically includes:
the main processing circuit multiplies the input state value C_{t-1} by the output result f_t of the forget gate according to the multiplication operator to obtain a first result, multiplies the output result g_t of the update state gate by the output result i_t of the input gate according to the multiplication operator to obtain a second result, and adds the first result and the second result to obtain the output state value C_t.
Optionally, obtaining the output data h_t according to the input state value C_{t-1} and the output result of each gate specifically includes:
the main processing circuit performs an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and multiplies the output result O_t of the output gate by the activation result to obtain the output result h_t.
The application further discloses an LSTM device, which includes one or more of the computing devices mentioned in the application and is used for acquiring data to be operated on and control information from other processing devices, executing the specified LSTM operation, and transmitting the execution result to peripheral equipment through an I/O interface. Peripheral equipment includes, for example, cameras, displays, mice, keyboards, network cards, WiFi interfaces, and servers. When more than one computing device is included, the computing devices may be linked to one another and transmit data through a specific structure, for example interconnected via a PCIE bus, to support larger-scale LSTM operations. In this case, the computing devices may share the same control system or have independent control systems; they may share memory, or each may have its own memory. In addition, the interconnection topology may be arbitrary.
The LSTM device has higher compatibility and can be connected with various types of servers through PCIE interfaces.
The application also discloses a combined processing device, which comprises the LSTM device, a universal interconnection interface and other processing devices. The LSTM operation device interacts with other processing devices to jointly complete the operation designated by the user. Fig. 7 is a schematic diagram of a combination processing apparatus.
The other processing devices may include one or more types of general-purpose or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor; the number of processors included in the other processing devices is not limited. The other processing devices serve as the interface between the LSTM operation device and external data and control, perform data transfer, and complete basic control of the LSTM operation device such as starting and stopping; the other processing devices may also cooperate with the LSTM operation device to complete computing tasks.
And the universal interconnection interface is used for transmitting data and control operators between the LSTM device and other processing devices. The LSTM device acquires required input data from other processing devices and writes the required input data into a storage device on the LSTM device chip; control operators can be obtained from other processing devices and written into a control cache on the LSTM device chip; the data in the memory module of the LSTM device may also be read and transferred to other processing devices.
Optionally, as shown in fig. 8, the structure may further include a storage device, where the storage device is connected to the LSTM device and the other processing device, respectively. The storage device is used for storing the data in the LSTM device and the other processing devices, and is particularly suitable for the data which is required to be operated and cannot be stored in the LSTM device or the other processing devices.
The combined processing device can serve as the SOC (system on chip) of equipment such as mobile phones, robots, unmanned aerial vehicles, and video monitoring equipment, effectively reducing the core area of the control part, increasing the processing speed, and reducing the overall power consumption. In this case, the universal interconnect interface of the combined processing device is connected to certain components of the equipment, such as cameras, displays, mice, keyboards, network cards, and WiFi interfaces.
In some embodiments, a chip is also claimed that includes the above LSTM device or combination processing device.
In some embodiments, a chip package structure is disclosed, which includes the chip.
In some embodiments, a board card is provided that includes the chip package structure described above. Referring to fig. 9, fig. 9 provides a board that may include other mating components in addition to the chip 389, including but not limited to: a memory device 390, an interface device 391 and a control device 392;
The memory device 390 is connected to the chip in the chip package structure through a bus and is used for storing data. The memory device may include multiple groups of storage units 393. Each group of storage units is connected to the chip through a bus. It will be appreciated that each group of storage units may be DDR SDRAM (double data rate synchronous dynamic random access memory).
DDR can double the speed of SDRAM without increasing the clock frequency: DDR allows data to be read on both the rising and falling edges of the clock pulse, so it is twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of the storage units, and each group of storage units may include a plurality of DDR4 particles (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers, of which 64 bits are used for data transfer and 8 bits are used for ECC checking. It will be appreciated that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transfer can reach 25600 MB/s.
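The 25600 MB/s figure can be checked with a quick calculation, a back-of-the-envelope sketch assuming the 64 data bits of one controller and the 3200 MT/s transfer rate of DDR4-3200:

```python
transfers_per_second = 3200 * 10**6   # DDR4-3200: 3200 mega-transfers per second
data_bits_per_transfer = 64           # 64 of the 72 controller bits carry data, 8 are ECC
bytes_per_transfer = data_bits_per_transfer // 8

bandwidth_mb_per_s = transfers_per_second * bytes_per_transfer / 10**6
print(bandwidth_mb_per_s)  # 25600.0 MB/s per 64-bit DDR4-3200 channel
```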
In one embodiment, each set of memory cells includes a plurality of double rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. And a controller for controlling DDR is arranged in the chip and is used for controlling data transmission and data storage of each storage unit.
The interface device is electrically connected to the chip in the chip package structure. The interface device is used for implementing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface, and the data to be processed is transferred from the server to the chip through the standard PCIE interface to implement the data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the switching function. In addition, the calculation result of the chip is transmitted back to the external device (e.g., the server) by the interface device.
The control device is electrically connected with the chip. The control device is used for monitoring the state of the chip. Specifically, the chip and the control device may be electrically connected through an SPI interface. The control device may comprise a single chip microcomputer (Micro Controller Unit, MCU). The chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may drive a plurality of loads. Therefore, the chip can be in different working states such as multi-load and light-load. The control device can realize the regulation and control of the working states of a plurality of processing chips, a plurality of processing circuits and/or a plurality of processing circuits in the chip.
In some embodiments, an electronic device is provided that includes the above board card.
The electronic device includes a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present application is not limited by the order of actions described, as some steps may be performed in other order or simultaneously in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required in the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, such as the division of the units, merely a logical function division, and there may be additional manners of dividing the actual implementation, such as multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, or may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules, may be stored in a computer-readable memory for sale or use as a stand-alone product. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in whole or in part in the form of a software product stored in a memory, comprising several operators for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable memory, and the memory may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The foregoing describes the embodiments of the present application in detail. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those skilled in the art may make modifications to the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (27)

1. A computing device for performing LSTM operations, the LSTM comprising: an input gate, a forget gate, an output gate and an update state gate, and the computing device comprising: an operation unit, a controller unit and a storage unit;
The storage unit is used for storing an LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1} and an output state value C_t;
The controller unit is used for acquiring the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1} and the LSTM operator, and transmitting the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1} and the LSTM operator to the operation unit;
the operation unit is used for performing the operation of the input gate, the operation of the forget gate, the operation of the output gate and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1} and the LSTM operator to obtain an output result of each gate, and for obtaining the output data h_t and the output state value C_t based on the input state value C_{t-1} and the output result of each gate; the operation unit comprises: a master processing circuit and a slave processing circuit;
the controller unit is specifically configured to construct a plurality of splitting operators, a plurality of sorting operators, a multiplication operator, an activation operator and an addition operator according to the LSTM operator;
the main processing circuit is specifically configured to input data X according to the sorting operator t Reordering weight data and input state values, the weight data comprising: the weight data of each gate is broadcast to a slave processing circuit according to a splitting algorithm, input data and input state values are split into a plurality of input data blocks and a plurality of input state data blocks, and the plurality of input data blocks and the plurality of input state data blocks are distributed to the slave processing circuit;
The slave processing circuit is used for performing multiplication on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, performing multiplication on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and transmitting the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit;
the main processing circuit is used for sorting the intermediate results of each gate according to the sorting operator to obtain a sorting result of each gate, performing an offset operation on the sorting result of each gate according to the addition operator to obtain an operation result of each gate, sorting the state intermediate results of each gate according to the sorting operator to obtain a state sorting result of each gate, and performing an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result of each gate; and for correspondingly adding the operation result of each gate and the state operation result of each gate according to the addition operator and then performing subsequent processing to obtain the output result of each gate.
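For illustration only, the data flow recited above (split the input data and input state into blocks, broadcast the gate weights, multiply block-wise on the slave circuits, then gather, accumulate, add the bias and activate on the main circuit) can be modelled in a few lines of NumPy. This is a software sketch under assumptions, not the patented circuitry: the function name gate_forward, the block count num_slaves and all shapes are hypothetical, and the sorting step is modelled simply as gathering and summing the partial results.

```python
import numpy as np

def gate_forward(x_t, h_prev, W_x, W_h, bias, activation, num_slaves=4):
    """Software model of one gate's split/multiply/accumulate flow (illustrative)."""
    # Master: split the input data and the input state value into blocks.
    x_blocks = np.array_split(x_t, num_slaves)
    h_blocks = np.array_split(h_prev, num_slaves)
    Wx_blocks = np.array_split(W_x, num_slaves, axis=0)   # matching weight slices
    Wh_blocks = np.array_split(W_h, num_slaves, axis=0)

    # Slaves: block-wise multiplications yield the intermediate results.
    partial = [xb @ wxb for xb, wxb in zip(x_blocks, Wx_blocks)]
    partial_state = [hb @ whb for hb, whb in zip(h_blocks, Wh_blocks)]

    # Master: gather the partial results, apply the offset (bias), then activate.
    z = sum(partial) + sum(partial_state) + bias
    return activation(z)

# Example: a forget-gate pass with random, purely illustrative data.
rng = np.random.default_rng(0)
x, h = rng.normal(size=8), rng.normal(size=6)
Wx, Wh, b = rng.normal(size=(8, 6)), rng.normal(size=(6, 6)), np.zeros(6)
f_t = gate_forward(x, h, Wx, Wh, b, activation=lambda z: 1.0 / (1.0 + np.exp(-z)))
```

Passing a sigmoid as the activation models an input, forget or output gate; passing np.tanh models the update state gate (compare claim 4).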
2. The apparatus according to claim 1, wherein
the main processing circuit is specifically configured to multiply the input state value C_{t-1} by the output result f_t of the forget gate according to the multiplication operator to obtain a first result, multiply the output result g_t of the update state gate by the output result i_t of the input gate according to the multiplication operator to obtain a second result, and add the first result and the second result to obtain the output state value C_t.
3. The apparatus according to claim 2, wherein
the main processing circuit is specifically configured to perform an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and multiply the output result O_t of the output gate by the activation result to obtain the output data h_t.
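In conventional LSTM notation, the operations recited in claims 2 and 3 are the standard cell-state and hidden-state updates: the "first result" is f_t ⊙ C_{t-1}, the "second result" is i_t ⊙ g_t, and the "activation result" is tanh(C_t), where ⊙ denotes element-wise multiplication and tanh is one common choice of activation:

```latex
C_t = f_t \odot C_{t-1} + i_t \odot g_t, \qquad h_t = O_t \odot \tanh(C_t)
```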
4. The apparatus according to claim 1, wherein the subsequent processing specifically comprises:
for the input gate, the forget gate and the output gate, the subsequent processing is a sigmoid operation;
for the update state gate, the subsequent processing is a tanh activation operation.
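Read together, claim 4 assigns a sigmoid to the input, forget and output gates and a tanh to the update state gate. A compact, purely illustrative table of that assignment (the gate labels are chosen here, not defined by the claim):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: the "subsequent processing" for the input, forget and output gates.
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical gate-to-activation mapping implied by claim 4.
GATE_ACTIVATIONS = {
    "input": sigmoid,
    "forget": sigmoid,
    "output": sigmoid,
    "update_state": np.tanh,
}
```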
5. The apparatus according to claim 1, wherein
the main processing circuit is further used for taking the output data h_t as the input result at the next moment and the output state value C_t as the input state value at the next moment.
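Claim 5 is the usual LSTM recurrence: the output data h_t and output state value C_t of one moment become the input result and input state value of the next. A self-contained NumPy sketch of that loop with a conventional fused-gate cell follows; the [i, f, o, g] gate ordering, all names and all sizes are assumptions, not taken from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, Wx, Wh, b):
    """One LSTM step with the four gates stacked as [i, f, o, g] (illustrative)."""
    z = x_t @ Wx + h_prev @ Wh + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g          # claim 2: first result + second result
    h_t = o * np.tanh(c_t)            # claim 3: output gate times activated state
    return h_t, c_t

# Recurrence over a toy sequence: feed h_t and C_t back in at the next moment.
d_in, d_h = 3, 4
rng = np.random.default_rng(0)
Wx = rng.normal(size=(d_in, 4 * d_h))
Wh = rng.normal(size=(d_h, 4 * d_h))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):     # five timesteps of made-up input data
    h, c = lstm_cell(x_t, h, c, Wx, Wh, b)
```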
6. The apparatus according to any one of claims 1 to 5, wherein the operation unit comprises a tree module, the tree module comprising: a root port and a plurality of branch ports, the root port of the tree module being connected with the main processing circuit, and each of the plurality of branch ports of the tree module being connected with one of a plurality of slave processing circuits;
The tree module is used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
7. The apparatus according to any one of claims 1-5, wherein the operation unit further comprises one or more branch processing circuits, each branch processing circuit being connected to at least one slave processing circuit,
the branch processing circuit is used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
8. The apparatus of any of claims 1-5, wherein the number of slave processing circuits is a plurality, the plurality of slave processing circuits being distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with k slave processing circuits in the plurality of slave processing circuits, and the k slave processing circuits are: n slave processing circuits of the 1 st row, n slave processing circuits of the m th row, and m slave processing circuits of the 1 st column;
the K slave processing circuits are used for forwarding data and operators between the master processing circuit and the plurality of slave processing circuits.
9. The apparatus of any of claims 1-5, wherein the main processing circuit comprises: a conversion processing circuit;
the conversion processing circuit is used for executing conversion processing on the data, and specifically comprises the following steps: the data received by the main processing circuit is interchanged between the first data structure and the second data structure.
10. The apparatus of any of claims 1-5, wherein the slave processing circuit comprises: a multiplication processing circuit and an accumulation processing circuit;
the multiplication processing circuit is used for performing a product operation on the element values in the received input data block and the element values at corresponding positions in the weight of each gate to obtain a product result of each gate, and performing a product operation on the element values in the received input state data block and the element values at corresponding positions in the weight of each gate to obtain another product result of each gate;
and the accumulation processing circuit is used for executing accumulation operation on the product result of each gate to obtain an intermediate result of each gate, and executing accumulation operation on the other product result of each gate to obtain a state intermediate result of each gate.
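Claim 10 amounts to a multiply-accumulate per data block: element-wise products followed by a sum, i.e. a dot product between the block and the matching weight elements. A toy numeric check with made-up values:

```python
import numpy as np

input_block  = np.array([1.0, 2.0, 3.0])     # illustrative input data block
gate_weights = np.array([0.5, -1.0, 0.25])   # matching weight elements of one gate

products = input_block * gate_weights        # multiplication processing circuit
intermediate_result = products.sum()         # accumulation processing circuit

assert np.isclose(intermediate_result, input_block @ gate_weights)   # 0.5 - 2.0 + 0.75 = -0.75
```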
11. The apparatus of claim 6, wherein the tree module is an n-ary tree structure, and n is an integer greater than or equal to 2.
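Claims 6 and 11 describe an n-ary tree whose root connects to the main processing circuit and whose branches connect to the slave processing circuits, used to forward data and operators. One common use of such a tree interconnect is combining partial results level by level on the way back to the root; the sketch below models that idea in software for n = 2 and is purely illustrative, not the patented structure.

```python
def tree_reduce(values, combine):
    """Combine a list of partial results pairwise, level by level (binary tree)."""
    while len(values) > 1:
        nxt = [combine(values[i], values[i + 1])
               for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # an unpaired value is forwarded to the next level
            nxt.append(values[-1])
        values = nxt
    return values[0]

# e.g. summing intermediate results coming from five hypothetical slave circuits
print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))   # -> 15
```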
12. An LSTM computing device, wherein the LSTM computing device includes one or more computing devices according to any one of claims 1 to 11, configured to obtain data to be computed and control information from other processing devices, perform specified LSTM operations, and transmit an execution result to the other processing devices through an I/O interface;
when the LSTM computing device comprises a plurality of computing devices, the plurality of computing devices may be interconnected and transmit data through a specific structure;
the plurality of computing devices are interconnected and transmit data through a PCIE (Peripheral Component Interconnect Express) bus to support larger-scale LSTM operations; the plurality of computing devices share the same control system or have respective control systems; the plurality of computing devices share a memory or have respective memories; and the interconnection mode of the plurality of computing devices is any interconnection topology;
the operation unit comprises: a master processing circuit and a slave processing circuit;
the controller unit is specifically configured to construct a plurality of splitting operators, a plurality of sorting operators, a multiplication operator, an activation operator and an addition operator according to the LSTM operator;
The main processing circuit is specifically configured to reorder the input data X_t, the weight data and the input state value according to the sorting operator, the weight data comprising weight data of each gate; broadcast the weight data of each gate to the slave processing circuits according to the splitting operator; split the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks; and distribute the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuits;
the slave processing circuit is used for performing multiplication on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, performing multiplication on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and transmitting the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit;
the main processing circuit is used for sorting the intermediate results of each gate according to the sorting operator to obtain a sorting result of each gate, performing an offset operation on the sorting result of each gate according to the addition operator to obtain an operation result of each gate, sorting the state intermediate results of each gate according to the sorting operator to obtain a state sorting result of each gate, and performing an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result of each gate; and for correspondingly adding the operation result of each gate and the state operation result of each gate according to the addition operator and then performing subsequent processing to obtain the output result of each gate.
13. A combination processing device, wherein the combination processing device comprises the LSTM computing device according to claim 12, a universal interconnection interface and other processing devices;
the LSTM computing device interacts with the other processing devices to jointly complete a computing operation specified by the user.
14. The combination processing device according to claim 13, further comprising: a storage device connected to the LSTM computing device and the other processing devices, respectively, and configured to store data of the LSTM computing device and the other processing devices.
15. A neural network chip, characterized in that it comprises a computing device according to claim 1 or an LSTM computing device according to claim 12 or a combination processing device according to claim 14.
16. An electronic device comprising the chip of claim 15.
17. A board card, wherein the board card comprises: a memory device, an interface device, a control device, and the neural network chip according to claim 15;
the neural network chip is respectively connected with the storage device, the control device and the interface device;
The storage device is used for storing data;
the interface device is used for realizing data transmission between the chip and external equipment;
the control device is used for monitoring the state of the chip.
18. The board card according to claim 17, wherein
the memory device comprises a plurality of groups of storage units, each group of storage units being connected with the chip through a bus, and the storage units being DDR SDRAM;
the chip comprises a DDR controller for controlling data transmission to and data storage in each storage unit;
the interface device is a standard PCIE interface.
19. A method of LSTM operation, applied to a computing device, the LSTM comprising: an input gate, a forget gate, an output gate and an update state gate, and the computing device comprising: an operation unit, a controller unit and a storage unit; the storage unit stores an LSTM operator, input data X_t, weight data, output data h_t, an input state value C_{t-1}, an input result h_{t-1} and an output state value C_t;
The method comprises the following steps:
the controller unit acquires the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1} and the LSTM operator, and transmits the input data X_t, the weight data, the input state value C_{t-1}, the input result h_{t-1} and the LSTM operator to the operation unit;
the operation unit performs the operation of the input gate, the operation of the forget gate, the operation of the output gate and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1} and the LSTM operator to obtain an output result of each gate, and obtains the output data h_t and the output state value C_t based on the input state value C_{t-1} and the output result of each gate;
The operation unit comprises: a master processing circuit and a slave processing circuit; and performing, by the operation unit, the operation of the input gate, the operation of the forget gate, the operation of the output gate and the operation of the update state gate according to the input data X_t, the weight data, the input result h_{t-1} and the LSTM operator to obtain the output result of each gate specifically comprises:
the controller unit constructs a plurality of splitting operators, a plurality of sequencing operators, a multiplication operator, an activation operator and an addition operator according to the LSTM operator;
the main processing circuit reorders the input data X_t, the weight data and the input state value according to the sorting operator, the weight data comprising weight data of each gate; broadcasts the weight data of each gate to the slave processing circuits according to the splitting operator; splits the input data and the input state value into a plurality of input data blocks and a plurality of input state data blocks; and distributes the plurality of input data blocks and the plurality of input state data blocks to the slave processing circuits;
The slave processing circuit performs multiplication on the plurality of input data blocks and the weight data of each gate according to the multiplication operator to obtain an intermediate result of each gate, performs multiplication on the plurality of input state data blocks and the weight data of each gate according to the multiplication operator to obtain a state intermediate result of each gate, and sends the intermediate result of each gate and the state intermediate result of each gate to the master processing circuit;
the main processing circuit sorts the intermediate results of each gate according to the sorting operator to obtain a sorting result of each gate, performs an offset operation on the sorting result of each gate according to the addition operator to obtain an operation result of each gate, sorts the state intermediate results of each gate according to the sorting operator to obtain a state sorting result of each gate, and performs an offset operation on the state sorting result of each gate according to the addition operator to obtain a state operation result of each gate; and correspondingly adds the operation result of each gate and the state operation result of each gate according to the addition operator, and then performs subsequent processing to obtain the output result of each gate.
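As a quick, illustrative sanity check on the split-multiply-accumulate scheme recited above: summing the block-wise partial products and then applying the offset (bias) reproduces the unsplit vector-matrix computation. The sizes, the three-way split and the random data below are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=12)            # input data X_t
W = rng.normal(size=(12, 4))       # weight data of one gate
b = rng.normal(size=4)             # bias applied by the offset operation

x_blocks = np.array_split(x, 3)                    # split into input data blocks
W_blocks = np.array_split(W, 3, axis=0)            # matching weight slices
partial = [xb @ wb for xb, wb in zip(x_blocks, W_blocks)]   # slave-side products

assert np.allclose(sum(partial) + b, x @ W + b)    # master-side accumulation + offset
```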
20. The method according to claim 19, wherein obtaining the output state value C_t based on the input state value C_{t-1} and the output result of each gate specifically comprises:
the main processing circuit multiplies the input state value C_{t-1} by the output result f_t of the forget gate according to the multiplication operator to obtain a first result, multiplies the output result g_t of the update state gate by the output result i_t of the input gate according to the multiplication operator to obtain a second result, and adds the first result and the second result to obtain the output state value C_t.
21. The method according to claim 19, wherein obtaining the output data h_t based on the input state value C_{t-1} and the output result of each gate specifically comprises:
the main processing circuit performs an activation operation on the output state value C_t according to the activation operator to obtain an activation result, and multiplies the output result O_t of the output gate by the activation result to obtain the output data h_t.
22. The method according to claim 19, wherein the subsequent processing specifically comprises:
for the input gate, the forget gate and the output gate, the subsequent processing is a sigmoid operation;
for the update state gate, the subsequent processing is a tanh activation operation.
23. The method of claim 19, wherein the method further comprises:
the main processing circuit takes the output data h_t as the input result at the next moment and the output state value C_t as the input state value at the next moment.
24. The method according to any one of claims 19-23, wherein the number of slave processing circuits is plural, and the operation unit comprises a tree module, the tree module comprising: a root port and a plurality of branch ports, the root port of the tree module being connected with the main processing circuit, and each of the plurality of branch ports of the tree module being connected with one of the plurality of slave processing circuits; the method further comprises:
the tree module forwards data and operators between the master processing circuit and the plurality of slave processing circuits.
25. The method according to any one of claims 19-23, wherein the number of slave processing circuits is plural, the operation unit further comprises one or more branch processing circuits, and each branch processing circuit is connected to at least one slave processing circuit; the method further comprises:
the branch processing circuit forwards data and operators between the master processing circuit and the plurality of slave processing circuits.
26. The method of any of claims 19-23, wherein the number of slave processing circuits is a plurality, the plurality of slave processing circuits being distributed in an array; each slave processing circuit is connected with other adjacent slave processing circuits, the master processing circuit is connected with k slave processing circuits in the plurality of slave processing circuits, and the k slave processing circuits are: n slave processing circuits of the 1 st row, n slave processing circuits of the m th row, and m slave processing circuits of the 1 st column; the method further comprises the steps of:
The k slave processing circuits forward data and operators between the master processing circuit and the plurality of slave processing circuits.
27. The method of any of claims 19-23, wherein the slave processing circuit comprises: a multiplication processing circuit and an accumulation processing circuit; the method specifically comprises the following steps:
the multiplication processing circuit performs a product operation on the element values in the received input data block and the element values at corresponding positions in the weight of each gate to obtain a product result of each gate, and performs a product operation on the element values in the received input state data block and the element values at corresponding positions in the weight of each gate to obtain another product result of each gate;
and the accumulation processing circuit performs accumulation operation on the product result of each gate to obtain an intermediate result of each gate, and performs accumulation operation on the other product result of each gate to obtain a state intermediate result of each gate.
CN201811579542.3A 2018-12-20 2018-12-21 Computing device and board card Active CN109670581B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811579542.3A CN109670581B (en) 2018-12-21 2018-12-21 Computing device and board card
PCT/CN2019/105932 WO2020125092A1 (en) 2018-12-20 2019-09-16 Computing device and board card

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811579542.3A CN109670581B (en) 2018-12-21 2018-12-21 Computing device and board card

Publications (2)

Publication Number Publication Date
CN109670581A CN109670581A (en) 2019-04-23
CN109670581B true CN109670581B (en) 2023-05-23

Family

ID=66147138

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811579542.3A Active CN109670581B (en) 2018-12-20 2018-12-21 Computing device and board card

Country Status (1)

Country Link
CN (1) CN109670581B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020125092A1 (en) * 2018-12-20 2020-06-25 中科寒武纪科技股份有限公司 Computing device and board card
CN110764714B (en) * 2019-11-06 2021-07-27 深圳大普微电子科技有限公司 Data processing method, device and equipment and readable storage medium
CN112491555B (en) * 2020-11-20 2022-04-05 山西智杰软件工程有限公司 Medical electronic signature processing method and electronic equipment
CN112329926A (en) * 2020-11-30 2021-02-05 珠海采筑电子商务有限公司 Quality improvement method and system for intelligent robot

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341542A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 Apparatus and method for performing Recognition with Recurrent Neural Network and LSTM computings
CN108268939A (en) * 2016-12-30 2018-07-10 上海寒武纪信息科技有限公司 For performing the device of LSTM neural network computings and operation method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001188767A (en) * 1999-12-28 2001-07-10 Fuji Xerox Co Ltd Neutral network arithmetic unit and method
US9747546B2 (en) * 2015-05-21 2017-08-29 Google Inc. Neural network processor
US10380481B2 (en) * 2015-10-08 2019-08-13 Via Alliance Semiconductor Co., Ltd. Neural network unit that performs concurrent LSTM cell calculations
WO2018058452A1 (en) * 2016-09-29 2018-04-05 北京中科寒武纪科技有限公司 Apparatus and method for performing artificial neural network operation
US10691996B2 (en) * 2016-12-15 2020-06-23 Beijing Deephi Intelligent Technology Co., Ltd. Hardware accelerator for compressed LSTM
EP3564863B1 (en) * 2016-12-30 2024-03-13 Shanghai Cambricon Information Technology Co., Ltd Apparatus for executing lstm neural network operation, and operational method

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107341542A (en) * 2016-04-29 2017-11-10 北京中科寒武纪科技有限公司 Apparatus and method for performing Recognition with Recurrent Neural Network and LSTM computings
CN108268939A (en) * 2016-12-30 2018-07-10 上海寒武纪信息科技有限公司 For performing the device of LSTM neural network computings and operation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
He Feng et al. A two-step mapping method for the design of long short-term memory (LSTM) neuromorphic chips. Integrated Circuit Applications (《集成电路应用》), 2018, (07), full text. *

Also Published As

Publication number Publication date
CN109670581A (en) 2019-04-23

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
CN109522052B (en) Computing device and board card
CN109670581B (en) Computing device and board card
CN110163357B (en) Computing device and method
CN111047022B (en) Computing device and related product
TW202321999A (en) Computing device and method
CN110059797B (en) Computing device and related product
CN111488976B (en) Neural network computing device, neural network computing method and related products
CN111930681B (en) Computing device and related product
CN111488963B (en) Neural network computing device and method
CN109711540B (en) Computing device and board card
CN110059809B (en) Computing device and related product
CN109753319B (en) Device for releasing dynamic link library and related product
CN111079908B (en) Network-on-chip data processing method, storage medium, computer device and apparatus
CN111047021B (en) Computing device and related product
CN111368967B (en) Neural network computing device and method
CN111368990B (en) Neural network computing device and method
CN111368987B (en) Neural network computing device and method
CN111368986B (en) Neural network computing device and method
CN111367567B (en) Neural network computing device and method
CN111078625B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078624B (en) Network-on-chip processing system and network-on-chip data processing method
CN111078623B (en) Network-on-chip processing system and network-on-chip data processing method
CN110472734B (en) Computing device and related product
CN111047024A (en) Computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co.,Ltd.

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co.,Ltd.

GR01 Patent grant