WO2019041251A1 - Chip device and related products - Google Patents
Chip device and related products
- Publication number
- WO2019041251A1 (PCT/CN2017/099991)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data block
- unit
- basic
- main unit
- data
Classifications
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead, using a plurality of independent parallel functional units
- G06F9/3822—Parallel decoding, e.g. parallel decode units
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06N3/04—Neural network architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/048—Activation functions
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation using electronic means
- G06F2218/02—Preprocessing (aspects of pattern recognition specially adapted for signal processing)
Definitions
- The present disclosure relates to the field of communication and chip technologies, and in particular to a chip device and related products.
- ANN: Artificial Neural Network.
- A neural network is an operational model consisting of a large number of interconnected nodes (neurons).
- Existing neural network computation is based on a CPU (Central Processing Unit) or GPU (Graphics Processing Unit); such computation has high power consumption and a long calculation time.
- The embodiments of the present disclosure provide a neural network operation method and related products, which can reduce both the computation time and the power consumption of the module.
- In a first aspect, an embodiment of the present disclosure provides a neural network computation method applied to a chip device, where the chip device includes a main unit and a plurality of basic units. The method includes the following steps: the main unit acquires a data block to be calculated and an operation instruction, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; the main unit splits the distribution data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to the plurality of basic units, and broadcasts the broadcast data block to the plurality of basic units; each basic unit performs an inner product operation on the basic data blocks and the broadcast data block to obtain an operation result and sends the operation result to the main unit; the main unit processes the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
- the main unit broadcasts the broadcast data block to the multiple basic units, including:
- the main unit broadcasts the broadcast data block to the plurality of basic units in a single transmission.
- the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
- the basic unit performs inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulates the inner product processing result to obtain an operation result, and transmits the operation result to the main unit.
- In one implementation, when the operation result is the result of the inner product processing,
- the main unit processing the operation result to obtain the instruction result of the operation instruction includes:
- the main unit accumulates the operation results to obtain an accumulation result, and arranges the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
- In another implementation, the main unit broadcasting the broadcast data block to the multiple basic units includes: the main unit divides the broadcast data block into a plurality of partial broadcast data blocks and broadcasts the plurality of partial broadcast data blocks to the multiple basic units in multiple transmissions.
- the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
- the basic unit performs inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulates the inner product processing results to obtain a partial operation result, and sends the partial operation result to the main unit.
- the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit, including:
- the basic unit multiplexes the partial broadcast data block n times, performing inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, and accumulates each of them to obtain n partial operation results;
- the n partial operation results are sent to the main unit, where n is an integer greater than or equal to 2.
- In a second aspect, a chip device is provided that includes a main unit and a plurality of basic units. The main unit is configured to acquire a data block to be calculated and an operation instruction, divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction, split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. The basic units are configured to perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and transmit the operation results to the main unit. The main unit is further configured to process the operation results to obtain the instruction result of the operation instruction for the data block to be calculated.
- the chip device further includes: a branching unit, the branching unit is disposed between the main unit and the basic unit; and the branching unit is configured to forward data.
- the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single transmission.
- the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation result to the main unit.
- when the operation result is the result of the inner product processing, the main unit is configured to accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the operation instruction for the data block to be calculated.
- the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic units in multiple transmissions.
- the basic unit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, and transmit the partial operation result to the main unit.
- the basic unit is specifically configured to multiplex the partial broadcast data block n times, perform inner product operations between the partial broadcast data block and n basic data blocks to obtain n partial processing results, accumulate each of them separately to obtain n partial operation results, and send the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
- the main unit includes one or any combination of a main register and a main on-chip buffer circuit;
- the basic unit includes one or any combination of a basic register and a basic on-chip buffer circuit.
- the main unit includes one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, or a data rearrangement circuit.
- the basic unit includes one or any combination of an inner product operator circuit and an accumulator circuit.
- the branch unit is a plurality of branch units, and the main unit is separately connected to the plurality of branch units, and each branch unit is connected to at least one base unit.
- the branch unit is a plurality of branch units, and the plurality of branch units are connected in series and connected to the main unit, and each branch unit is respectively connected to at least one base unit.
- the branch unit is specifically configured to forward data between the main unit and the basic units.
- the branch unit is specifically configured to forward data between the main unit and the basic units or other branch units.
- the data is: one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
- if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block;
- if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernels are determined to be the distribution data block.
- In another aspect, a method for applying the chip device provided by the second aspect is provided, the chip device being configured to perform one or any combination of a matrix-times-matrix operation, a matrix-times-vector operation, a convolution operation, and a fully connected operation.
- In another aspect, a chip is provided that integrates the chip device provided by the second aspect.
- a smart device comprising the chip provided by the sixth aspect.
- In the technical solutions provided by the embodiments of the present disclosure, the data is divided into distribution data and broadcast data, and the distribution data is split into basic data blocks and distributed to a plurality of basic units that perform the inner product operations.
- Because the inner product operation, which accounts for the largest share of the computation, is distributed to multiple basic units and executed simultaneously, the solution has the advantages of reducing calculation time and saving power consumption.
- FIG. 1a is a schematic structural diagram of a chip device provided by the present disclosure.
- FIG. 1b is a schematic structural diagram of another chip device provided by the present disclosure.
- FIG. 1c is a schematic diagram of data distribution of the chip device provided by the present disclosure.
- FIG. 1d is a schematic diagram of data back transmission of a chip device.
- FIG. 2 is a schematic flow chart of a method for computing a neural network according to an embodiment of the present disclosure.
- FIG. 2a is a schematic diagram of matrix A multiplied by matrix B provided by an embodiment of the present disclosure.
- FIG. 3 is a schematic flowchart diagram of a method for computing a neural network according to an embodiment of the present disclosure.
- Figure 3a is a schematic diagram of single sample data for Full Connection 1.
- Figure 3b is a schematic diagram of multi-sample data for full connection 2.
- Figure 3c is a schematic diagram of M convolution kernel data for convolution 1.
- Figure 3d is a schematic diagram of convolution 2 input data.
- Figure 3e is a schematic diagram of the operation window of a three-dimensional data block of input data.
- Figure 3f is a schematic diagram of another operational window of a three-dimensional data block of input data.
- Figure 3g is a schematic diagram of yet another operational window of a three-dimensional data block of input data.
- References to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure.
- The appearances of this phrase in various places in the specification do not necessarily refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
- The CPU is taken as an example to illustrate the operation method of a neural network.
- The multiplication of a matrix by a matrix is widely used in neural networks. Consider the product C = A * B of two 3x3 matrices A and B.
- When calculating C, the CPU may first complete the calculation for the first row, then the second row, and finally the third row; that is, the CPU finishes the calculation for one row of data before beginning the calculation for the next row.
- To complete the first row, the CPU needs to compute a11*b11 + a12*b21 + a13*b31, a11*b12 + a12*b22 + a13*b32, and a11*b13 + a12*b23 + a13*b33.
- On a CPU or GPU the computation proceeds row by row: after the first row is calculated, the second row is calculated, and then the third, until all rows are calculated.
- In practice a matrix may have thousands of rows of data, so the calculation time is very long; during the calculation the CPU remains in a working state for a long time, so the energy consumption is also high.
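The row-by-row procedure described above can be sketched in a few lines (an illustrative sketch, not code from the patent; the function name is ours). Each output row is finished before the next begins, so the work is fully serial:

```python
# Hypothetical sketch of the serial CPU computation of C = A * B
# described above, shown for a 3x3 example.

def matmul_row_by_row(A, B):
    M, L = len(A), len(B)
    N = len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):          # one row of C is completed at a time
        for j in range(N):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(L))
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul_row_by_row(A, B))  # B is the identity, so C equals A
```

With M rows, the total time scales as M times the cost of one row, which is the serial bottleneck the chip device is designed to remove.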
- FIG. 1b is a schematic structural diagram of a chip device. As shown in FIG. 1b, the device includes a main unit circuit, basic unit circuits, and branch unit circuits.
- The main unit circuit may include a register and/or an on-chip buffer circuit, and may further include one or any combination of a vector operator circuit, an ALU (arithmetic and logic unit) circuit, an accumulator circuit, a matrix transposition circuit, a DMA (direct memory access) circuit, and a data rearrangement circuit.
- Each basic unit may include a basic register and/or a basic on-chip buffer circuit, and may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
- All of these circuits can be integrated circuits. If a branch unit is present, the main unit is connected to the branch unit and the branch unit is connected to the basic units: the basic units perform the inner product operations between data blocks, the main unit transmits and receives external data and distributes it to the branch units, and the branch units forward data between the main unit and the basic units.
- The structure shown in FIG. 1b is suitable for the computation of complex data: because the number of units the main unit can connect to directly is limited, branch units are added between the main unit and the basic units so that more basic units can be attached, enabling the computation of complex data blocks.
- The connection structure between the branch units and the basic units may be arbitrary and is not limited to the H-type structure of FIG. 1b.
- The direction from the main unit to the basic units is a broadcast or distribution structure, and the direction from the basic units to the main unit is a gather structure.
- Broadcast, distribution, and gather are defined as follows:
- the data transfer manner of the main unit to the base unit may include:
- the main unit is connected to a plurality of branch units, and each branch unit is connected to a plurality of base units.
- the main unit is connected to a branch unit, which is connected to a branch unit, and so on, and a plurality of branch units are connected in series, and then each branch unit is connected to a plurality of base units.
- the main unit is connected to a plurality of branch units, and each branch unit is connected in series with a plurality of base units.
- the main unit is connected to a branch unit, which is connected to a branch unit, and so on, and a plurality of branch units are connected in series, and then each branch unit is connected in series with a plurality of base units.
- When distributing data, the main unit transmits data to some or all of the basic units, and each receiving basic unit may receive different data;
- When broadcasting data, the main unit transmits data to some or all of the basic units, and every receiving basic unit receives the same data.
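The distinction between distribution and broadcast can be sketched as follows (an illustrative sketch; the function and unit names are ours, not the patent's):

```python
# "Distribution": each basic unit may receive different data blocks.
# "Broadcast": every basic unit receives the identical data block.

def distribute(data_blocks, units):
    # simple round-robin assignment; units end up with different blocks
    assignments = {u: [] for u in units}
    for i, block in enumerate(data_blocks):
        assignments[units[i % len(units)]].append(block)
    return assignments

def broadcast(block, units):
    # one-to-many: the same block goes to every unit
    return {u: block for u in units}

units = ["base0", "base1", "base2", "base3"]
rows = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
dist = distribute(rows, units)      # e.g. base0 gets rows 0 and 4
bcast = broadcast([[1], [1]], units)
```

Gather is the reverse direction: the main unit collects one result from each basic unit.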
- the chip device shown in FIG. 1a or FIG. 1b may be a single physical chip. Of course, in practical applications, the chip device may also be integrated in other chips (for example, CPU, GPU). The specific embodiment does not limit the physical representation of the above chip device.
- FIG. 1c is a schematic diagram of data distribution in the chip device. As indicated by the arrows in FIG. 1c, which show the distribution direction, after the main unit receives external data it splits the data and distributes it to the plurality of branch units, and each branch unit forwards the split data to its basic units.
- FIG. 1d is a schematic diagram of data return in the chip device. As indicated by the arrows in FIG. 1d, which show the return direction, the basic units pass data (for example, inner product results) back to the branch units, and the branch units pass the data back to the main unit.
- FIG. 1a is a schematic structural diagram of another chip device. This chip device includes a main unit and basic units, with the main unit connected directly to the basic units. Because the basic units in the structure shown in FIG. 1a are physically connected directly to the main unit, the number of basic units that can be attached is limited, so this structure is suitable for simple data computation.
- FIG. 2 provides a method for computing a neural network using the above chip device. The method is implemented by a chip device as shown in FIG. 1a or FIG. 1b and, as shown in FIG. 2, includes the following steps:
- Step S201 The main unit of the chip device acquires a data block to be calculated and an operation instruction.
- The data block to be calculated in step S201 may specifically be a matrix, a vector, three-dimensional data, four-dimensional data, multi-dimensional data, or the like; the specific embodiments of the present disclosure do not limit the specific form of the data block.
- The operation instruction may specifically be a multiplication instruction, a convolution instruction, an addition instruction, a subtraction instruction, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, or the like.
- Step S202: The main unit divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction.
- The implementation of step S202 may specifically be as follows:
- if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block;
- if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernels are determined to be the distribution data block.
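The role assignment in step S202 can be written down directly (a sketch; the function name and operand labels are ours, not the patent's):

```python
# Hypothetical sketch of step S202: choose which operand is broadcast
# and which is split and distributed, based on the operation instruction.

def assign_roles(op, operands):
    if op == "multiply":
        # multiplier -> broadcast block, multiplicand -> distribution block
        return {"broadcast": operands["multiplier"],
                "distribute": operands["multiplicand"]}
    if op == "convolution":
        # input -> broadcast block, convolution kernels -> distribution block
        return {"broadcast": operands["input"],
                "distribute": operands["kernels"]}
    raise ValueError(f"unsupported instruction: {op}")
```

For A*B this makes B the broadcast block and A the distribution block, matching the matrix-multiplication example later in the description.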
- Step S2031 The main unit performs split processing on the distributed data block to obtain a plurality of basic data blocks, and distributes the plurality of basic data blocks to multiple basic units.
- Step S2032: The main unit broadcasts the broadcast data block to the plurality of basic units.
- Steps S2031 and S2032 may also be performed in a loop.
- In that case, the main unit splits the distribution data block into a plurality of basic data blocks, and each basic data block is further split into basic data sub-blocks;
- the broadcast data block is likewise split into m broadcast data sub-blocks. The main unit then distributes one basic data sub-block and broadcasts one broadcast data sub-block at a time; the basic data sub-blocks and broadcast data
- sub-blocks are all data blocks on which neural network calculations can be performed in parallel.
- For example, in a matrix multiplication A*B, a basic data block may be the z-th row of matrix A, and a basic data sub-block may be the first 20 elements of that row;
- the corresponding broadcast data sub-block may be the first 20 rows of the z-th column of matrix B.
- The basic data block in step S203 may specifically be the smallest data block on which an inner product operation can be performed. For matrix multiplication, the basic data block may be one row of a matrix; for convolution, the basic data block may be the weights of one convolution kernel.
- For the manner of distributing the basic data blocks in step S203 and the method of broadcasting the broadcast data block, refer to the descriptions of the following embodiments; details are not repeated here.
- Step S2041 The basic unit of the chip device performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result (possibly an intermediate result).
- Step S2042 If the operation result is not an intermediate result, the operation result is transmitted back to the main unit.
- Step S205 The main unit processes the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.
- The processing in step S205 may be accumulation, arrangement, or the like. The present disclosure is not limited to a specific manner of processing; the specific manner is configured according to the operation instruction and may, for example, also include a nonlinear transformation.
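Putting steps S201-S205 together for a matrix multiplication C = A * B, with A as the distribution data block split into rows and B as the broadcast data block, the flow can be sketched as follows (a sketch only; a thread pool stands in for the basic units, and all names are ours):

```python
from multiprocessing.dummy import Pool  # threads stand in for the basic units

def basic_unit_inner_products(args):
    # one basic data block (a row of A) plus the broadcast block B
    row, B = args
    return [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]

def chip_device_matmul(A, B, n_units=4):
    basic_blocks = [(row, B) for row in A]   # S2031: split A into row blocks
    with Pool(n_units) as pool:              # S2032/S204: broadcast B, run inner
        results = pool.map(basic_unit_inner_products, basic_blocks)  # products in parallel
    return results                           # S205: main unit arranges the rows of C
```

The per-row inner products are independent, which is exactly why the patent can hand them to separate basic units.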
- In the technical solution provided by the present disclosure, the main unit receives external data that includes a data block to be calculated and an operation instruction; it determines the distribution data block and the broadcast data block of the data block to be calculated according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, broadcasts the broadcast data block to the plurality of basic units, and distributes the basic data blocks to the plurality of basic units. The basic units perform inner product operations on the basic data blocks and the broadcast data block to obtain operation results and return them to the main unit, and the main unit obtains the instruction result of the operation instruction from the returned operation results.
- The technical point of this solution is that, for a neural network, the bulk of the computation lies in inner product operations between data blocks; the inner product operation has high overhead and a long calculation time. The embodiments of the present disclosure therefore first use the operation instruction to distinguish, within the data block to be calculated, the distribution data block from the broadcast data block: the broadcast data block is the data block that every inner product operation must use, while the distribution data block can be split.
- Matrix multiplication is taken as an example. The data blocks to be calculated are matrix A and matrix B, and the operation instruction is a multiplication instruction (A*B). According to the rules of matrix multiplication, matrix A is determined to be the splittable (distribution) data block and matrix B the broadcast data block, because the multiplicand matrix A can be split into multiple basic data blocks while the multiplier matrix B is needed by every inner product operation.
- Each row of the multiplicand matrix A must undergo an inner product operation with the multiplier matrix B. Therefore, the technical solution of the present application divides matrix A into M basic data blocks, each of which can be one row of matrix A. The time-consuming operations are thus executed separately by multiple basic units, so that in the inner product phase the multiple basic units compute their results quickly in parallel, reducing the calculation time; the shorter calculation time also reduces the operating time of the chip device and thereby its power consumption.
- Consider the example of a matrix A multiplied by a vector B, where matrix A has M rows and L columns and vector B has L rows. Assume that one operator needs time t1 to compute the inner product of one row of matrix A with vector B.
- Matrix A is split into M basic data blocks, each basic data block being one row of matrix A, and the M basic data blocks are distributed to M basic units.
- If t2 is the time for the main unit to split and distribute the data and t3 is the time for the main unit to process the inner product results, the M basic units compute their rows in parallel, so the total time is approximately t2 + t1 + t3 rather than the M*t1 required by a row-by-row computation.
- The chip device provided by the present disclosure therefore has a short working time, and experiments show that when the working time of the chip device is very short, its energy consumption is far lower than that of long working hours, so the device also has the advantage of saving energy.
- The main unit can broadcast the broadcast data block to the multiple basic units in several ways; for example:
- Mode A: broadcast the broadcast data block to the plurality of basic units in a single transmission.
- Broadcast here refers to "one-to-many" data transmission, that is, the main unit simultaneously sends the same data block to a plurality of (all or some of the) basic units. For example, in matrix A * matrix B, where matrix B is the broadcast data block, matrix B is broadcast to the plurality of basic units in one transmission.
- Similarly, in a convolution, the input data block is the broadcast data block and is broadcast to the plurality of basic units in one transmission.
- Mode B: dividing the broadcast data block into a plurality of partial broadcast data blocks and broadcasting them to the plurality of basic units over multiple broadcasts; for example, the matrix B is broadcast to the plurality of basic units over multiple broadcasts, with N columns of the matrix B broadcast each time.
- the advantage of this mode is that the configuration of the basic units can be reduced. The storage space of the registers configured for a basic unit cannot be large, and if a matrix B with a relatively large amount of data were sent to the basic units at once, storing it would require a relatively large register space. Because the number of basic units is large, increasing the register space inevitably has a great impact on cost. With the scheme of broadcasting the broadcast data block multiple times, a basic unit only needs to store the part of the broadcast data block that is broadcast each time, thereby reducing the cost.
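Mode B can be sketched as follows. This is an illustrative simulation under our own naming; the column batch width is an assumption, not fixed by the disclosure.

```python
# Sketch of Mode B: broadcast matrix B a few columns at a time so a basic
# unit only ever holds a small partial broadcast data block in its register.
def split_columns(B, width):
    # Main unit: cut B (L rows x N columns) into partial broadcast data
    # blocks of `width` columns each.
    N = len(B[0])
    return [[row[c:c + width] for row in B] for c in range(0, N, width)]

def basic_unit_partial(row_of_A, partial_B):
    # Basic unit: inner products of its basic data block (one row of A)
    # with each column of the current partial broadcast data block.
    return [sum(a * b for a, b in zip(row_of_A, col))
            for col in zip(*partial_B)]

B = [[1, 0, 2], [0, 1, 3]]      # L=2, N=3
parts = split_columns(B, 2)      # two broadcasts: 2 columns, then 1 column
row = [4, 5]                     # one basic data block (a row of A)
result = []
for p in parts:                  # one broadcast round per partial block
    result += basic_unit_partial(row, p)
assert result == [4, 5, 23]      # equals row . each column of B
```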
- the foregoing method for distributing a plurality of basic data blocks to a plurality of basic units in step S203 may also adopt the above Mode A or Mode B, except that the transmission mode is unicast and the transmitted data is the basic data block.
- the implementation method of the foregoing step S204 may specifically be:
- the basic unit performs inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, that is, it executes the inner product operation of one row at a time; the inner product processing result (one kind of operation result) is sent to the main unit, and the main unit accumulates the inner product processing results.
- alternatively, the basic unit may accumulate the inner product processing results itself and then send the accumulated result (another kind of operation result) to the main unit.
- the above method can reduce the amount of data transmission between the main unit and the basic unit, thereby increasing the speed of calculation.
- every time the basic unit receives a partial broadcast data block, it performs a partial inner product operation of the basic data block and the partial broadcast data block to obtain a partial processing result; the partial processing result is sent to the main unit, and the main unit accumulates the processing results.
- if the basic unit receives n basic data blocks, it multiplexes the broadcast data block to perform the inner product operation of the broadcast data block with each of the n basic data blocks, obtaining n partial processing results; the basic unit transmits the n processing results to the main unit, and the main unit accumulates the n processing results respectively.
- the above accumulation can also be performed in the basic unit.
- the amount of data of the broadcast data block, as well as of the distribution data block, is generally very large. For the chip device, because the basic units belong to the hardware configuration, their number could in theory be arbitrary, but in practice it is limited, generally to tens of basic units; this number may keep changing, e.g., increasing, as technology develops.
- the matrix A may have thousands of rows, and the matrix B may have thousands of columns, so sending the matrix B to the basic units in a single broadcast cannot be realized.
- one implementation is to broadcast part of the data of the matrix B each time, for example the first five columns; a similar manner may be adopted for the matrix A. The basic unit then performs a partial inner product calculation each time, stores the result of the partial inner product calculation in its register, and, after all the inner product operations of a row are executed, accumulates the partial inner product results of that row to obtain an operation result, which it sends to the main unit.
- This approach has the advantage of increasing the speed of calculation.
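The register-accumulation idea above can be sketched as follows. This is an illustrative simulation under our own naming; the "register" is modeled by a local variable.

```python
# Sketch: when the shared dimension L is broadcast in pieces, a basic unit
# keeps a running partial sum in a local "register" and sends only the
# final accumulated inner product to the main unit.
def stream_inner_product(row_pieces, vec_pieces):
    register = 0                    # the basic unit's accumulation register
    for rp, vp in zip(row_pieces, vec_pieces):
        # partial inner product over one broadcast piece
        register += sum(a * b for a, b in zip(rp, vp))
    return register                 # one transfer back to the main unit

row_pieces = [[1, 2], [3]]          # one row of A, received in two pieces
vec_pieces = [[10, 10], [10]]       # matching pieces of the broadcast data
assert stream_inner_product(row_pieces, vec_pieces) == 60
```

Only one value per row crosses back to the main unit, which is the data-transfer saving the text claims.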
- FIG. 3 provides a calculation method of a neural network; the calculation in this embodiment is a matrix A * matrix B operation, where the matrices A and B may be as shown in the diagram of FIG. 3a.
- the calculation method of the neural network shown in FIG. 3 is performed in the chip device shown in FIG. 1b.
- the chip device has 16 basic units.
- the value of M as shown in FIG. 3a may be 32, the value of N may be 15, and the value of L may be 20. It will of course be understood that the computing device can have any number of basic units.
- the method is shown in Figure 3 and includes the following steps:
- Step S301 the main unit receives the matrix A, the matrix B, and the multiplication operation instruction A*B.
- Step S302: the main unit determines, according to the multiplication operation instruction A*B, that the matrix B is the broadcast data block and the matrix A is the distribution data block; the matrix A is divided into 32 basic data blocks, each basic data block being a row of data of the matrix A.
- Step S303: the main unit evenly allocates the 32 basic data blocks to the 16 basic units, that is, each basic unit receives 2 basic data blocks; the two blocks may be allocated in any non-repeating order.
- the allocation of the foregoing step S303 may also adopt other allocation methods.
- the basic data blocks may be unevenly allocated to the basic units, for example when they cannot be divided equally.
- the embodiment of the present disclosure does not limit the manner in which the above basic data blocks are allocated to a plurality of basic units.
- Step S304: the main unit extracts the partial data of the first few columns (such as the first five columns) of the matrix B and broadcasts the partial data of the first five columns of the matrix B to the 16 basic units.
- Step S305: the 16 basic units multiplex the partial data of the first 5 columns twice (once for each of their 2 basic data blocks), performing the inner product and accumulation operations to obtain 32*5 pre-processing results, and send the 32*5 pre-processing results to the main unit.
- Step S306: the main unit extracts the partial data of the middle five columns of the matrix B and broadcasts the partial data of the middle five columns to the 16 basic units.
- Step S307: the 16 basic units multiplex the partial data of the middle 5 columns twice against their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 processing results, and send the 32*5 processing results to the main unit.
- Step S308: the main unit extracts the partial data of the last five columns of the matrix B and broadcasts the partial data of the last five columns to the 16 basic units.
- Step S309: the 16 basic units multiplex the partial data of the last 5 columns twice against their 2 basic data blocks, performing the inner product and accumulation operations to obtain 32*5 post-processing results, and send the 32*5 post-processing results to the main unit.
- Step S310: the main unit combines the 32*5 pre-processing results, 32*5 processing results, and 32*5 post-processing results in front, middle, and back order to obtain a 32*15 matrix C, which is the instruction result of matrix A * matrix B.
- the technical solution shown in FIG. 3 splits the matrix A into 32 basic data blocks and then broadcasts the matrix B in batches, so that the basic units can obtain the instruction result in batches. Since the inner products are split across 16 basic units for calculation, the calculation time can be greatly reduced, giving the advantages of short calculation time and low energy consumption.
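Steps S301-S310 can be simulated end to end. This is an illustrative sketch under our own naming: rows are assigned to units round-robin, and B is broadcast in 5-column batches, matching the 32x20 by 20x15 example with 16 basic units.

```python
# End-to-end sketch of FIG. 3: A is 32x20, B is 20x15, 16 basic units each
# receive 2 rows of A, and B is broadcast 5 columns at a time (3 rounds).
def matmul_batched(A, B, n_units=16, col_batch=5):
    M, L, N = len(A), len(B), len(B[0])
    C = [[0] * N for _ in range(M)]
    for c0 in range(0, N, col_batch):              # one broadcast per batch
        cols = [[B[l][c] for l in range(L)]
                for c in range(c0, min(c0 + col_batch, N))]
        for u in range(n_units):                   # the basic units
            for r in range(u, M, n_units):         # rows held by unit u
                for j, col in enumerate(cols):     # inner product per column
                    C[r][c0 + j] = sum(a * b for a, b in zip(A[r], col))
    return C

A = [[(i + j) % 7 for j in range(20)] for i in range(32)]
B = [[(i * j) % 5 for j in range(15)] for i in range(20)]
expected = [[sum(A[i][k] * B[k][j] for k in range(20)) for j in range(15)]
            for i in range(32)]
assert matmul_batched(A, B) == expected            # matches plain A*B
```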
- FIG. 1a is a chip device according to the disclosure, the chip device includes: a main unit and a basic unit, the main unit is a hardware chip unit, and the basic unit is also a hardware chip unit;
- the main unit is configured to perform each successive operation in a neural network operation and transmit data with the basic unit;
- the basic unit is configured to perform an operation of parallel acceleration in the neural network according to the data transmitted by the main unit, and transmit the operation result to the main unit.
- the above parallel accelerated operations include, but are not limited to, multiplication operations between data blocks and data blocks, convolution operations, and the like, which are large-scale and parallelizable.
- the above successive operations include, but are not limited to, continuous operations such as accumulation operations, matrix transposition operations, and data sorting operations.
- the main unit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units. The basic unit is configured to perform an inner product operation on the basic data block and the broadcast data block to obtain an operation result and to send the operation result to the main unit. The main unit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
- the chip device further includes: a branching unit, the branching unit is disposed between the main unit and the basic unit; and the branching unit is configured to forward data.
- the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units by one time.
- the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation result to the main unit.
- the main unit is configured to, when the operation result is the result of inner product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.
- the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic units by multiple times.
- the basic unit is specifically configured to perform inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, and accumulate the inner product processing result to obtain a partial operation result, Transmitting the partial operation result to the main unit.
- the basic unit is specifically configured to multiplex the partial broadcast data block n times, performing inner product operations of the partial broadcast data block with the n basic data blocks to obtain n partial processing results; after accumulating each separately, n partial operation results are obtained and sent to the main unit, where n is an integer greater than or equal to 2.
- the specific embodiment of the present disclosure further provides an application method of the chip device as shown in FIG. 1a, where the application method may be specifically used to perform one of a matrix multiplication matrix operation, a matrix multiplication vector operation, a convolution operation, or a full connection operation. Or any combination.
- the main unit may also perform pooling operations, regularization (normalization) operations, and neural network operation steps such as batch normalization and LRN.
- a specific embodiment of the present application also provides a chip including the chip device as shown in FIG. 1a or FIG. 1b.
- a specific implementation of the present application further provides a smart device, where the smart device includes the foregoing chip, which integrates a chip device as shown in FIG. 1a or FIG. 1b.
- the smart device includes, but is not limited to, a smart phone, a tablet computer, a personal digital assistant, a smart watch, a smart camera, a smart TV, a smart refrigerator, and the like.
- the foregoing devices are for illustrative purposes only, and the specific embodiments of the present application are not limited to the specific forms of the above devices.
- the input data of the fully connected layer is a vector of length L (such as the vector B in "Fully Connected 1 - Single Sample" shown in Figure 3a), i.e., the input of the neural network is a single sample;
- the output of the fully connected layer is a vector of length M;
- the weight of the fully connected layer is an M*L matrix (such as the matrix A in "Figure 3b Fully Connected 1 - Single Sample");
- the weight matrix of the fully connected layer is used as the matrix A (i.e., the distribution data block);
- the input data is used as the vector B (i.e., the broadcast data block);
- the operation is performed in accordance with the method 1 shown in FIG.
- the specific operation method can be:
- the input data of the fully connected layer is a matrix (that is, the input of the neural network is multiple samples operated on together as a batch);
- the input data of the fully connected layer represents N input samples, each sample being a vector of length L, so the input data is represented by an L*N matrix, as shown by the matrix B in "Figure 3b Fully Connected 1 - Multiple Samples";
- the output of the fully connected layer for each sample is a vector of length M, so the output data of the fully connected layer is an M*N matrix, such as the result matrix in "Fig.
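The fully connected layer over a batch therefore reduces to the matrix multiplication described earlier. A minimal sketch, with names of our own choosing:

```python
# Sketch: a fully connected layer over a batch is matrix A (M x L weights)
# times matrix B (L x N inputs), producing an M x N output matrix.
def fully_connected(weights, inputs):
    M, L = len(weights), len(weights[0])
    N = len(inputs[0])
    return [[sum(weights[m][l] * inputs[l][n] for l in range(L))
             for n in range(N)] for m in range(M)]

W = [[1, 0], [0, 1], [1, 1]]     # M=3 outputs, L=2 inputs (distribution block)
X = [[2, 3], [4, 5]]             # L=2 rows, N=2 samples (broadcast block)
assert fully_connected(W, X) == [[2, 3], [4, 5], [6, 8]]
```

With N=1 this degenerates to the matrix-times-vector single-sample case.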
- when using the chip device for artificial neural network operations, layers in the neural network such as the convolution layer, the pooling layer, and the regularization layer (also called the normalization layer, such as BN (Batch Normalization) or LRN (Local Response Normalization)) can be processed as follows.
- the main unit uses its data rearrangement circuit to place each sample of the input data in a certain order; the order may be an arbitrary order;
- for example, the order may place the input data in a layout such as NHWC or NWHC, in which the C-dimension coordinate of the data changes the fastest (i.e., C is innermost).
- C is the dimension of the innermost layer of the data block
- N is the dimension of the outermost layer of the data block
- H and W are the dimensions of the middle layer.
- H and W are the dimensions along which the operation window slides for convolution and pooling operations (an example of the operation window sliding in the W dimension is shown in "Figure 3e Convolution 3 - Sliding a" and "Figure 3f Convolution 3 - Sliding b").
- Figure 3g shows the size of the operation window and the size of one convolution kernel among the M convolution kernels.
- Figure 3c M convolution kernels
- each convolution kernel is a 5*3*3 three-dimensional data block
- its operation window is also a 5*3*3 three-dimensional data block, as shown in "Figure 3c M convolution kernels"; the KH and KW shown there indicate that the KH dimension corresponds to the H dimension of the input data and the KW dimension corresponds to the W dimension of the input data.
- the gray parts in Figures 3e, 3f, and 3g are the data used for calculation at each sliding of the operation window; the sliding direction may be H first and then W, or W first and then H.
- for convolution, the operation at each sliding window is an inner product between the data block represented by the gray part and each of the M convolution kernels shown in "Figure 3c"; the convolution outputs one value per convolution kernel at each sliding window position, that is, M output values per sliding window.
- for pooling, the operation at each sliding window is performed on the data block represented by the gray part in the H and W dimensions (in the example in the figure, the 9 numbers of the gray data block lying on the same plane): the maximum value is selected, or the average value is calculated; pooling outputs C values at each sliding window position.
- C is the remaining dimension, besides H and W, of the three-dimensional data block of a single sample, and N indicates that a total of N samples are operated on by this layer simultaneously.
- for LRN, the C dimension is defined as follows: each basic LRN operation selects a continuous data block along the C dimension (i.e., a data block of Y*1*1), where Y is the extent in the C dimension, the value of Y is less than or equal to the maximum value of the C dimension, the first 1 represents the H dimension, and the second 1 represents the W dimension. The remaining two dimensions are the H and W dimensions; that is, for the three-dimensional data block of each sample, each LRN regularization operation is performed on a continuous portion of data with the same H coordinate, the same W coordinate, and different C coordinates.
- in the regularization algorithm BN, the mean and variance (or standard deviation) are computed over all values sharing the same C-dimension coordinate across the three-dimensional data blocks of the N samples.
- a square is used to represent a numerical value, which can also be called a weight; the numbers used in the schematic diagrams are only examples.
- a dimension value may be any numerical value, including 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block; for example, when the number of samples is 1, the input data is a three-dimensional data block, and when the number of convolution kernels is 1, the convolution kernel data is a three-dimensional data block.
- for a convolutional layer, its weight (all convolution kernels) is as shown in "Figure 3c Convolution 1 - Convolution Kernel"; the number of convolution kernels is M, and each convolution kernel consists of C matrices of KH rows and KW columns, so the weight of the convolution layer can be expressed as a four-dimensional data block with the four dimensions M, C, KH, and KW.
- the input data of the convolutional layer is a four-dimensional data block composed of N three-dimensional data blocks, each of which consists of C feature matrices of H rows and W columns (i.e., a data block with the four dimensions N, C, H, W), as shown in "Figure 3d Convolution 2 - Input Data".
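The layout just described (weights M, C, KH, KW; input C, H, W per sample) can be exercised with a naive reference convolution. This is an illustrative sketch only, not the chip's parallel scheme; names are ours.

```python
# Naive sketch: input x is (C, H, W), weights w are (M, C, KH, KW).
# Each sliding-window position produces M output values, one per kernel.
def conv2d(x, w):
    C, H, W = len(x), len(x[0]), len(x[0][0])
    M, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    out = [[[0] * (W - KW + 1) for _ in range(H - KH + 1)] for _ in range(M)]
    for m in range(M):
        for i in range(H - KH + 1):          # window slides along H
            for j in range(W - KW + 1):      # ... and along W
                out[m][i][j] = sum(          # inner product over C, KH, KW
                    x[c][i + a][j + b] * w[m][c][a][b]
                    for c in range(C) for a in range(KH) for b in range(KW))
    return out

x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]      # C=1, H=3, W=3
w = [[[[1, 0], [0, 1]]]]                      # M=1, C=1, KH=KW=2
assert conv2d(x, w) == [[[6, 8], [12, 14]]]
```

On the chip device, each kernel (or a finer-grained piece of it) is a basic data block, and the operation windows of the input are the broadcast data.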
- each convolution kernel can be a basic data block.
- the basic data block can also be of a smaller granularity, such as one planar matrix of a convolution kernel.
- the convolution kernel weight set distributed to the i-th basic unit is Ai, which contains a total of 46 convolution kernels.
- the i-th basic unit stores the received convolution kernel weights Ai, distributed by the main unit, in its register and/or on-chip cache; the parts of the input data (i.e., the operation windows as shown in FIG. 3e, FIG. 3f, or FIG. 3g) are transmitted to each basic unit in a broadcast manner (the broadcast may use Mode A or Mode B above).
- the data of the operation window can be broadcast to all basic units over multiple broadcasts; a part of the operation window can be broadcast each time, for example one planar matrix at a time: as shown in FIG. 3e, a KH*KW matrix of one C-plane can be broadcast each time.
- the data of the first n rows or the first n columns of a C-plane KH*KW matrix can also be broadcast at a time.
- the manner of transmitting the partial data and the arrangement of the partial data are not limited: the placement of the input data can be converted into an arrangement of any dimension order, and then the parts of the input data are broadcast to the basic units in sequence.
- the foregoing distribution data may also be sent in a manner similar to the operation window of the input data, and details are not described herein again.
- the input data is converted into a layout in which C is the innermost dimension. The benefit of this is that the C data are contiguous, which increases the parallelism of the convolution operation and makes it easier to perform parallel operations on multiple feature maps.
- one manner is to convert the placement of the input data into the NHWC or NWHC layout order. Each basic unit, for example the i-th basic unit, calculates the inner product of a convolution kernel in the weights Ai and the corresponding part (i.e., the operation window) of the received broadcast data; the data of the corresponding part of the weights Ai can be read directly from the on-chip cache, or can be read into a register for multiplexing.
- the result of the inner product of each base unit is accumulated and transmitted back to the main unit.
- alternatively, the partial sum obtained by each inner product operation performed by the basic unit can be transferred to the main unit for accumulation; or the partial sums obtained by the inner product operations can be stored in the register and/or on-chip cache of the basic unit, accumulated there, and transferred back to the main unit after the accumulation is completed; or the partial sums can be partly stored and accumulated in the register and/or on-chip cache of the basic unit and, in some cases, partly transmitted to the main unit for accumulation, with the result transferred back to the main unit after the accumulation is completed.
- GEMM calculation refers to the operation of matrix-matrix multiplication in the BLAS library, usually of the form C = alpha*op(A)*op(B) + beta*C, with auxiliary integers as parameters describing the width and height of the matrices A and B;
- the input matrix A and the matrix B are respectively subjected to respective op operations; the op operation may be a transposition operation of the matrix, and of course, other operations, such as non-linear function operations, pooling, and the like.
- the matrix op operation is implemented by using the vector operation function of the main unit; if the op of a certain matrix is empty, the main unit does not perform any operation on that matrix;
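The GEMM step order described above can be sketched as follows. This is an illustrative reference in pure Python, not the chip implementation; `op` is shown as an optional transpose, matching the text.

```python
# Sketch of GEMM: C_out = alpha * op(A) * op(B) + beta * C.
def transpose(M):
    return [list(r) for r in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gemm(alpha, A, B, beta, C, op_a=None, op_b=None):
    A = op_a(A) if op_a else A        # op may be a transposition, or empty
    B = op_b(B) if op_b else B
    AB = matmul(A, B)                 # the distributed inner-product step
    return [[alpha * AB[i][j] + beta * C[i][j]   # scale and add beta*C
             for j in range(len(C[0]))] for i in range(len(C))]

A = [[1, 2]]
B = [[3], [4]]
C = [[10]]
assert gemm(2, A, B, 0.5, C) == [[27.0]]   # 2*(1*3+2*4) + 0.5*10
```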
- GEMV calculation refers to the operation of matrix-vector multiplication in the BLAS library.
- the corresponding op operation is performed on the input matrix A; the chip device completes the matrix-vector multiplication between op(A) and the vector B using the method shown in FIG. 2; using the vector operation function of the main unit, each element of the result op(A)*B is multiplied by alpha; and the vector operation function of the main unit is used to add the corresponding positions of alpha*op(A)*B and beta*C.
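The GEMV steps can be sketched analogously. This is an illustrative reference under our own naming, with `op` again shown as an optional transpose.

```python
# Sketch of GEMV: y_out = alpha * op(A) * x + beta * y.
def gemv(alpha, A, x, beta, y, op_a=None):
    A = op_a(A) if op_a else A                     # op on matrix A (or empty)
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]  # op(A)*x
    return [alpha * v + beta * w for v, w in zip(Ax, y)]    # scale and add

A = [[1, 2], [3, 4]]
x = [1, 1]
y = [10, 20]
assert gemv(2, A, x, 1, y) == [16, 34]   # 2*[3,7] + [10,20]
```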
- an activation function usually refers to performing a nonlinear operation on each element of a data block (which can be a vector or a multidimensional matrix).
- the chip device uses the vector calculation function of the main unit to compute the activation of an input vector: the main unit passes each value in the input vector through an activation function (whose input is one value and whose output is also one value) and writes the computed value to the corresponding position of the output vector;
- the source of the above input vector includes, but is not limited to, external data of the chip device, and calculation result data of the basic unit forwarded by the branch unit of the chip device.
- the calculation result data may specifically be the result of a matrix-multiply-vector operation or of a matrix-multiply-matrix operation; the input data may also be a calculation result to which the main unit has added a bias.
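The element-wise activation step can be sketched as follows. ReLU is our illustrative choice of activation function; the disclosure does not fix a particular function.

```python
# Sketch: the main unit maps each value of the input vector through a
# one-value-in / one-value-out activation function.
def activate(vec, f):
    return [f(v) for v in vec]        # one output value per input value

relu = lambda v: v if v > 0 else 0    # example activation (assumed, not fixed)
assert activate([-1.5, 0.0, 2.0], relu) == [0, 0, 2.0]
```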
- the function of adding two vectors or two matrices can be realized by the main unit; the main unit can also add a vector to each row, or to each column, of a matrix.
- the matrix may come from a matrix-multiply-matrix operation performed by the device, from a matrix-multiply-vector operation performed by the device, or from data received externally by the main unit of the device.
- the vector may come from data received externally by the main unit of the device.
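The row-wise and column-wise vector additions can be sketched as follows; the function names are ours, chosen for illustration (e.g., a bias add after a matrix multiply).

```python
# Sketch: add a vector to every row, or to every column, of a matrix,
# as the main unit's vector operation function would.
def add_to_rows(M, v):
    # v has one entry per column; added to each row.
    return [[x + b for x, b in zip(row, v)] for row in M]

def add_to_cols(M, v):
    # v has one entry per row; added to each column.
    return [[x + v[i] for x in row] for i, row in enumerate(M)]

M = [[1, 2], [3, 4]]
assert add_to_rows(M, [10, 20]) == [[11, 22], [13, 24]]
assert add_to_cols(M, [10, 20]) == [[11, 12], [23, 24]]
```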
- the disclosed apparatus may be implemented in other ways.
- the device embodiments described above are merely illustrative.
- the division of the unit is only a logical function division, and the actual implementation may have another division manner. Multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, device or unit, and may be electrical or otherwise.
- each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
- the above integrated units/modules are implemented in the form of hardware.
- the hardware can be a circuit, including a digital circuit, an analog circuit, and the like.
- Physical implementations of the hardware structures include, but are not limited to, physical devices such as transistors and memristors.
- the computing modules in the computing device can be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like.
- the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, and the like.
- the described units may or may not be physically separate, ie may be located in one place, or may be distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Claims (22)
- A chip device, characterized in that the chip device comprises a main unit and a plurality of basic units, the main unit being a hardware chip unit and the basic units also being hardware chip units; the main unit is configured to perform each successive operation in a neural network operation and to transmit data with the basic units; the basic units are configured to perform the operations accelerated in parallel in the neural network according to the data transmitted by the main unit, and to transmit the operation results to the main unit.
- The chip device according to claim 1, characterized in that the main unit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks, distribute the plurality of basic data blocks to the plurality of basic units, and broadcast the broadcast data block to the plurality of basic units; the basic unit is configured to perform an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and to send the operation result to the main unit; the main unit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction.
- The chip device according to claim 2, characterized in that the chip device further comprises a branch unit disposed between the main unit and at least one basic unit; the branch unit is configured to forward data between the main unit and the at least one basic unit.
- The chip device according to claim 2 or 3, characterized in that the main unit is specifically configured to broadcast the broadcast data block to the plurality of basic units in a single broadcast.
- The chip device according to claim 4, characterized in that the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing results to obtain an operation result, and send the operation result to the main unit.
- The chip device according to claim 4, characterized in that the main unit is configured to, when the operation result is the result of inner product processing, accumulate the operation results to obtain an accumulation result, and arrange the accumulation result to obtain the instruction result of the data block to be calculated and the operation instruction.
- The chip device according to claim 2 or 3, characterized in that the main unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple broadcasts.
- The chip device according to claim 7, characterized in that the basic unit is specifically configured to perform one inner product processing on the partial broadcast data block and the basic data block to obtain an inner product processing result, accumulate the inner product processing results to obtain a partial operation result, and send the partial operation result to the main unit.
- The chip device according to claim 8, characterized in that the basic unit is specifically configured to multiplex the partial broadcast data block n times, performing inner product operations of the partial broadcast data block with the n basic data blocks to obtain n partial processing results; the n partial processing results are respectively accumulated to obtain n partial operation results, and the n partial operation results are sent to the main unit, where n is an integer greater than or equal to 2.
- The chip device according to claim 1, characterized in that the main unit comprises one or any combination of a main register or a main on-chip cache circuit; the basic unit comprises one or any combination of a basic register or a basic on-chip cache circuit.
- The chip device according to claim 10, characterized in that the main unit comprises one or any combination of a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, or a data rearrangement circuit.
- The chip device according to claim 10 or 11, characterized in that the basic unit comprises one or any combination of an inner product operator circuit, an accumulator circuit, and the like.
- The chip device according to claim 2, wherein the branch unit is a plurality of branch units, the main unit is respectively connected to the plurality of branch units, and each branch unit is connected to at least one basic unit.
- The chip device according to claim 2, wherein the branch unit is a plurality of branch units, the plurality of branch units are connected in series and then connected to the main unit, and each branch unit is respectively connected to at least one basic unit.
- The chip device according to claim 13, characterized in that the branch unit is specifically configured to forward data between the main unit and the basic unit.
- The chip device according to claim 14, characterized in that the branch unit is specifically configured to forward data between the main unit and the basic unit or other branch units.
- The chip device according to claim 1, characterized in that the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
- The chip device according to claim 2, characterized in that, if the operation instruction is a multiplication instruction, the multiplier data block is determined to be the broadcast data block and the multiplicand data block is determined to be the distribution data block; and if the operation instruction is a convolution instruction, the input data block is determined to be the broadcast data block and the convolution kernel is determined to be the distribution data block.
- A chip, characterized in that the chip integrates the chip device according to any one of claims 1-18.
- A smart device, characterized in that the smart device comprises the chip according to claim 19.
- An operation method of a neural network, characterized in that the method is applied in a chip device, the chip device comprising a main unit and at least one basic unit, and the method comprises the following steps: the main unit performs each successive operation in a neural network operation and transmits data with the basic unit; the basic unit performs the operations accelerated in parallel in the neural network according to the data transmitted by the main unit, and transmits the operation results to the main unit.
- The method according to claim 21, characterized in that the method specifically comprises: the main unit acquires a data block to be calculated and an operation instruction, and divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; the main unit splits the distribution data block to obtain a plurality of basic data blocks and distributes the plurality of basic data blocks to the at least one basic unit; the main unit broadcasts the broadcast data block to the at least one basic unit; the basic unit performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result, and sends the operation result to the main unit; the main unit processes the operation result to obtain the instruction result of the data block to be calculated and the operation instruction.
Priority Applications (28)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910102972.4A CN109902804B (zh) | 2017-08-31 | 2017-08-31 | 一种池化运算方法及装置 |
CN201780002287.3A CN109729734B8 (zh) | 2017-08-31 | 2017-08-31 | 芯片装置及相关产品 |
CN202010628834.2A CN111860815A (zh) | 2017-08-31 | 2017-08-31 | 一种卷积运算方法及装置 |
JP2019553977A JP7065877B2 (ja) | 2017-08-31 | 2017-08-31 | チップ装置および関連製品 |
CN201910530860.9A CN110245751B (zh) | 2017-08-31 | 2017-08-31 | 一种gemm运算方法及装置 |
EP19212368.5A EP3654210A1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related products |
KR1020197029020A KR102467688B1 (ko) | 2017-08-31 | 2017-08-31 | 칩 장치 및 관련 제품 |
CN201910531031.2A CN110222308B (zh) | 2017-08-31 | 2017-08-31 | 一种矩阵乘矩阵运算方法及装置 |
CN201910534118.5A CN110231958B (zh) | 2017-08-31 | 2017-08-31 | 一种矩阵乘向量运算方法及装置 |
CN201910534528.XA CN110245752B (zh) | 2017-08-31 | 2017-08-31 | 一种使用芯片装置进行全连接运算方法及装置 |
EP19212002.0A EP3651031A1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related products |
CN201910534527.5A CN110083390B (zh) | 2017-08-31 | 2017-08-31 | 一种gemv运算运算方法及装置 |
EP19212365.1A EP3654209A1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related products |
PCT/CN2017/099991 WO2019041251A1 (zh) | 2017-08-31 | 2017-08-31 | 芯片装置及相关产品 |
EP17923228.5A EP3605402B1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related product |
EP19211995.6A EP3651030A1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related products |
KR1020197037903A KR102477404B1 (ko) | 2017-08-31 | 2017-08-31 | 칩 장치 및 관련 제품 |
KR1020197037895A KR102481256B1 (ko) | 2017-08-31 | 2017-08-31 | 칩 장치 및 관련 제품 |
CN201811462676.7A CN109615061B (zh) | 2017-08-31 | 2017-08-31 | 一种卷积运算方法及装置 |
EP19212010.3A EP3654208A1 (en) | 2017-08-31 | 2017-08-31 | Chip device and related products |
TW107125681A TWI749249B (zh) | 2017-08-31 | 2018-07-25 | 芯片裝置、芯片、智能設備以及神經網絡的運算方法 |
US16/168,778 US11409535B2 (en) | 2017-08-31 | 2018-10-23 | Processing device and related products |
US16/663,174 US11775311B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
US16/663,181 US11561800B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
US16/663,206 US11334363B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
US16/663,205 US11347516B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
US16/663,210 US11354133B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
US16/663,164 US11531553B2 (en) | 2017-08-31 | 2019-10-24 | Processing device and related products |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/099991 WO2019041251A1 (zh) | 2017-08-31 | 2017-08-31 | Chip device and related products
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/168,778 Continuation US11409535B2 (en) | 2017-08-31 | 2018-10-23 | Processing device and related products |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019041251A1 (zh) | 2019-03-07
Family
ID=65436282
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/099991 WO2019041251A1 (zh) | 2017-08-31 | 2017-08-31 | Chip device and related products
Country Status (7)
Country | Link |
---|---|
US (7) | US11409535B2 (zh) |
EP (6) | EP3651031A1 (zh) |
JP (1) | JP7065877B2 (zh) |
KR (3) | KR102477404B1 (zh) |
CN (8) | CN110222308B (zh) |
TW (1) | TWI749249B (zh) |
WO (1) | WO2019041251A1 (zh) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111126582A (zh) * | 2019-12-20 | 2020-05-08 | 上海寒武纪信息科技有限公司 | Data processing method and related products
CN111161705A (zh) * | 2019-12-19 | 2020-05-15 | 上海寒武纪信息科技有限公司 | Voice conversion method and device
CN113743598A (zh) * | 2020-05-27 | 2021-12-03 | 杭州海康威视数字技术股份有限公司 | Method and device for determining the operating mode of an AI chip
CN114936633A (zh) * | 2022-06-15 | 2022-08-23 | 北京爱芯科技有限公司 | Data processing unit for transposition operations and image transposition operation method
Families Citing this family (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992743B (zh) * | 2017-12-29 | 2020-06-16 | 华为技术有限公司 | Matrix multiplier
CN116991226A (zh) * | 2018-02-14 | 2023-11-03 | 上海寒武纪信息科技有限公司 | Control device, method, and equipment for a processor
CN110210610B (zh) * | 2018-03-27 | 2023-06-20 | 腾讯科技(深圳)有限公司 | Convolution computation accelerator, convolution computation method, and convolution computation device
US11277455B2 (en) | 2018-06-07 | 2022-03-15 | Mellanox Technologies, Ltd. | Streaming system |
US20200106828A1 (en) * | 2018-10-02 | 2020-04-02 | Mellanox Technologies, Ltd. | Parallel Computation Network Device |
CN110162799B (zh) * | 2018-11-28 | 2023-08-04 | 腾讯科技(深圳)有限公司 | Model training method, machine translation method, and related devices and equipment
US11175946B2 (en) * | 2018-12-06 | 2021-11-16 | Advanced Micro Devices, Inc. | Pipelined matrix multiplication at a graphics processing unit |
US11657119B2 (en) * | 2018-12-10 | 2023-05-23 | Advanced Micro Devices, Inc. | Hardware accelerated convolution |
US11625393B2 (en) | 2019-02-19 | 2023-04-11 | Mellanox Technologies, Ltd. | High performance computing system |
EP3699770A1 (en) | 2019-02-25 | 2020-08-26 | Mellanox Technologies TLV Ltd. | Collective communication system and methods |
WO2021009901A1 (ja) * | 2019-07-18 | 2021-01-21 | 技術研究組合光電子融合基盤技術研究所 | Parallel computation method and system
US11481471B2 (en) * | 2019-08-16 | 2022-10-25 | Meta Platforms, Inc. | Mapping convolution to a matrix processor unit |
CN110516793B (zh) * | 2019-08-27 | 2022-06-17 | Oppo广东移动通信有限公司 | Pooling processing method and device, and storage medium
CN110826687B (zh) * | 2019-08-30 | 2023-11-21 | 安谋科技(中国)有限公司 | Data processing method and device, medium, and system
US12039430B2 (en) * | 2019-11-15 | 2024-07-16 | Samsung Electronics Co., Ltd. | Electronic device and method for inference binary and ternary neural networks |
KR20210071471A (ko) * | 2019-12-06 | 2021-06-16 | 삼성전자주식회사 | Device and method for performing matrix multiplication operations of a neural network
US11750699B2 (en) | 2020-01-15 | 2023-09-05 | Mellanox Technologies, Ltd. | Small message aggregation |
US11252027B2 (en) | 2020-01-23 | 2022-02-15 | Mellanox Technologies, Ltd. | Network element supporting flexible data reduction operations |
US10713493B1 (en) * | 2020-02-06 | 2020-07-14 | Shenzhen Malong Technologies Co., Ltd. | 4D convolutional neural networks for video recognition |
US11876885B2 (en) | 2020-07-02 | 2024-01-16 | Mellanox Technologies, Ltd. | Clock queue with arming and/or self-arming features |
CN114115995A (zh) * | 2020-08-27 | 2022-03-01 | 华为技术有限公司 | Artificial intelligence chip and computing board, data processing method, and electronic device
CN112491555B (zh) * | 2020-11-20 | 2022-04-05 | 山西智杰软件工程有限公司 | Processing method for medical electronic signatures and electronic device
CN112416433B (zh) * | 2020-11-24 | 2023-01-17 | 中科寒武纪科技股份有限公司 | Data processing device, data processing method, and related products
US11556378B2 (en) | 2020-12-14 | 2023-01-17 | Mellanox Technologies, Ltd. | Offloading execution of a multi-task parameter-dependent operation to a network device |
CN112953701B (zh) * | 2021-02-04 | 2023-10-31 | 沈阳建筑大学 | Four-dimensional chaotic circuit device
CN112799598B (zh) * | 2021-02-08 | 2022-07-15 | 清华大学 | Data processing method, processor, and electronic device
CN113240570B (zh) * | 2021-04-13 | 2023-01-06 | 华南理工大学 | GEMM operation accelerator and GoogLeNet-based image processing acceleration method
CN112990370B (zh) * | 2021-04-26 | 2021-09-10 | 腾讯科技(深圳)有限公司 | Image data processing method and device, storage medium, and electronic device
CN115481713A (zh) * | 2021-06-15 | 2022-12-16 | 瑞昱半导体股份有限公司 | Method for improving computation of convolutional neural networks
KR20230068572A (ko) * | 2021-11-11 | 2023-05-18 | 삼성전자주식회사 | Connection circuit in a memory array
CN116150555A (zh) * | 2021-11-19 | 2023-05-23 | 中科寒武纪科技股份有限公司 | Computing device, method for performing convolution operations using the computing device, and related products
US11922237B1 (en) | 2022-09-12 | 2024-03-05 | Mellanox Technologies, Ltd. | Single-step collective operations |
CN117974417B (zh) * | 2024-03-28 | 2024-07-02 | 腾讯科技(深圳)有限公司 | AI chip, electronic device, and image processing method
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105488565A (zh) * | 2015-11-17 | 2016-04-13 | 中国科学院计算技术研究所 | Computing device and method for an accelerator chip accelerating deep neural network algorithms
CN105608490A (zh) * | 2015-07-29 | 2016-05-25 | 上海磁宇信息科技有限公司 | Cell array computing system and communication method therein
CN105930902A (zh) * | 2016-04-18 | 2016-09-07 | 中国科学院计算技术研究所 | Neural network processing method and system
CN105956659A (zh) * | 2016-05-11 | 2016-09-21 | 北京比特大陆科技有限公司 | Data processing device and system, and server
US20170193368A1 (en) * | 2015-12-30 | 2017-07-06 | Amazon Technologies, Inc. | Conditional parallel processing in fully-connected neural networks |
Family Cites Families (86)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5023833A (en) * | 1987-12-08 | 1991-06-11 | California Institute Of Technology | Feed forward neural network for unary associative memory |
US5956703A (en) * | 1995-07-28 | 1999-09-21 | Delco Electronics Corporation | Configurable neural network integrated circuit |
JPH117438A (ja) * | 1997-06-18 | 1999-01-12 | Fuji Xerox Co Ltd | Multiply-accumulate processing method, device, and recording medium
JP2001188767A (ja) * | 1999-12-28 | 2001-07-10 | Fuji Xerox Co Ltd | Neural network arithmetic device and neural network operation method
US7672952B2 (en) * | 2000-07-13 | 2010-03-02 | Novell, Inc. | System and method of semantic correlation of rich content |
US6925479B2 (en) * | 2001-04-30 | 2005-08-02 | Industrial Technology Research Institute | General finite-field multiplier and method of the same |
US7065544B2 (en) * | 2001-11-29 | 2006-06-20 | Hewlett-Packard Development Company, L.P. | System and method for detecting repetitions in a multimedia stream |
US7737994B1 (en) * | 2003-09-26 | 2010-06-15 | Oracle America, Inc. | Large-kernel convolution using multiple industry-standard graphics accelerators |
US20050125477A1 (en) * | 2003-12-04 | 2005-06-09 | Genov Roman A. | High-precision matrix-vector multiplication on a charge-mode array with embedded dynamic memory and stochastic method thereof |
US7634137B2 (en) * | 2005-10-14 | 2009-12-15 | Microsoft Corporation | Unfolded convolution for fast feature extraction |
GB2453263A (en) * | 2006-05-16 | 2009-04-01 | Douglas S Greer | System and method for modeling the neocortex and uses therefor |
US8644643B2 (en) * | 2006-06-14 | 2014-02-04 | Qualcomm Incorporated | Convolution filtering in a graphics processor |
JP4942095B2 (ja) * | 2007-01-25 | 2012-05-30 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Technique for performing operations with a multi-core processor
US20080288756A1 (en) * | 2007-05-18 | 2008-11-20 | Johnson Timothy J | "or" bit matrix multiply vector instruction |
US8190543B2 (en) * | 2008-03-08 | 2012-05-29 | Tokyo Electron Limited | Autonomous biologically based learning tool |
EP2996035A1 (en) * | 2008-10-15 | 2016-03-16 | Hyperion Core, Inc. | Data processing device |
US20100122070A1 (en) * | 2008-11-07 | 2010-05-13 | Nokia Corporation | Combined associative and distributed arithmetics for multiple inner products |
US20110025816A1 (en) * | 2009-07-31 | 2011-02-03 | Microsoft Corporation | Advertising as a real-time video call |
US8577950B2 (en) * | 2009-08-17 | 2013-11-05 | International Business Machines Corporation | Matrix multiplication operations with data pre-conditioning in a high performance computing architecture |
US8583896B2 (en) * | 2009-11-13 | 2013-11-12 | Nec Laboratories America, Inc. | Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain |
US20110314256A1 (en) * | 2010-06-18 | 2011-12-22 | Microsoft Corporation | Data Parallel Programming Model |
US8577820B2 (en) * | 2011-03-04 | 2013-11-05 | Tokyo Electron Limited | Accurate and fast neural network training for library-based critical dimension (CD) metrology |
US10078620B2 (en) * | 2011-05-27 | 2018-09-18 | New York University | Runtime reconfigurable dataflow processor with multi-port memory access module |
CN102214160B (zh) * | 2011-07-08 | 2013-04-17 | 中国科学技术大学 | Single-precision matrix multiplication optimization method based on the Loongson 3A
CN103631761B (zh) * | 2012-08-29 | 2018-02-27 | 睿励科学仪器(上海)有限公司 | Method for performing matrix operations on a parallel processing architecture for rigorous coupled-wave analysis
DE102013104567A1 (de) * | 2013-05-03 | 2014-11-06 | Infineon Technologies Ag | Chip arrangement, chip card arrangement, and method for producing a chip arrangement
CN103440121B (zh) * | 2013-08-20 | 2016-06-29 | 中国人民解放军国防科学技术大学 | Vectorization method for triangular matrix multiplication on vector processors
DE102013109200A1 (de) * | 2013-08-26 | 2015-02-26 | Infineon Technologies Austria Ag | Chip, chip arrangement, and method for producing a chip
CN104425299B (zh) * | 2013-08-27 | 2017-08-11 | 珠海艾派克微电子有限公司 | Chip processing device and chip processing method using the same
US20150324686A1 (en) * | 2014-05-12 | 2015-11-12 | Qualcomm Incorporated | Distributed model learning |
CN104036451B (zh) * | 2014-06-20 | 2018-12-11 | 深圳市腾讯计算机系统有限公司 | Model parallel processing method and device based on multiple graphics processors
CN104317352B (zh) * | 2014-10-13 | 2017-10-24 | 中国科学院光电技术研究所 | Fast tilt-component removal method for an adaptive optics control system
CN104346318B (zh) * | 2014-10-15 | 2017-03-15 | 中国人民解放军国防科学技术大学 | Matrix multiplication acceleration method for general-purpose multi-core DSPs
CN104463324A (zh) * | 2014-11-21 | 2015-03-25 | 长沙马沙电子科技有限公司 | Convolutional neural network parallel processing method based on large-scale high-performance clusters
CN105701120B (zh) * | 2014-11-28 | 2019-05-03 | 华为技术有限公司 | Method and device for determining semantic matching degree
CN104992430B (zh) * | 2015-04-14 | 2017-12-22 | 杭州奥视图像技术有限公司 | Fully automatic 3D liver segmentation method based on convolutional neural networks
CN104866855A (zh) * | 2015-05-07 | 2015-08-26 | 华为技术有限公司 | Image feature extraction method and device
US10489703B2 (en) | 2015-05-20 | 2019-11-26 | Nec Corporation | Memory efficiency for convolutional neural networks operating on graphics processing units |
US10417555B2 (en) * | 2015-05-29 | 2019-09-17 | Samsung Electronics Co., Ltd. | Data-optimized neural network traversal |
CN104866904B (zh) * | 2015-06-16 | 2019-01-01 | 中电科软件信息服务有限公司 | Spark-based parallelization method for BP neural networks optimized with a genetic algorithm
CN105005911B (zh) * | 2015-06-26 | 2017-09-19 | 深圳市腾讯计算机系统有限公司 | Computing system and method for deep neural networks
CN106293893B (zh) * | 2015-06-26 | 2019-12-06 | 阿里巴巴集团控股有限公司 | Job scheduling method, device, and distributed system
WO2017031630A1 (zh) * | 2015-08-21 | 2017-03-02 | 中国科学院自动化研究所 | Acceleration and compression method for deep convolutional neural networks based on parameter quantization
CN105260776B (zh) * | 2015-09-10 | 2018-03-27 | 华为技术有限公司 | Neural network processor and convolutional neural network processor
CN106548124B (zh) * | 2015-09-17 | 2021-09-07 | 松下知识产权经营株式会社 | Topic estimation system and topic estimation method
CN106447035B (zh) * | 2015-10-08 | 2019-02-26 | 上海兆芯集成电路有限公司 | Processor with variable-rate execution units
EP3154001B1 (en) * | 2015-10-08 | 2019-07-17 | VIA Alliance Semiconductor Co., Ltd. | Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory |
CN105373517A (zh) * | 2015-11-09 | 2016-03-02 | 南京大学 | Spark-based parallelized method for distributed dense matrix inversion
CN105426344A (zh) * | 2015-11-09 | 2016-03-23 | 南京大学 | Spark-based matrix computation method for distributed large-scale matrix multiplication
CN105608056A (zh) * | 2015-11-09 | 2016-05-25 | 南京大学 | Flink-based parallelized computation method for large-scale matrices
US11024024B2 (en) * | 2015-12-15 | 2021-06-01 | The Regents Of The University Of California | Systems and methods for analyzing perfusion-weighted medical imaging using deep neural networks |
CN105512723B (zh) * | 2016-01-20 | 2018-02-16 | 南京艾溪信息科技有限公司 | Artificial neural network computing device and method for sparsely connected networks
CN109242094B (zh) * | 2016-01-20 | 2020-05-08 | 中科寒武纪科技股份有限公司 | Device and method for performing artificial neural network forward operations
CN110135581B (zh) * | 2016-01-20 | 2020-11-06 | 中科寒武纪科技股份有限公司 | Device and method for performing artificial neural network backward operations
US11055063B2 (en) * | 2016-05-02 | 2021-07-06 | Marvell Asia Pte, Ltd. | Systems and methods for deep learning processor |
US10796220B2 (en) * | 2016-05-24 | 2020-10-06 | Marvell Asia Pte, Ltd. | Systems and methods for vectorized FFT for multi-dimensional convolution operations |
KR102120396B1 (ko) * | 2016-05-26 | 2020-06-08 | 더 가버닝 카운슬 오브 더 유니버시티 오브 토론토 | Accelerator for deep neural networks
CN106126481B (zh) * | 2016-06-29 | 2019-04-12 | 华为技术有限公司 | Computing system and electronic device
CN106203621B (zh) * | 2016-07-11 | 2019-04-30 | 北京深鉴智能科技有限公司 | Processor for convolutional neural network computation
CN106228240B (zh) * | 2016-07-30 | 2020-09-01 | 复旦大学 | FPGA-based implementation method for deep convolutional neural networks
US10891538B2 (en) * | 2016-08-11 | 2021-01-12 | Nvidia Corporation | Sparse convolutional neural network accelerator |
US20180046903A1 (en) * | 2016-08-12 | 2018-02-15 | DeePhi Technology Co., Ltd. | Deep processing unit (dpu) for implementing an artificial neural network (ann) |
CN106407561B (zh) * | 2016-09-19 | 2020-07-03 | 复旦大学 | Partitioning method for a parallel GPDT algorithm on a multi-core SoC
CN106446546B (zh) * | 2016-09-23 | 2019-02-22 | 西安电子科技大学 | Meteorological data imputation method based on a convolutional auto-encoding/decoding algorithm
CN106650922B (zh) * | 2016-09-29 | 2019-05-03 | 清华大学 | Hardware neural network conversion method, computing device, and software-hardware cooperation system
CN106504232B (zh) * | 2016-10-14 | 2019-06-14 | 北京网医智捷科技有限公司 | Automatic pulmonary nodule detection system based on 3D convolutional neural networks
US9779786B1 (en) * | 2016-10-26 | 2017-10-03 | Xilinx, Inc. | Tensor operations and acceleration |
CN107239824A (zh) * | 2016-12-05 | 2017-10-10 | 北京深鉴智能科技有限公司 | Device and method for implementing a sparse convolutional neural network accelerator
WO2018103736A1 (en) * | 2016-12-09 | 2018-06-14 | Beijing Horizon Information Technology Co., Ltd. | Systems and methods for data management |
CN106844294B (zh) * | 2016-12-29 | 2019-05-03 | 华为机器有限公司 | Convolution operation chip and communication device
US10402527B2 (en) * | 2017-01-04 | 2019-09-03 | Stmicroelectronics S.R.L. | Reconfigurable interconnect |
IT201700008949A1 (it) * | 2017-01-27 | 2018-07-27 | St Microelectronics Srl | Method of operating neural networks, and corresponding network, apparatus, and computer program product
CN106940815B (zh) * | 2017-02-13 | 2020-07-28 | 西安交通大学 | Programmable convolutional neural network coprocessor IP core
CN106951395B (zh) * | 2017-02-13 | 2018-08-17 | 上海客鹭信息技术有限公司 | Parallel convolution operation method and device for compressed convolutional neural networks
US11132599B2 (en) * | 2017-02-28 | 2021-09-28 | Microsoft Technology Licensing, Llc | Multi-function unit for programmable hardware nodes for neural network processing |
CN107066239A (zh) * | 2017-03-01 | 2017-08-18 | 智擎信息系统(上海)有限公司 | Hardware structure for implementing forward computation of convolutional neural networks
US10528147B2 (en) * | 2017-03-06 | 2020-01-07 | Microsoft Technology Licensing, Llc | Ultrasonic based gesture recognition |
US11360770B2 (en) * | 2017-03-20 | 2022-06-14 | Intel Corporation | Systems, methods, and apparatuses for zeroing a matrix |
CN106970896B (zh) * | 2017-03-30 | 2020-05-12 | 中国人民解放军国防科学技术大学 | Vectorized implementation method for two-dimensional matrix convolution on vector processors
US10186011B2 (en) * | 2017-04-28 | 2019-01-22 | Intel Corporation | Programmable coarse grained and sparse matrix compute hardware with advanced scheduling |
US10169298B1 (en) * | 2017-05-11 | 2019-01-01 | NovuMind Limited | Native tensor processor, using outer product unit |
WO2018222900A1 (en) * | 2017-05-31 | 2018-12-06 | Intel Corporation | Computationally-efficient quaternion-based machine-learning system |
US10167800B1 (en) * | 2017-08-18 | 2019-01-01 | Microsoft Technology Licensing, Llc | Hardware node having a matrix vector unit with block-floating point processing |
US10963780B2 (en) * | 2017-08-24 | 2021-03-30 | Google Llc | Yield improvements for three-dimensionally stacked neural network accelerators |
US20190102671A1 (en) * | 2017-09-29 | 2019-04-04 | Intel Corporation | Inner product convolutional neural network accelerator |
US11222256B2 (en) * | 2017-10-17 | 2022-01-11 | Xilinx, Inc. | Neural network processing system having multiple processors and a neural network accelerator |
2017
- 2017-08-31 EP EP19212002.0A patent/EP3651031A1/en active Pending
- 2017-08-31 EP EP19212010.3A patent/EP3654208A1/en active Pending
- 2017-08-31 KR KR1020197037903A patent/KR102477404B1/ko active IP Right Grant
- 2017-08-31 CN CN201910531031.2A patent/CN110222308B/zh active Active
- 2017-08-31 EP EP19212365.1A patent/EP3654209A1/en active Pending
- 2017-08-31 CN CN201910530860.9A patent/CN110245751B/zh active Active
- 2017-08-31 CN CN201910534528.XA patent/CN110245752B/zh active Active
- 2017-08-31 EP EP17923228.5A patent/EP3605402B1/en active Active
- 2017-08-31 EP EP19212368.5A patent/EP3654210A1/en active Pending
- 2017-08-31 CN CN201780002287.3A patent/CN109729734B8/zh active Active
- 2017-08-31 WO PCT/CN2017/099991 patent/WO2019041251A1/zh unknown
- 2017-08-31 KR KR1020197029020A patent/KR102467688B1/ko active IP Right Grant
- 2017-08-31 EP EP19211995.6A patent/EP3651030A1/en active Pending
- 2017-08-31 JP JP2019553977A patent/JP7065877B2/ja active Active
- 2017-08-31 CN CN201910534527.5A patent/CN110083390B/zh active Active
- 2017-08-31 CN CN202010628834.2A patent/CN111860815A/zh active Pending
- 2017-08-31 CN CN201910534118.5A patent/CN110231958B/zh active Active
- 2017-08-31 CN CN201910102972.4A patent/CN109902804B/zh active Active
- 2017-08-31 KR KR1020197037895A patent/KR102481256B1/ko active IP Right Grant
2018
- 2018-07-25 TW TW107125681A patent/TWI749249B/zh active
- 2018-10-23 US US16/168,778 patent/US11409535B2/en active Active
2019
- 2019-10-24 US US16/663,206 patent/US11334363B2/en active Active
- 2019-10-24 US US16/663,205 patent/US11347516B2/en active Active
- 2019-10-24 US US16/663,174 patent/US11775311B2/en active Active
- 2019-10-24 US US16/663,181 patent/US11561800B2/en active Active
- 2019-10-24 US US16/663,210 patent/US11354133B2/en active Active
- 2019-10-24 US US16/663,164 patent/US11531553B2/en active Active
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111161705A (zh) * | 2019-12-19 | 2020-05-15 | 上海寒武纪信息科技有限公司 | Voice conversion method and device
CN111126582A (zh) * | 2019-12-20 | 2020-05-08 | 上海寒武纪信息科技有限公司 | Data processing method and related products
CN111126582B (zh) * | 2019-12-20 | 2024-04-05 | 上海寒武纪信息科技有限公司 | Data processing method and related products
CN113743598A (zh) * | 2020-05-27 | 2021-12-03 | 杭州海康威视数字技术股份有限公司 | Method and device for determining the operating mode of an AI chip
CN113743598B (zh) * | 2020-05-27 | 2023-08-04 | 杭州海康威视数字技术股份有限公司 | Method and device for determining the operating mode of an AI chip
CN114936633A (zh) * | 2022-06-15 | 2022-08-23 | 北京爱芯科技有限公司 | Data processing unit for transposition operations and image transposition operation method
Also Published As
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI749249B (zh) | Chip device, chip, smart device, and neural network operation method | |
CN109615061B (zh) | Convolution operation method and device | |
JP6888073B2 (ja) | Chip device and related products | |
JP6888074B2 (ja) | Chip device and related products | |
CN109615062B (zh) | Convolution operation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17923228 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2019553977 Country of ref document: JP Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 20197029020 Country of ref document: KR Kind code of ref document: A |
|
ENP | Entry into the national phase |
Ref document number: 2017923228 Country of ref document: EP Effective date: 20191024 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |