CN109902804B - Pooling operation method and device - Google Patents

Pooling operation method and device

Info

Publication number
CN109902804B
Authority
CN
China
Prior art keywords
data
circuit
input data
sliding
chip
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910102972.4A
Other languages
Chinese (zh)
Other versions
CN109902804A (en)
Inventor
Inventor not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Anhui Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Cambricon Information Technology Co Ltd filed Critical Anhui Cambricon Information Technology Co Ltd
Priority to CN201910102972.4A priority Critical patent/CN109902804B/en
Publication of CN109902804A publication Critical patent/CN109902804A/en
Application granted granted Critical
Publication of CN109902804B publication Critical patent/CN109902804B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3818Decoding for concurrent execution
    • G06F9/3822Parallel decoding, e.g. parallel decode units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3885Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2218/00Aspects of pattern recognition specially adapted for signal processing
    • G06F2218/02Preprocessing

Abstract

The present disclosure provides an operation method and an operation device, wherein the method is applied to a chip device and the chip device is configured to execute operations. The technical solution provided by the disclosure has the advantage of low energy consumption.

Description

Pooling operation method and device
Technical Field
The application relates to the technical field of chip processing, in particular to a pooling operation method and device.
Background
Artificial neural networks (ANNs) have been a research hotspot in the field of artificial intelligence since the 1980s. An artificial neural network abstracts the neuron network of the human brain from an information-processing perspective, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is often referred to directly as a neural network or neural-like network. A neural network is an operational model formed by interconnecting a large number of nodes (or neurons). Existing neural network operations are performed on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU); such operations consume much power and take a long computation time.
Disclosure of Invention
The embodiments of the application provide a pooling operation method and device, which can increase the processing speed of the pooling operation, improve efficiency, and reduce power consumption.
In a first aspect, a pooling operation method is provided, which is applied to a chip device and includes the following steps:
the chip device receives input data and a weight, wherein the input data is four-dimensional data whose four dimensions are N, H, W and C respectively; the C direction is the innermost layer of the four-dimensional data and the H direction is the height direction of the four-dimensional data;
the chip device arranges input data along the C direction to obtain input data NHWC or input data NWHC;
the chip device and the k slave circuits perform convolution operation on input data NHWC or input data NWHC and the weight to obtain a convolution calculation result.
In a second aspect, a chip apparatus is provided, the chip apparatus comprising:
the chip device is used for receiving input data and a weight, wherein the input data is four-dimensional data whose four dimensions are N, H, W and C respectively; the C direction is the innermost layer of the four-dimensional data and the H direction is the height direction of the four-dimensional data;
the chip device is also used for arranging the input data along the C direction to obtain input data NWHC or input data NHWC;
the chip device is used for performing convolution operation on input data NHWC or input data NWHC and the weight to obtain a convolution calculation result; and K is an integer greater than or equal to 2.
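For illustration only, the following is a minimal NumPy sketch (not part of the claimed chip device) of the data rearrangement step described above, assuming the received input data is stored in N, C, H, W order and is rearranged so that the C dimension becomes the innermost, fastest-changing dimension; the sizes used are arbitrary examples.

    import numpy as np

    # Hypothetical example sizes; the patent does not fix these values.
    N, C, H, W = 2, 5, 10, 12

    input_nchw = np.arange(N * C * H * W, dtype=np.float32).reshape(N, C, H, W)

    # Rearrange along the C direction so that C is the innermost layer: NCHW -> NHWC.
    input_nhwc = np.ascontiguousarray(input_nchw.transpose(0, 2, 3, 1))

    # An alternative arrangement mentioned in the text: NWHC.
    input_nwhc = np.ascontiguousarray(input_nchw.transpose(0, 3, 2, 1))

    print(input_nhwc.shape)  # (2, 10, 12, 5): values along C are now adjacent in memory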
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed for the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that other drawings can be obtained by those of ordinary skill in the art based on these drawings without creative effort.
Fig. 1a is a schematic structural diagram of a chip apparatus provided in the present disclosure.
Fig. 1b is a schematic structural diagram of another chip device provided in the present disclosure.
Fig. 1c is a data distribution schematic diagram of a chip apparatus provided by the present disclosure.
Fig. 1d is a schematic diagram of data return of a chip device.
Fig. 2 is a flowchart illustrating an operation method of a neural network according to an embodiment of the disclosure.
Fig. 2a is a schematic diagram of a matrix a multiplied by a matrix B according to an embodiment of the disclosure.
Fig. 3 is a flowchart illustrating an operation method of a neural network according to an embodiment of the disclosure.
Fig. 3a is a schematic diagram of single sample data of full connection 1.
Fig. 3b is a diagram of multi-sample data for full connection 2.
FIG. 3c is a graph of M convolution kernel data for convolution 1.
Fig. 3d is a schematic diagram of convolution 2 input data.
FIG. 3e is a diagram of an operation window of a three-dimensional data block of input data.
FIG. 3f is a diagram of another window of operation for a three-dimensional block of input data.
FIG. 3g is a diagram of another exemplary operating window for a three-dimensional data block of input data.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The following describes an operation method of a neural network, taking a CPU as an example. Matrix-matrix multiplication is used heavily in neural networks, so the operation of the CPU is described here by taking the multiplication of a matrix A and a matrix B as an example. Assume the result of multiplying matrix A by matrix B is C, i.e., C = A × B, with
A = [a11 a12 a13; a21 a22 a23; a31 a32 a33] and B = [b11 b12 b13; b21 b22 b23; b31 b32 b33].
For the CPU, the calculation of C may be performed by completing the first row, then the second row, and finally the third row; that is, the CPU starts on the second row of data only after one row of data has been calculated. Taking the above formula as an example, the CPU first completes the first row, i.e., computes a11*b11+a12*b21+a13*b31, a11*b12+a12*b22+a13*b32 and a11*b13+a12*b23+a13*b33; it then computes a21*b11+a22*b21+a23*b31, a21*b12+a22*b22+a23*b32 and a21*b13+a22*b23+a23*b33; and finally computes a31*b11+a32*b21+a33*b31, a31*b12+a32*b22+a33*b32 and a31*b13+a32*b23+a33*b33.
Therefore, for a CPU or a GPU, the rows must be calculated one after another: after the first row is finished, the second row is calculated, then the third row, and so on until all rows have been calculated.
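As a purely illustrative sketch (not part of the patented scheme), the serial row-by-row computation described above can be expressed in Python as follows; the matrices A and B here are arbitrary example inputs.

    import numpy as np

    def serial_matmul(A, B):
        # Compute C = A x B one output row at a time, as a CPU would in the
        # scheme described above: the second row starts only after the first
        # row is completely finished.
        M, K = A.shape
        K2, N = B.shape
        assert K == K2
        C = np.zeros((M, N), dtype=A.dtype)
        for i in range(M):            # one row after another
            for j in range(N):
                C[i, j] = np.dot(A[i, :], B[:, j])   # one inner product
        return C

    A = np.random.rand(3, 3).astype(np.float32)
    B = np.random.rand(3, 3).astype(np.float32)
    assert np.allclose(serial_matmul(A, B), A @ B)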
Referring to fig. 1b, fig. 1b is a schematic structural diagram of a chip apparatus. As shown in fig. 1b, the chip apparatus includes: a main unit circuit, basic unit circuits, and branch unit circuits. The main unit may include a register and/or an on-chip cache circuit, and may further include one or any combination of: a vector operator circuit, an arithmetic and logic unit (ALU) circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access (DMA) circuit, a data rearrangement circuit, and the like. Each basic unit may include a basic register and/or a basic on-chip cache circuit, and may further include one or any combination of: an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The circuits may all be integrated circuits. If branch units are provided, the main unit is connected to the branch units and the branch units are connected to the basic units; the basic units are used for performing inner product operations between data blocks, the main unit is used for transceiving external data and distributing the external data to the branch units, and the branch units are used for transceiving data of the main unit or the basic units. The structure shown in fig. 1b is suitable for the computation of complex data: because the number of units that can be connected to the main unit is limited, branch units need to be added between the main unit and the basic units to provide access to more basic units, thereby realizing computation on complex data blocks.
The connection structure of the branch unit and the base unit may be arbitrary and is not limited to the H-shaped structure of fig. 1 b. Alternatively, the master unit to the base unit is a broadcast or distribution structure, and the base unit to the master unit is a gather structure. The definitions of broadcast, distribution and collection are as follows:
The data transfer modes from the main unit to the basic units may include the following:
the main unit is connected to a plurality of branch units respectively, and each branch unit is connected to a plurality of basic units respectively;
the main unit is connected to one branch unit, that branch unit is connected to a further branch unit, and so on, so that a plurality of branch units are connected in series, and each branch unit is then connected to a plurality of basic units respectively;
the main unit is connected to a plurality of branch units respectively, and each branch unit is connected in series with a plurality of basic units;
the main unit is connected to one branch unit, that branch unit is connected to a further branch unit, and so on, so that a plurality of branch units are connected in series, and each branch unit is then connected in series with a plurality of basic units.
When distributing data, the main unit transmits data to part or all of the basic units, and the data received by each basic unit that receives data may be different;
when broadcasting data, the master unit transmits data to some or all of the base units, and each base unit receiving data receives the same data.
When collecting data, some or all of the base units transmit the data to the master unit. It should be noted that the chip device shown in fig. 1a or fig. 1b may be a single physical chip, and of course, in practical applications, the chip device may also be integrated into other chips (e.g., CPU, GPU).
Referring to fig. 1c, fig. 1c is a schematic data distribution diagram of a chip device, as shown by an arrow in fig. 1c, the arrow is a distribution direction of data, as shown in fig. 1c, after receiving external data, a main unit splits the external data and distributes the split external data to a plurality of branch units, and the branch units send the split data to a base unit.
Referring to fig. 1d, fig. 1d is a schematic diagram of data return of a chip device, as shown by an arrow in fig. 1d, the arrow is a data return direction, as shown in fig. 1d, a basic unit returns data (e.g., inner product calculation result) to a branch unit, and the branch unit returns the data to a main unit.
Referring to fig. 1a, fig. 1a is a schematic structural diagram of another chip device, which includes: a main unit and a base unit, the main unit being connected to the base unit. The configuration shown in fig. 1a has a limited number of connected base units, which is suitable for simple data calculation, since the base units are physically connected directly to the main unit.
Referring to fig. 2, fig. 2 provides a method for performing a neural network operation using the chip device, where the method is performed by using the chip device shown in fig. 1a or fig. 1b, and the method shown in fig. 2 includes the following steps:
step S201, the main unit of the chip device obtains a data block to be calculated and an operation instruction.
The data block to be calculated in step S201 may specifically be a matrix, a vector, three-dimensional data, four-dimensional data, multidimensional data, etc.; the embodiments of the present disclosure do not limit the specific representation of the data block. The operation instruction may specifically be a multiplication instruction, a convolution instruction, an addition instruction, a subtraction instruction, a BLAS (Basic Linear Algebra Subprograms) function, an activation function, etc.
Step S202, the main unit divides the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction.
The implementation method of the step S202 may specifically be:
if the operation instruction is a multiplication instruction, the multiplier data block is determined to be a broadcast data block, and the multiplicand data block is determined to be a distribution data block.
And if the operation instruction is a convolution instruction, determining the input data block as a broadcast data block and the convolution kernel as a distribution data block.
Step S2031, the master unit splits the distributed data block to obtain a plurality of basic data blocks, distributes the plurality of basic data blocks to a plurality of basic units,
in step S2032, the main unit broadcasts the broadcast data block to a plurality of base units.
Optionally, step S2031 and step S2032 may also be performed in a loop. When the data amount is large, the main unit splits the distribution data block to obtain a plurality of basic data blocks, splits each basic data block into m basic data sub-blocks, and splits the broadcast data block into m broadcast data sub-blocks; the main unit distributes one basic data sub-block and one broadcast data sub-block each time, where the basic data sub-block and the broadcast data sub-block are data blocks on which the neural network computation can be performed in parallel. For example, taking a 1000 × 1000 matrix A multiplied by a 1000 × 1000 matrix B as an example, the basic data block may be the z-th row of the matrix A, the basic data sub-block may be the first 20 columns of the z-th row of the matrix A, and the broadcast data sub-block may be the first 20 rows of the z-th column of the matrix B.
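The splitting described above can be sketched as follows; this is an illustrative NumPy approximation under the assumed 1000 × 1000 sizes, not the chip hardware itself. The distribution data block (matrix A) is split row-wise into basic data blocks, and an example basic data sub-block and broadcast data sub-block are formed from 20 columns of a row of A and 20 rows of a column of B.

    import numpy as np

    A = np.random.rand(1000, 1000).astype(np.float32)  # distribution data block
    B = np.random.rand(1000, 1000).astype(np.float32)  # broadcast data block

    # Basic data blocks: one row of A each (the minimum unit of the inner product).
    basic_blocks = [A[z, :] for z in range(A.shape[0])]

    # Example sub-blocks for row z, as in the text: the first 20 columns of row z of A
    # and the first 20 rows of column z of B form one parallel-computable pair.
    z = 0
    basic_sub_block = basic_blocks[z][:20]       # 20 values of row z of A
    broadcast_sub_block = B[:20, z]              # first 20 rows of column z of B

    # Their inner product is one partial result contributing to C[z, z].
    partial = np.dot(basic_sub_block, broadcast_sub_block)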
The basic data block in step S203 may specifically be the minimum data block capable of performing an inner product operation; taking matrix multiplication as an example, the basic data block may be one row of data of the matrix, and taking convolution as an example, the basic data block may be the weight of one convolution kernel.
The distribution manner in step S203 may refer to the description of the following embodiment, which is not described herein again, and the method for broadcasting the broadcast data block may also refer to the description of the following embodiment, which is not described herein again.
In step S2041, the basic unit of the chip device performs an inner product operation on the basic data block and the broadcast data block to obtain an operation result (possibly an intermediate result).
Step S2042, if the operation result is not an intermediate result, the operation result is returned to the main unit.
The foregoing returning manner in step S204 can be referred to in the following description of the embodiments, and is not described herein again.
In step S205, the main unit processes the operation result to obtain the data block to be calculated and the instruction result of the operation instruction.
The processing manner in step S205 may be accumulation, sorting, and the like, and the disclosure is not limited to the specific manner of the processing, which needs to be configured according to different operation instructions, and may further include performing nonlinear transformation, and the like.
According to the above technical solution, when an operation is executed, the main unit receives external data comprising the data block to be calculated and the operation instruction, obtains the data block to be calculated and the operation instruction, determines the distribution data block and the broadcast data block of the data block to be calculated according to the operation instruction, splits the distribution data block into a plurality of basic data blocks, broadcasts the broadcast data block to the plurality of basic units, and distributes the basic data blocks to the basic units; the basic units each perform inner product operations on their basic data blocks and the broadcast data block to obtain operation results and return the operation results to the main unit, and the main unit obtains the instruction result of the operation instruction from the returned operation results. The technical point of this solution is as follows. For a neural network, a large amount of the computation consists of inner product operations between data blocks, and the inner product operation has a large overhead and a long calculation time, so the disclosed embodiment first uses the operation instruction and the data to be operated on to distinguish, within the data block to be calculated, the distribution data block from the broadcast data block. The broadcast data block is the data block that must be used in full to implement the inner product operation, while the distribution data block is a data block that can be split in the inner product operation. Taking matrix multiplication as an example, the data blocks to be calculated are the matrix A and the matrix B and the operation instruction is the multiplication instruction (A × B); according to the rules of matrix multiplication, the matrix A is determined to be the splittable distribution data block and the matrix B is determined to be the broadcast data block, because for matrix multiplication the multiplicand matrix A can be split into a plurality of basic data blocks while the multiplier matrix B can serve as the broadcast data block. According to the definition of matrix multiplication, each row of the multiplicand matrix A performs an inner product operation with the multiplier matrix B, so the technical solution of the application splits the matrix A into M basic data blocks, each of which may be one row of the matrix A. Thus, for matrix multiplication, the comparatively time-consuming computation is executed by the plurality of basic units separately, so that in the inner product calculation the plurality of basic units can quickly compute the results in parallel, thereby reducing the calculation time; the shorter calculation time also reduces the working time of the chip device and therefore reduces its power consumption.
The effects of the technical solution provided by the present disclosure are illustrated below by a practical example. Fig. 2a is a schematic diagram of a matrix A multiplied by a vector B, where the matrix A has M rows and L columns and the vector B has L rows. Assume that the time required for an operator to calculate the inner product of one row of the matrix A with the vector B is t1. If a CPU or GPU calculates row by row, i.e., finishes one row before proceeding to the next, the time required by the CPU or GPU calculation method is T0 = M × t1. With the solution provided by the embodiment of the present disclosure, assuming there are M basic units, the matrix A is split into M basic data blocks, each basic data block being one row of the matrix A, and the M basic units perform the inner product operations simultaneously, so the computation time is t1; the total time required by the solution is T1 = t1 + t2 + t3, where t2 is the time for the main unit to split the data and t3 is the time required to process the operation results of the inner product operations to obtain the instruction result. Because the amount of computation for splitting the data and for processing the operation results is very small, t2 and t3 are very small, so T0 is much larger than T1 and the solution of the disclosed embodiment significantly reduces the computation time. The shorter operation time also reduces the working time of the chip device, and experiments show that the power consumption of the chip device is much lower when its operation time is very short, so the chip device provided by the present disclosure also has the advantage of saving power consumption.
There are various implementations of the main unit broadcasting the broadcast data block to the multiple basic units in step S203, and specifically, the implementation may be:
the first mode is to broadcast the broadcast data block to the plurality of basic units by one time. (the broadcast refers to performing "one-to-many" data transmission, i.e., transmitting the same data block to a plurality (all or a part of) basic units by the master unit at the same time.) for example, the matrix a is the matrix B, where the matrix B is a broadcast data block, and the matrix B is broadcast to the plurality of basic units at one time, and for example, in convolution, the input data is a broadcast data block, and the input data block is broadcast to the plurality of basic units at one time. This has the advantage that the amount of data transmission between the master unit and the base unit can be saved, i.e. all broadcast data can be transmitted to a plurality of base units via only one broadcast.
The second mode is to divide the broadcast data block into a plurality of partial broadcast data blocks and broadcast the plurality of partial broadcast data blocks to the plurality of basic units over multiple transmissions; for example, the matrix B is broadcast to the plurality of basic units in multiple transmissions, specifically N columns of the matrix B at a time. The advantage of this mode is that it reduces the configuration requirements of the basic unit: the register storage space configured for a basic unit cannot be large, and if the matrix B with a large data size were issued to the basic unit at one time, the basic unit would need a large register space to store the data. Because the number of basic units is large, increasing the register space would inevitably have a great influence on cost, so the scheme of broadcasting the broadcast data block in multiple transmissions is adopted; the basic unit then only needs to store the part of the broadcast data block received at each broadcast, thereby reducing cost.
It should be noted that either of the above two modes may also be adopted to distribute the plurality of basic data blocks to the plurality of basic units in step S203; the difference is only that the transmission mode is unicast and the transmitted data are basic data blocks.
The implementation method of the step S204 may specifically be:
if the broadcast data block is broadcast in the mode a and the basic data block is distributed in the mode a (as shown in fig. 3 a), the basic unit performs the inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, that is, one row of inner product operation is performed at a time, and the inner product processing result (one of the operation results) is sent to the main unit, and the main unit accumulates the inner product processing result. The above-mentioned method can reduce the data transmission quantity between main unit and basic unit, and then raise the speed of calculation.
If the second mode is adopted for the broadcast data block, in an optional technical solution, each time the basic unit receives a partial broadcast data block, it performs a partial inner product operation of the basic data block and the partial broadcast data block to obtain a partial processing result, and the basic unit sends the partial processing result to the main unit, which accumulates the processing results. In another alternative, if the number of basic data blocks received by the basic unit is n, the partial broadcast data block is multiplexed to perform the inner product operation of the partial broadcast data block with the n basic data blocks, obtaining n partial processing results; the basic unit sends the n processing results to the main unit, and the main unit accumulates the n processing results respectively. Of course, the accumulation can also be performed in the basic unit.
The above case concerns situations where the data amount of the broadcast data block is very large and the distribution data block is also large. Because the chip device is a hardware configuration, the number of configured basic units could in theory be very large, but in practice the number is limited, generally a few dozen basic units, and this number may keep changing (for example, increasing) as technology develops. In the matrix-by-matrix operation of a neural network, however, the matrix A may have thousands of rows and the matrix B may have thousands of columns, so it is impossible to issue the matrix B to the basic units in a single broadcast. One implementation is therefore to broadcast part of the data of the matrix B at a time, for example the first 5 columns; a similar manner may also be adopted for the matrix A. The basic unit then performs a partial inner product calculation each time, stores the result of the partial inner product calculation in its register, and after all inner product operations of the row have been performed, accumulates the results of all partial inner product calculations of that row to obtain an operation result, which is sent to the main unit. This approach has the advantage of increasing the speed of computation.
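A minimal sketch of this multi-pass scheme follows; it is illustrative only and assumes, for simplicity, that both the row of A held by a basic unit and the matrix B are delivered in chunks along their shared dimension L, so that each pass yields partial sums that must be accumulated. A local array stands in for the basic unit's register or on-chip cache, and only the accumulated result is returned to the main unit.

    import numpy as np

    def basic_unit_accumulate(a_row, B, chunk=5):
        # a_row : one basic data block (a row of matrix A) held by the basic unit
        # B     : broadcast data block, delivered 'chunk' rows at a time
        # Each pass computes partial inner products; the partial sums are kept in
        # a local buffer (standing in for the register / on-chip cache) and only
        # the accumulated row of results is sent back to the main unit.
        L, N = B.shape
        acc = np.zeros(N, dtype=np.float64)            # local partial-sum storage
        for start in range(0, L, chunk):
            a_part = a_row[start:start + chunk]        # part of the basic data block
            b_part = B[start:start + chunk, :]         # one partial broadcast data block
            acc += a_part @ b_part                     # partial inner products, accumulated
        return acc                                     # returned to the main unit when finished

    A = np.random.rand(8, 20)
    B = np.random.rand(20, 15)
    assert np.allclose(basic_unit_accumulate(A[0], B), A[0] @ B)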
Referring to fig. 3, fig. 3 provides a calculation method of a neural network. In this embodiment the calculation is described for matrix A × matrix B, which may be the schematic matrices shown in fig. 3a. For convenience of description, the calculation method of the neural network shown in fig. 3 is performed in the chip device shown in fig. 1b, which has 16 basic units; for convenience of description and allocation, the value of M shown in fig. 3a is set to 32, the value of N to 15, and the value of L to 20. It will of course be appreciated that the computing device may have any number of basic units. The method, as shown in fig. 3, includes the following steps:
step S301, the main unit receives the matrix A, the matrix B and the multiplication instruction A and B.
Step S302, the main unit determines that the matrix B is a broadcast data block and the matrix A is a distribution data block according to the multiplication instruction A and B, and splits the matrix A into 32 basic data blocks, wherein each basic data block is a row of data of the matrix A.
Step S303, the main unit uniformly allocates the 32 basic data blocks to the 16 basic units, that is, each basic unit receives 2 basic data blocks; the basic data blocks may be allocated in any non-repeating order.
The allocation manner of step S303 may adopt other allocation manners; for example, when the number of data blocks cannot be evenly allocated to each basic unit, the data blocks may be unevenly allocated to the basic units, or some data blocks that cannot be evenly divided may be split first and then evenly allocated. The embodiments of the present disclosure do not limit how the basic data blocks are allocated to the plurality of basic units.
In step S304, the main unit extracts partial data of the first few columns (for example, the first 5 columns) of the matrix B, and the matrix B broadcasts the partial data of the first 5 columns to the 16 basic units.
Step S305, the 16 basic units multiplex the partial data of the first 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 pre-processing results, and send the 32 × 5 pre-processing results to the main unit.
Step S306, the main unit extracts the partial data of the middle 5 columns of the matrix B, and the matrix B broadcasts the partial data of the middle 5 columns to the 16 basic units.
Step S307, the 16 basic units multiplex the partial data of the middle 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 intermediate processing results, and send the 32 × 5 intermediate processing results to the main unit.
Step S308, the main unit extracts the partial data of the last 5 columns of the matrix B, and the matrix B broadcasts the partial data of the last 5 columns to the 16 basic units.
Step S309, the 16 basic units multiplex the partial data of the last 5 columns twice, performing inner product and accumulation operations with their 2 basic data blocks to obtain 32 × 5 post-processing results, and send the 32 × 5 post-processing results to the main unit.
Step S310, the main unit combines the 32 × 5 pre-processing results, the 32 × 5 intermediate processing results and the 32 × 5 post-processing results in order to obtain a 32 × 15 matrix C, which is the instruction result of matrix A × matrix B.
The technical solution shown in fig. 3 splits the matrix A into 32 basic data blocks and then broadcasts the matrix B in batches, so that the basic units obtain the instruction result in batches. Since the inner product operations are split across 16 basic units for calculation, the calculation time can be greatly reduced, which gives the solution the advantages of short calculation time and low energy consumption.
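The flow of steps S301 to S310 can be mimicked in a few lines of NumPy; this is an illustrative simulation only, whereas in the patent the operations run on the main unit and the 16 basic units. The matrix A (32 × 20) is split into 32 rows, each basic unit holds 2 rows, and the matrix B (20 × 15) is broadcast 5 columns at a time.

    import numpy as np

    M, L, N_cols = 32, 20, 15
    num_units = 16
    A = np.random.rand(M, L).astype(np.float32)
    B = np.random.rand(L, N_cols).astype(np.float32)

    # S302/S303: split A into 32 basic data blocks (rows) and give 2 rows to each unit.
    rows_per_unit = [A[2 * i:2 * i + 2, :] for i in range(num_units)]

    C = np.zeros((M, N_cols), dtype=np.float32)
    # S304-S309: broadcast B in three batches of 5 columns each.
    for start in (0, 5, 10):
        b_part = B[:, start:start + 5]                  # partial broadcast data block
        for i, unit_rows in enumerate(rows_per_unit):
            # Each unit multiplexes the broadcast part with its 2 basic data blocks,
            # producing 2 x 5 results that are sent back to the main unit.
            C[2 * i:2 * i + 2, start:start + 5] = unit_rows @ b_part

    # S310: the main unit has combined all parts into the 32 x 15 instruction result.
    assert np.allclose(C, A @ B, atol=1e-4)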
Referring to fig. 1a, fig. 1a is a chip apparatus provided by the present disclosure, the chip apparatus including: the main unit is a hardware chip unit, and the basic unit is also a hardware chip unit;
the main unit is used for executing each continuous operation in the neural network operation and transmitting data with the basic unit;
and the basic unit is used for executing parallel acceleration operation in a neural network according to the data transmitted by the main unit and transmitting an operation result to the main unit.
The parallel accelerated operations include, but are not limited to: multiplication between data blocks, convolution between data blocks, and other operations that are large in scale and can be performed in parallel.
Each of the above successive operations includes, but is not limited to: accumulation operation, matrix transposition operation, data sorting operation, and the like.
The system comprises a main unit and a plurality of basic units, wherein the main unit is used for acquiring a data block to be calculated and an operation instruction, and dividing the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; splitting the distribution data block to obtain a plurality of basic data blocks, distributing the plurality of basic data blocks to the plurality of basic units, and broadcasting the broadcast data block to the plurality of basic units; the basic unit is used for executing inner product operation on the basic data block and the broadcast data block to obtain an operation result and sending the operation result to the main unit; and the main unit is used for processing the operation result to obtain the data block to be calculated and an instruction result of the operation instruction.
Optionally, the chip device further includes: a branch unit disposed between the main unit and the base unit; the branch unit is used for forwarding data.
Optionally, the master unit is specifically configured to broadcast the broadcast data block to the plurality of base units at a time.
Optionally, the basic unit is specifically configured to perform inner product processing on the basic data block and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main unit.
Optionally, the main unit is configured to, when the operation result is a result of inner product processing, accumulate the operation result to obtain an accumulation result, and arrange the accumulation result to obtain the data block to be calculated and an instruction result of the operation instruction.
Optionally, the master unit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of base units by multiple times.
Optionally, the basic unit is specifically configured to perform an inner product processing on the partial broadcast data block and the basic data block once to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main unit.
Optionally, the basic unit is specifically configured to multiplex the partial broadcast data block n times, performing the inner product operation of the partial broadcast data block with n basic data blocks to obtain n partial processing results, accumulate the n partial processing results respectively to obtain n partial operation results, and send the n partial operation results to the main unit, where n is an integer greater than or equal to 2.
The present disclosure also provides an application method of the chip apparatus shown in fig. 1a, which can be specifically used for performing one or any combination of matrix multiplication, matrix multiplication vector, convolution or full-link operation.
Specifically, the main unit may also perform pooling operations, regularization operations, and other neural network operation steps such as batch normalization (BN) and LRN.
The present embodiments also provide a chip comprising a chip arrangement as shown in fig. 1a or 1 b.
The specific embodiment of the present application also provides an intelligent device, which includes the above chip, and the chip is integrated with a chip apparatus as shown in fig. 1a or fig. 1 b. The smart devices include, but are not limited to: smart devices such as smart phones, tablet computers, personal digital assistants, smart watches, smart cameras, smart televisions, smart refrigerators, and the like are merely examples, and embodiments of the present application are not limited to specific manifestations of the devices.
The matrix-by-matrix operation described above can be seen in the description of the embodiment shown in fig. 3. And will not be described in detail herein.
Performing full connection operation by using a chip device;
If the input data of the fully-connected layer is a vector of length L (such as the vector B in "fig. 3a fully connected 1 - single sample"), i.e., the input of the neural network is a single sample, then the output of the fully-connected layer is a vector of length M and the weight of the fully-connected layer is an M × L matrix (such as the matrix A in "fig. 3a fully connected 1 - single sample"). In this case the weight matrix of the fully-connected layer is taken as the matrix A (i.e., the distribution data block), the input data is taken as the vector B (i.e., the broadcast data block), and the operation is performed according to the first method shown in fig. 2. The specific operation method may be:
If the input data of the fully-connected layer is a matrix (i.e., the input of the neural network is a batch of multiple samples operated on together), the input data of the fully-connected layer represents N input samples, each sample being a vector of length L, so the input data is represented by an L × N matrix (such as the matrix B in "fig. 3b fully connected 2 - multiple samples"); the output of the fully-connected layer is a vector of length M for each sample, so the output data of the fully-connected layer is an M × N matrix (such as the result matrix in "fig. 3b fully connected 2 - multiple samples"); and the weight of the fully-connected layer is an M × L matrix (such as the matrix A in "fig. 3b fully connected 2 - multiple samples"). In this case the weight matrix of the fully-connected layer is taken as the matrix A (i.e., the distribution data block) and the matrix of input data is taken as the matrix B (i.e., the broadcast data block), or the weight matrix of the fully-connected layer is taken as the matrix B (i.e., the broadcast data block) and the input data is taken as the matrix A (i.e., the distribution data block), and the operation is performed according to the first method shown in fig. 2.
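For illustration only (not the claimed hardware flow), the two fully-connected cases above reduce to the following NumPy expressions, with M, L and N being assumed example sizes.

    import numpy as np

    M, L, N = 8, 16, 4                       # assumed example sizes
    weight = np.random.rand(M, L)            # fully-connected weight, the M x L matrix A

    # Single sample: input is a vector of length L, output a vector of length M.
    x = np.random.rand(L)                    # vector B (broadcast data block)
    y = weight @ x                           # output vector of length M

    # Multiple samples: input is an L x N matrix, output an M x N matrix.
    X = np.random.rand(L, N)                 # matrix B (broadcast data block)
    Y = weight @ X                           # output data of the fully-connected layer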
Chip device
When the chip device is used to perform an artificial neural network operation, the input data of a convolutional layer, a pooling layer, or a regularization layer (also called a normalization layer, such as BN (Batch Normalization) or LRN (Local Response Normalization)) in the neural network is shown as "fig. 3d convolution 2 - input data" (for clarity, C = 5, H = 10 and W = 12 are used as an example of the three-dimensional data block representing each sample; in actual use the sizes of N, C, H and W are not limited to the values shown in fig. 3d). Each three-dimensional data block in fig. 3d represents the input data corresponding to one sample of the layer; the three dimensions of each three-dimensional data block are C, H and W, and there are N such three-dimensional data blocks in total.
When the calculation of the neural network layers is carried out, after the main unit receives input data, the data rearrangement circuit of the main unit is used for arranging the input data according to a certain sequence for each sample of the input data, and the sequence can be any sequence;
Optionally, the order may arrange the input data so that the C-dimension coordinate changes fastest, for example NHWC or NWHC. Here C represents the innermost dimension of the data block, N represents the outermost dimension, and H and W are the middle dimensions. This has the effect that the data along C are adjacent, which makes it easy to improve the parallelism of the operation and to operate on a plurality of feature maps in parallel.
The following explains how C, H and W are to be understood for different neural network operations. For convolution and pooling, H and W are the dimensions along which the operation window slides (examples of sliding the window in the W dimension are shown in "fig. 3e convolution 3 - slide a" and "fig. 3f convolution 3 - slide b", and an example of sliding the window in the H dimension is shown in fig. 3g). The size of the operation window coincides with the size of one of the M convolution kernels: the M convolution kernels are shown in fig. 3c, and each convolution kernel is a 5 × 3 × 3 three-dimensional data block, so the operation window is also a 5 × 3 × 3 three-dimensional data block. For the M convolution kernels shown in fig. 3c, KH corresponds to the H dimension of the input data and KW corresponds to the W dimension of the input data. The grey blocks in figs. 3e, 3f and 3g are the data used for one sliding position of the operation window; the sliding direction may be to slide along H first and then along W, or to slide along W first and then along H. Specifically, for convolution, the operation at each sliding position is an inner product of the data block represented by the grey blocks with each of the M convolution-kernel data blocks represented by "fig. 3c convolution 1 - convolution kernels"; convolution outputs one value per convolution kernel for each sliding-window position, i.e., there are M output values for each sliding-window position. For pooling, the operation at each sliding position is to select the maximum value (or compute the average value, etc.) over the H and W dimensions of the data block represented by the grey blocks (in the example of the figures, the 9 numbers lying on the same plane of the grey data block); pooling outputs C values for each sliding-window position. C is the remaining dimension of the three-dimensional data block of a single sample other than H and W, and N indicates that a total of N samples are operated on simultaneously in this layer. For the LRN regularization operation, the C dimension is defined as follows: each basic LRN operation selects a contiguous data block along the C dimension (i.e., a data block of size Y × 1 × 1), where Y is a value along the C dimension, Y is less than or equal to the maximum value of the C dimension, the first 1 corresponds to the H dimension and the second 1 corresponds to the W dimension; the remaining two dimensions are the H and W dimensions. That is, each LRN regularization operation performed on the three-dimensional data block of a sample operates on a contiguous portion of data with the same W coordinate and the same H coordinate but different C coordinates. For the regularization operation BN, the mean and variance (or standard deviation) are computed over all values that have the same C-dimension coordinate in the three-dimensional data blocks of the N samples.
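A compact NumPy sketch of the pooling behaviour described above follows; it is illustrative only, and the window size, stride and the choice of maximum versus average are assumed parameters. The window slides over the H and W dimensions of a single C × H × W sample and outputs C values per window position.

    import numpy as np

    def pool_single_sample(x, kh, kw, stride, mode="max"):
        # x : one sample as a C x H x W three-dimensional data block
        # For every sliding-window position, C output values are produced
        # (one per channel), taking the maximum or the average over H and W.
        C, H, W = x.shape
        out_h = (H - kh) // stride + 1
        out_w = (W - kw) // stride + 1
        out = np.zeros((C, out_h, out_w), dtype=x.dtype)
        for i in range(out_h):
            for j in range(out_w):
                window = x[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                if mode == "max":
                    out[:, i, j] = window.max(axis=(1, 2))
                else:                                     # average pooling
                    out[:, i, j] = window.mean(axis=(1, 2))
        return out

    sample = np.random.rand(5, 10, 12)    # C=5, H=10, W=12, matching the fig. 3d example
    print(pool_single_sample(sample, kh=3, kw=3, stride=1).shape)   # (5, 8, 10)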
In the above "fig. 3 c-fig. 3 g", a square is used to represent a numerical value, which may also be referred to as a weight; the numbers used in the diagram are only for illustration, and the dimension data may be any number in practical cases (including a case where a certain dimension is 1, in which case the four-dimensional data block automatically becomes a three-dimensional data block, for example, when the number of samples calculated at the same time is 1, the input data is a three-dimensional data block, and for example, when the number of convolution kernels is 1, the volume sum data is a three-dimensional data block). Performing convolution operation between input data B and a convolution kernel A by using the chip device;
For a convolutional layer, the weights (all convolution kernels) are shown as "fig. 3c convolution 1 - convolution kernels". The number of convolution kernels is denoted M, and each convolution kernel consists of C matrices of KH rows and KW columns, so the weights of the convolutional layer can be represented as a four-dimensional data block with dimensions M, C, KH and KW. The input data of the convolutional layer is a four-dimensional data block composed of N three-dimensional data blocks, each three-dimensional data block consisting of C feature matrices of H rows and W columns (i.e., a data block whose four dimensions are N, C, H and W), as shown in "fig. 3d convolution 2 - input data". The weight of each of the M convolution kernels is distributed from the main unit to one of the K basic units and stored in the on-chip cache and/or register of that basic unit (at this point the M convolution kernels form the distribution data block, and each convolution kernel may be a basic data block; in practical applications the basic data block may also be changed to a smaller granularity, for example one planar matrix of a convolution kernel). The specific distribution method may be: if the number of convolution kernels M is less than K, one convolution kernel weight is distributed to each of M basic units; if the number of convolution kernels M is greater than K, the weights of one or more convolution kernels are distributed to each basic unit. (The set of convolution kernel weights distributed to the i-th basic unit is denoted Ai, containing a total of Mi convolution kernels.) In each basic unit, for example the i-th basic unit, the received convolution kernel weights Ai distributed by the main unit are stored in its register and/or on-chip cache. Each part of the input data (i.e., the operation window shown in fig. 3e, fig. 3f or fig. 3g) is transmitted to each basic unit in a broadcast manner (the broadcast may use either of the modes described above); the data of the operation window may be broadcast to all basic units over multiple broadcasts, specifically broadcasting part of the operation window each time, for example one planar matrix at a time. Taking fig. 3e as an example, a KH × KW matrix of one C plane may be broadcast each time; in practical applications, the data of the first n rows or the first n columns of a KH × KW matrix of one C plane may also be broadcast at a time. The present disclosure does not limit the transmission manner of the partial data or the arrangement of the partial data. The arrangement of the input data may be changed into an arbitrary dimension order, and the parts of the input data are then broadcast to the basic units in sequence. Optionally, the distribution data, i.e., the convolution kernels, may also be sent in a manner similar to the operation windows of the input data, which is not repeated here. Optionally, the arrangement of the input data is changed into a loop in which C is the innermost layer. This makes the data along C adjacent, thereby increasing the parallelism of the convolution operation and facilitating parallel operation on a plurality of feature maps.
Optionally, the arrangement of the input data is changed to the dimension order NHWC or NWHC. Each basic unit, for example the i-th basic unit, calculates the inner product of the convolution kernels in the weights Ai with the corresponding part (i.e., the operation window) of the received broadcast data; the data of the corresponding part of the weights Ai can be read directly from the on-chip cache for use, or read into a register first for multiplexing. The results of the inner product operations of each basic unit are accumulated and transmitted back to the main unit. Alternatively, the partial sums obtained by each inner product operation executed by a basic unit may be transmitted back to the main unit for accumulation; or the partial sums obtained by each inner product operation executed by the basic unit may be stored in the register and/or on-chip cache of the basic unit and transmitted back to the main unit after the accumulation is finished; or the partial sums may in some cases be stored in the register and/or on-chip cache of the basic unit for accumulation and in other cases transmitted to the main unit for accumulation, and transmitted back to the main unit after the accumulation is finished.
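The distributed convolution described above can be approximated in NumPy as follows; this is an illustrative sketch under assumed sizes, in which the per-unit kernel sets Ai and the window broadcasts are simulated with plain loops rather than hardware circuits.

    import numpy as np

    def conv_single_sample(x, kernels):
        # x        : one input sample, C x H x W
        # kernels  : M convolution kernels, M x C x KH x KW (the distribution data block)
        # Each sliding-window position is "broadcast"; every kernel contributes one
        # inner product per position, giving M output values per window position.
        M, C, KH, KW = kernels.shape
        _, H, W = x.shape
        out = np.zeros((M, H - KH + 1, W - KW + 1), dtype=x.dtype)
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                window = x[:, i:i + KH, j:j + KW]           # the broadcast operation window
                for m in range(M):                           # kernels spread over basic units
                    out[m, i, j] = np.sum(window * kernels[m])   # inner product and accumulation
        return out

    x = np.random.rand(5, 10, 12)          # C=5, H=10, W=12
    kernels = np.random.rand(4, 5, 3, 3)   # M=4 kernels of size 5 x 3 x 3
    print(conv_single_sample(x, kernels).shape)   # (4, 8, 10)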
Method for implementing BLAS (Basic Linear Algebra Subprograms) functions by using the chip device
GEMM: a GEMM calculation refers to the operation of matrix-matrix multiplication in the BLAS library. The general representation of this operation is C = alpha * op(A) * op(B) + beta * C, where A and B are the two input matrices, C is the output matrix, alpha and beta are scalars, op represents some operation on the matrix A or B, and additional integer parameters describe the width and height of the matrices A and B;
the steps of using the device to realize GEMM calculation are as follows:
carrying out the respective op operations on the input matrix A and the matrix B; the op operation may be a transpose of the matrix, or of course another operation such as a non-linear function operation or pooling; the op operation of a matrix is realized by using the vector operation function of the main unit, and if the op of a certain matrix is null, the main unit performs no operation on that matrix;
the matrix multiplication between op(A) and op(B) is completed by adopting the method shown in fig. 2;
each value in the result of op(A) * op(B) is multiplied by alpha by using the vector operation function of the main unit;
the corresponding positions of the matrices alpha * op(A) * op(B) and beta * C are added by using the vector operation function of the main unit.
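A reference NumPy sketch of the GEMM steps above follows; it is illustrative only, and op is taken to be transposition or the identity, as suggested by the text.

    import numpy as np

    def gemm(alpha, A, B, beta, C, op_a=None, op_b=None):
        # C = alpha * op(A) * op(B) + beta * C
        # op_a / op_b: None (no operation) or "T" (transpose), realized here in
        # software in place of the main unit's vector operation circuits.
        opA = A.T if op_a == "T" else A
        opB = B.T if op_b == "T" else B
        return alpha * (opA @ opB) + beta * C

    A = np.random.rand(4, 6)
    B = np.random.rand(5, 6)
    C = np.random.rand(4, 5)
    out = gemm(2.0, A, B, 0.5, C, op_b="T")     # C = 2 * A * B^T + 0.5 * C
    assert out.shape == (4, 5)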
GEMV
the GEMV calculation refers to the operation of matrix-vector multiplication in the BLAS library. The general representation of this operation is C = alpha * op(A) * B + beta * C, where A is the input matrix, B is the input vector, C is the output vector, alpha and beta are scalars, and op represents some operation on the matrix A;
The steps of implementing the GEMV calculation using the chip device are as follows:
performing the corresponding op operation on the input matrix A; completing, by the chip device, the matrix-vector multiplication between the matrix op(A) and the vector B by using the method shown in FIG. 2; multiplying each value in the result of op(A) * B by alpha by using the vector operation function of the main unit; and adding the corresponding positions of alpha * op(A) * B and beta * C by using the vector operation function of the main unit.
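These steps can likewise be sketched in NumPy; the function name gemv and the sizes below are illustrative assumptions, not part of the method.

```python
import numpy as np

def gemv(alpha, A, b, beta, c, op_a=None):
    """Sketch of C = alpha * op(A) @ B + beta * C for the GEMV steps above."""
    opA = A if op_a is None else op_a(A)   # op operation on the input matrix A
    prod = opA @ b                         # matrix-vector multiplication (FIG. 2 step)
    return alpha * prod + beta * c         # multiply by alpha, then add beta * C

A = np.random.rand(4, 3)
b = np.random.rand(3)
c = np.random.rand(4)
out = gemv(1.5, A, b, 0.25, c)
```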
Method for realizing activation function by chip device
An activation function generally refers to performing a non-linear operation on each number in a block of data (which may be a vector or a multi-dimensional matrix). For example, the activation function may be y = max(m, x), where x is the input value, y is the output value, and m is a constant; the activation function may also be y = tanh(x), where x is the input value and y is the output value; the activation function may also be y = sigmoid(x), where x is the input value and y is the output value; the activation function may also be a piecewise linear function; the activation function may be any function that takes one number as input and outputs one number.
When implementing an activation function, the chip device inputs a vector by using the vector computation function of the main unit and computes the activation vector of that vector; the main unit passes each value of the input vector through the activation function to obtain the value at the corresponding position of the output vector (the input of the activation function is a single value, and its output is also a single value);
sources of the above input vectors include, but are not limited to: external data of the chip device, and calculation result data of the basic unit forwarded by the branch unit of the chip device.
The calculation result data may specifically be the result of a matrix-multiply-vector operation, or the result of a matrix-multiply-matrix operation; the input data may also be the calculation result obtained after the main unit performs the bias addition.
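A minimal NumPy sketch of such an element-wise activation applied to an input vector is given below; the function name apply_activation, the supported activation kinds and the sample vector are illustrative assumptions, not the device implementation.

```python
import numpy as np

def apply_activation(x, kind="relu_m", m=0.0):
    """Apply an activation function element-wise to the input vector x."""
    if kind == "relu_m":                       # y = max(m, x)
        return np.maximum(m, x)
    if kind == "tanh":                         # y = tanh(x)
        return np.tanh(x)
    if kind == "sigmoid":                      # y = sigmoid(x)
        return 1.0 / (1.0 + np.exp(-x))
    raise ValueError("unsupported activation kind")

v = np.array([-1.0, 0.5, 2.0], dtype=np.float32)   # e.g. a matrix-multiply-vector result
print(apply_activation(v, "relu_m", m=0.0))
print(apply_activation(v, "tanh"))
```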
Method for implementing the bias operation using the chip device
The function of adding two vectors or two matrices can be realized by using the main unit; the function of adding a vector to each row, or to each column, of a matrix can also be realized by using the main unit.
Optionally, the matrix may be the result of a matrix-multiply-matrix operation performed by the device; the matrix may be the result of a matrix-multiply-vector operation performed by the device; the matrix may also be data received externally by the main unit of the device. The vector may likewise be data received externally by the main unit of the device.
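The bias addition described above amounts to element-wise and broadcast additions. The NumPy sketch below (with illustrative shapes) shows the three cases: adding two matrices, adding a vector to each row of a matrix, and adding a vector to each column of a matrix.

```python
import numpy as np

M1 = np.random.rand(4, 3)        # e.g. a matrix-multiply result produced by the device
M2 = np.random.rand(4, 3)        # a second matrix of the same shape
v_row = np.random.rand(3)        # bias vector added to each row of the matrix
v_col = np.random.rand(4)        # bias vector added to each column of the matrix

out_sum  = M1 + M2                          # adding two matrices
out_rows = M1 + v_row                       # adding a vector to each row
out_cols = M1 + v_col[:, np.newaxis]        # adding a vector to each column
```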
The above input data and calculation result data are only examples; in practical applications, the input data and the calculation result data may also be data of other types or from other sources.
It should be noted that, for simplicity of description, the foregoing method embodiments are described as a series or combination of acts, but those skilled in the art will appreciate that the present disclosure is not limited by the order of the acts described, because in accordance with the present disclosure some steps may be performed in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments, and that the acts and modules involved are not necessarily required by the present disclosure.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units is only a division by logical function, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated units/modules are all implemented in hardware. For example, the hardware may be circuitry, including digital circuitry, analog circuitry, and so forth. Physical implementations of hardware structures include, but are not limited to, physical devices including, but not limited to, transistors, memristors, and the like. The computing module in the computing device may be any suitable hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC, and the like. The memory unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM, DRAM, SRAM, EDRAM, HBM, HMC, etc.
The illustrated elements may or may not be physically separate, may be located in one place, or may be distributed across multiple network elements. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only and is not intended to be exhaustive or to limit the present disclosure to the precise forms disclosed. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope based on the idea of the present disclosure. In summary, the contents of this specification should not be construed as limiting the present disclosure.

Claims (13)

1. A pooling operation method, wherein the method is applied to a chip device, and the method comprises the following steps:
the chip device receives input data, wherein the input data are four-dimensional data, and the four dimensions are as follows: n, H, W, C, respectively; the C dimension is a height direction dimension of the four-dimensional data;
the chip device converts the placement of the input data so that the C dimension is the innermost layer, to obtain input data with C as the innermost layer;
the chip device performs a pooling operation on the input data with C as the innermost layer to obtain a pooling calculation result;
the chip device comprises a main circuit; the chip device performing the pooling operation on the input data with C as the innermost layer to obtain the pooling calculation result specifically comprises:
the main circuit obtains a sliding operation window in the input data by taking H and W as sliding directions to obtain the data used for operation, and each sliding of the window performs a pooling operation on the data used for operation obtained by the sliding window to obtain one value of the calculation result.
2. The method according to claim 1, wherein the transforming the placement of the input data into the input data with the C dimension as the innermost layer to obtain the input data with the innermost layer as C specifically comprises:
converting the placement of the input data into a cycle with the C dimension as the innermost layer to obtain input data NHWC or input data NWHC.
3. The method according to claim 1, wherein obtaining the data used for operation through the sliding operation window by taking H and W as sliding directions specifically comprises:
obtaining the data of the sliding operation window by sliding along the dimension H first and, after that sliding is finished, taking W as the sliding direction;
or obtaining the data of the sliding operation window by sliding along the dimension W first and, after that sliding is finished, taking H as the sliding direction.
4. The method of claim 1,
the pooling operation is a maximum operation or an average operation.
5. The method according to claim 1, 3 or 4, wherein the chip device further comprises: a branch circuit connecting the main circuit and a slave circuit, the method further comprising:
forwarding, by the branch circuit, data between the main circuit and the slave circuit.
6. The method according to claim 1, 3 or 4, wherein the main circuit comprises: one or any combination of a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit or a data rearrangement circuit;
the slave circuit comprises: an inner product operation circuit or an accumulator circuit, or any combination thereof.
7. A chip apparatus, the chip apparatus comprising:
the chip apparatus is configured to receive input data, wherein the input data is four-dimensional data whose four dimensions are N, H, W and C respectively; the C dimension is a height direction dimension of the four-dimensional data;
the chip apparatus is further configured to convert the placement of the input data so that the C dimension is the innermost layer, to obtain input data with C as the innermost layer;
the chip apparatus is configured to perform a pooling operation on the input data with C as the innermost layer to obtain a pooling calculation result;
the chip apparatus further comprises: a main circuit and k slave circuits;
the main circuit is specifically configured to obtain a sliding operation window in the input data by taking H and W as sliding directions to obtain the data used for operation, and each sliding of the window performs a pooling operation on the data used for operation obtained by the sliding window to obtain one value of the operation result.
8. The chip apparatus according to claim 7,
the chip apparatus is specifically configured to convert the placement of the input data into a cycle with the C dimension as the innermost layer, to obtain input data NHWC or input data NWHC.
9. The chip apparatus according to claim 7, wherein obtaining the data used for operation through the sliding operation window by taking H and W as sliding directions specifically comprises:
obtaining the data of the sliding operation window by sliding along the dimension H first and, after that sliding is finished, taking W as the sliding direction;
or obtaining the data of the sliding operation window by sliding along the dimension W first and, after that sliding is finished, taking H as the sliding direction.
10. The chip apparatus according to claim 7,
the pooling operation is a maximum operation or an average operation.
11. The chip apparatus according to claim 7, 9 or 10, wherein the chip apparatus further comprises: a branch circuit connecting the main circuit and the slave circuit,
the branch circuit is used for forwarding data between the main circuit and the slave circuit.
12. The chip apparatus according to claim 7, 9 or 10, wherein the main circuit comprises: one or any combination of a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit or a data rearrangement circuit;
the slave circuit comprises: an inner product operation circuit or an accumulator circuit, or any combination thereof.
13. A smart device, wherein the smart device comprises a chip integrating the chip apparatus according to any one of claims 7 to 12.
CN201910102972.4A 2017-08-31 2017-08-31 Pooling operation method and device Active CN109902804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910102972.4A CN109902804B (en) 2017-08-31 2017-08-31 Pooling operation method and device

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN201910102972.4A CN109902804B (en) 2017-08-31 2017-08-31 Pooling operation method and device
PCT/CN2017/099991 WO2019041251A1 (en) 2017-08-31 2017-08-31 Chip device and related product
CN201780002287.3A CN109729734B8 (en) 2017-08-31 2017-08-31 Chip device and related product

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201780002287.3A Division CN109729734B8 (en) 2017-08-31 2017-08-31 Chip device and related product

Publications (2)

Publication Number Publication Date
CN109902804A CN109902804A (en) 2019-06-18
CN109902804B true CN109902804B (en) 2020-12-18

Family

ID=65436282

Family Applications (8)

Application Number Title Priority Date Filing Date
CN201910530860.9A Active CN110245751B (en) 2017-08-31 2017-08-31 GEMM operation method and device
CN201910534118.5A Active CN110231958B (en) 2017-08-31 2017-08-31 Matrix multiplication vector operation method and device
CN201910102972.4A Active CN109902804B (en) 2017-08-31 2017-08-31 Pooling operation method and device
CN201910534528.XA Active CN110245752B (en) 2017-08-31 2017-08-31 Method and device for carrying out full-connection operation by using chip device
CN201910534527.5A Active CN110083390B (en) 2017-08-31 2017-08-31 GEMV operation method and device
CN202010628834.2A Pending CN111860815A (en) 2017-08-31 2017-08-31 Convolution operation method and device
CN201910531031.2A Active CN110222308B (en) 2017-08-31 2017-08-31 Matrix multiplication matrix operation method and device
CN201780002287.3A Active CN109729734B8 (en) 2017-08-31 2017-08-31 Chip device and related product

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN201910530860.9A Active CN110245751B (en) 2017-08-31 2017-08-31 GEMM operation method and device
CN201910534118.5A Active CN110231958B (en) 2017-08-31 2017-08-31 Matrix multiplication vector operation method and device

Family Applications After (5)

Application Number Title Priority Date Filing Date
CN201910534528.XA Active CN110245752B (en) 2017-08-31 2017-08-31 Method and device for carrying out full-connection operation by using chip device
CN201910534527.5A Active CN110083390B (en) 2017-08-31 2017-08-31 GEMV operation method and device
CN202010628834.2A Pending CN111860815A (en) 2017-08-31 2017-08-31 Convolution operation method and device
CN201910531031.2A Active CN110222308B (en) 2017-08-31 2017-08-31 Matrix multiplication matrix operation method and device
CN201780002287.3A Active CN109729734B8 (en) 2017-08-31 2017-08-31 Chip device and related product

Country Status (7)

Country Link
US (7) US11409535B2 (en)
EP (6) EP3654209A1 (en)
JP (1) JP7065877B2 (en)
KR (3) KR102467688B1 (en)
CN (8) CN110245751B (en)
TW (1) TWI749249B (en)
WO (1) WO2019041251A1 (en)

Families Citing this family (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109992743B (en) * 2017-12-29 2020-06-16 华为技术有限公司 Matrix multiplier
CN116991226A (en) * 2018-02-14 2023-11-03 上海寒武纪信息科技有限公司 Control device, method and equipment of processor
CN110210610B (en) * 2018-03-27 2023-06-20 腾讯科技(深圳)有限公司 Convolution calculation accelerator, convolution calculation method and convolution calculation device
US11277455B2 (en) 2018-06-07 2022-03-15 Mellanox Technologies, Ltd. Streaming system
US20200106828A1 (en) * 2018-10-02 2020-04-02 Mellanox Technologies, Ltd. Parallel Computation Network Device
CN110162799B (en) * 2018-11-28 2023-08-04 腾讯科技(深圳)有限公司 Model training method, machine translation method, and related devices and equipment
US11175946B2 (en) * 2018-12-06 2021-11-16 Advanced Micro Devices, Inc. Pipelined matrix multiplication at a graphics processing unit
US11657119B2 (en) * 2018-12-10 2023-05-23 Advanced Micro Devices, Inc. Hardware accelerated convolution
US11625393B2 (en) 2019-02-19 2023-04-11 Mellanox Technologies, Ltd. High performance computing system
EP3699770A1 (en) 2019-02-25 2020-08-26 Mellanox Technologies TLV Ltd. Collective communication system and methods
JPWO2021009901A1 (en) * 2019-07-18 2021-09-13 技術研究組合光電子融合基盤技術研究所 Parallel computing method and system
US11481471B2 (en) * 2019-08-16 2022-10-25 Meta Platforms, Inc. Mapping convolution to a matrix processor unit
CN110516793B (en) * 2019-08-27 2022-06-17 Oppo广东移动通信有限公司 Pooling processing method and device and storage medium
CN110826687B (en) * 2019-08-30 2023-11-21 安谋科技(中国)有限公司 Data processing method and device, medium and system thereof
US20210150313A1 (en) * 2019-11-15 2021-05-20 Samsung Electronics Co., Ltd. Electronic device and method for inference binary and ternary neural networks
KR20210071471A (en) * 2019-12-06 2021-06-16 삼성전자주식회사 Apparatus and method for performing matrix multiplication operation of neural network
CN111161705B (en) * 2019-12-19 2022-11-18 寒武纪(西安)集成电路有限公司 Voice conversion method and device
CN111126582B (en) * 2019-12-20 2024-04-05 上海寒武纪信息科技有限公司 Data processing method and related product
US11750699B2 (en) 2020-01-15 2023-09-05 Mellanox Technologies, Ltd. Small message aggregation
US11252027B2 (en) 2020-01-23 2022-02-15 Mellanox Technologies, Ltd. Network element supporting flexible data reduction operations
US10713493B1 (en) * 2020-02-06 2020-07-14 Shenzhen Malong Technologies Co., Ltd. 4D convolutional neural networks for video recognition
CN113743598B (en) * 2020-05-27 2023-08-04 杭州海康威视数字技术股份有限公司 Method and device for determining operation mode of AI chip
US11876885B2 (en) 2020-07-02 2024-01-16 Mellanox Technologies, Ltd. Clock queue with arming and/or self-arming features
CN112491555B (en) * 2020-11-20 2022-04-05 山西智杰软件工程有限公司 Medical electronic signature processing method and electronic equipment
CN112416433B (en) * 2020-11-24 2023-01-17 中科寒武纪科技股份有限公司 Data processing device, data processing method and related product
US11556378B2 (en) 2020-12-14 2023-01-17 Mellanox Technologies, Ltd. Offloading execution of a multi-task parameter-dependent operation to a network device
CN112953701B (en) * 2021-02-04 2023-10-31 沈阳建筑大学 Four-dimensional chaotic circuit device
CN112799598B (en) * 2021-02-08 2022-07-15 清华大学 Data processing method, processor and electronic equipment
CN113240570B (en) * 2021-04-13 2023-01-06 华南理工大学 GEMM operation accelerator and GoogLeNet-based image processing acceleration method
CN112990370B (en) * 2021-04-26 2021-09-10 腾讯科技(深圳)有限公司 Image data processing method and device, storage medium and electronic equipment
CN115481713A (en) * 2021-06-15 2022-12-16 瑞昱半导体股份有限公司 Method for improving convolution neural network to calculate
KR20230068572A (en) * 2021-11-11 2023-05-18 삼성전자주식회사 Connection circuits in memory arrays
CN116150555A (en) * 2021-11-19 2023-05-23 中科寒武纪科技股份有限公司 Computing device, method for implementing convolution operation by utilizing computing device and related product
CN114936633B (en) * 2022-06-15 2023-06-30 北京爱芯科技有限公司 Data processing unit for transposition operation and image transposition operation method
US11922237B1 (en) 2022-09-12 2024-03-05 Mellanox Technologies, Ltd. Single-step collective operations


Family Cites Families (86)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5023833A (en) * 1987-12-08 1991-06-11 California Institute Of Technology Feed forward neural network for unary associative memory
US5956703A (en) * 1995-07-28 1999-09-21 Delco Electronics Corporation Configurable neural network integrated circuit
JPH117438A (en) * 1997-06-18 1999-01-12 Fuji Xerox Co Ltd Method and device for processing product sum operation and recording medium
JP2001188767A (en) * 1999-12-28 2001-07-10 Fuji Xerox Co Ltd Neutral network arithmetic unit and method
US7672952B2 (en) * 2000-07-13 2010-03-02 Novell, Inc. System and method of semantic correlation of rich content
US6925479B2 (en) * 2001-04-30 2005-08-02 Industrial Technology Research Institute General finite-field multiplier and method of the same
US7065544B2 (en) * 2001-11-29 2006-06-20 Hewlett-Packard Development Company, L.P. System and method for detecting repetitions in a multimedia stream
US7737994B1 (en) * 2003-09-26 2010-06-15 Oracle America, Inc. Large-kernel convolution using multiple industry-standard graphics accelerators
US20050125477A1 (en) * 2003-12-04 2005-06-09 Genov Roman A. High-precision matrix-vector multiplication on a charge-mode array with embedded dynamic memory and stochastic method thereof
US7634137B2 (en) * 2005-10-14 2009-12-15 Microsoft Corporation Unfolded convolution for fast feature extraction
GB2453263A (en) * 2006-05-16 2009-04-01 Douglas S Greer System and method for modeling the neocortex and uses therefor
US8644643B2 (en) * 2006-06-14 2014-02-04 Qualcomm Incorporated Convolution filtering in a graphics processor
JP4942095B2 (en) * 2007-01-25 2012-05-30 インターナショナル・ビジネス・マシーンズ・コーポレーション Technology that uses multi-core processors to perform operations
US20080288756A1 (en) * 2007-05-18 2008-11-20 Johnson Timothy J "or" bit matrix multiply vector instruction
US8190543B2 (en) * 2008-03-08 2012-05-29 Tokyo Electron Limited Autonomous biologically based learning tool
WO2010043401A2 (en) * 2008-10-15 2010-04-22 Martin Vorbach Data processing device
US20100122070A1 (en) * 2008-11-07 2010-05-13 Nokia Corporation Combined associative and distributed arithmetics for multiple inner products
US20110025816A1 (en) * 2009-07-31 2011-02-03 Microsoft Corporation Advertising as a real-time video call
US8577950B2 (en) * 2009-08-17 2013-11-05 International Business Machines Corporation Matrix multiplication operations with data pre-conditioning in a high performance computing architecture
US8583896B2 (en) * 2009-11-13 2013-11-12 Nec Laboratories America, Inc. Massively parallel processing core with plural chains of processing elements and respective smart memory storing select data received from each chain
US20110314256A1 (en) * 2010-06-18 2011-12-22 Microsoft Corporation Data Parallel Programming Model
US8577820B2 (en) * 2011-03-04 2013-11-05 Tokyo Electron Limited Accurate and fast neural network training for library-based critical dimension (CD) metrology
US10078620B2 (en) * 2011-05-27 2018-09-18 New York University Runtime reconfigurable dataflow processor with multi-port memory access module
CN102214160B (en) * 2011-07-08 2013-04-17 中国科学技术大学 Single-accuracy matrix multiplication optimization method based on loongson chip 3A
CN103631761B (en) * 2012-08-29 2018-02-27 睿励科学仪器(上海)有限公司 Parallel processing architecture carries out matrix operation and for the method for strict ripple coupling analysis
DE102013104567A1 (en) * 2013-05-03 2014-11-06 Infineon Technologies Ag Chip arrangement, chip card arrangement and method for producing a chip arrangement
CN103440121B (en) * 2013-08-20 2016-06-29 中国人民解放军国防科学技术大学 A kind of triangular matrix multiplication vectorization method of vector processor-oriented
DE102013109200A1 (en) * 2013-08-26 2015-02-26 Infineon Technologies Austria Ag Chip, chip arrangement and method of manufacturing a chip
CN107451077B (en) * 2013-08-27 2020-08-18 珠海艾派克微电子有限公司 Test head, chip processing device and method for displaying chip type
US20150324686A1 (en) * 2014-05-12 2015-11-12 Qualcomm Incorporated Distributed model learning
CN104036451B (en) * 2014-06-20 2018-12-11 深圳市腾讯计算机系统有限公司 Model method for parallel processing and device based on multi-graphics processor
CN104317352B (en) * 2014-10-13 2017-10-24 中国科学院光电技术研究所 A kind of adaptive optics control system quickly goes tilt component processing method
CN104346318B (en) * 2014-10-15 2017-03-15 中国人民解放军国防科学技术大学 Matrix Multiplication accelerated method towards general multi-core DSP
CN104463324A (en) * 2014-11-21 2015-03-25 长沙马沙电子科技有限公司 Convolution neural network parallel processing method based on large-scale high-performance cluster
CN105701120B (en) * 2014-11-28 2019-05-03 华为技术有限公司 The method and apparatus for determining semantic matching degree
US10489703B2 (en) 2015-05-20 2019-11-26 Nec Corporation Memory efficiency for convolutional neural networks operating on graphics processing units
US10417555B2 (en) * 2015-05-29 2019-09-17 Samsung Electronics Co., Ltd. Data-optimized neural network traversal
CN104866904B (en) * 2015-06-16 2019-01-01 中电科软件信息服务有限公司 A kind of BP neural network parallel method of the genetic algorithm optimization based on spark
CN106293893B (en) * 2015-06-26 2019-12-06 阿里巴巴集团控股有限公司 Job scheduling method and device and distributed system
CN105005911B (en) * 2015-06-26 2017-09-19 深圳市腾讯计算机系统有限公司 The arithmetic system and operation method of deep neural network
CN105608490B (en) * 2015-07-29 2018-10-26 上海磁宇信息科技有限公司 Cellular array computing system and communication means therein
US10970617B2 (en) * 2015-08-21 2021-04-06 Institute Of Automation Chinese Academy Of Sciences Deep convolutional neural network acceleration and compression method based on parameter quantification
CN105260776B (en) * 2015-09-10 2018-03-27 华为技术有限公司 Neural network processor and convolutional neural networks processor
CN106548124B (en) * 2015-09-17 2021-09-07 松下知识产权经营株式会社 Theme estimation system and theme estimation method
EP3154001B1 (en) * 2015-10-08 2019-07-17 VIA Alliance Semiconductor Co., Ltd. Neural network unit with neural memory and array of neural processing units that collectively shift row of data received from neural memory
CN106447035B (en) * 2015-10-08 2019-02-26 上海兆芯集成电路有限公司 Processor with variable rate execution unit
CN105608056A (en) * 2015-11-09 2016-05-25 南京大学 Flink based large-scale matrix parallelization computing method
CN105373517A (en) * 2015-11-09 2016-03-02 南京大学 Spark-based distributed matrix inversion parallel operation method
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN105488565A (en) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 Calculation apparatus and method for accelerator chip accelerating deep neural network algorithm
US10482380B2 (en) * 2015-12-30 2019-11-19 Amazon Technologies, Inc. Conditional parallel processing in fully-connected neural networks
CN111353588B (en) * 2016-01-20 2024-03-05 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network reverse training
CN107506828B (en) * 2016-01-20 2020-11-03 中科寒武纪科技股份有限公司 Artificial neural network computing device and method for sparse connection
CN106991476B (en) * 2016-01-20 2020-04-10 中科寒武纪科技股份有限公司 Apparatus and method for performing artificial neural network forward operations
CN108416436B (en) * 2016-04-18 2021-06-01 中国科学院计算技术研究所 Method and system for neural network partitioning using multi-core processing module
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
CN105956659B (en) * 2016-05-11 2019-11-22 北京比特大陆科技有限公司 Data processing equipment and system, server
US10796220B2 (en) * 2016-05-24 2020-10-06 Marvell Asia Pte, Ltd. Systems and methods for vectorized FFT for multi-dimensional convolution operations
CN109416754B (en) * 2016-05-26 2020-06-23 多伦多大学管理委员会 Accelerator for deep neural network
CN106126481B (en) * 2016-06-29 2019-04-12 华为技术有限公司 A kind of computing system and electronic equipment
CN106203621B (en) * 2016-07-11 2019-04-30 北京深鉴智能科技有限公司 The processor calculated for convolutional neural networks
CN106228240B (en) * 2016-07-30 2020-09-01 复旦大学 Deep convolution neural network implementation method based on FPGA
US10891538B2 (en) * 2016-08-11 2021-01-12 Nvidia Corporation Sparse convolutional neural network accelerator
US20180046903A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
CN106407561B (en) * 2016-09-19 2020-07-03 复旦大学 Method for dividing parallel GPDT algorithm on multi-core SOC
CN106650922B (en) * 2016-09-29 2019-05-03 清华大学 Hardware neural network conversion method, computing device, software and hardware cooperative system
US9779786B1 (en) * 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator
KR102224510B1 (en) * 2016-12-09 2021-03-05 베이징 호라이즌 인포메이션 테크놀로지 컴퍼니 리미티드 Systems and methods for data management
CN106844294B (en) * 2016-12-29 2019-05-03 华为机器有限公司 Convolution algorithm chip and communication equipment
US20180189229A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Deep convolutional network heterogeneous architecture
IT201700008949A1 (en) * 2017-01-27 2018-07-27 St Microelectronics Srl OPERATING PROCEDURE FOR NEURAL NETWORKS, NETWORK, EQUIPMENT AND CORRESPONDENT COMPUTER PRODUCT
CN106940815B (en) * 2017-02-13 2020-07-28 西安交通大学 Programmable convolutional neural network coprocessor IP core
CN106951395B (en) * 2017-02-13 2018-08-17 上海客鹭信息技术有限公司 Parallel convolution operations method and device towards compression convolutional neural networks
US11132599B2 (en) * 2017-02-28 2021-09-28 Microsoft Technology Licensing, Llc Multi-function unit for programmable hardware nodes for neural network processing
CN107066239A (en) * 2017-03-01 2017-08-18 智擎信息系统(上海)有限公司 A kind of hardware configuration for realizing convolutional neural networks forward calculation
US10528147B2 (en) * 2017-03-06 2020-01-07 Microsoft Technology Licensing, Llc Ultrasonic based gesture recognition
CN110312992A (en) * 2017-03-20 2019-10-08 英特尔公司 For piece matrix multiplication and cumulative system, method and apparatus
CN106970896B (en) * 2017-03-30 2020-05-12 中国人民解放军国防科学技术大学 Vector processor-oriented vectorization implementation method for two-dimensional matrix convolution
US10186011B2 (en) * 2017-04-28 2019-01-22 Intel Corporation Programmable coarse grained and sparse matrix compute hardware with advanced scheduling
US10169298B1 (en) * 2017-05-11 2019-01-01 NovuMind Limited Native tensor processor, using outer product unit
WO2018222896A1 (en) * 2017-05-31 2018-12-06 Intel Corporation Gradient-based training engine for quaternion-based machine-learning systems
US10167800B1 (en) * 2017-08-18 2019-01-01 Microsoft Technology Licensing, Llc Hardware node having a matrix vector unit with block-floating point processing
US10963780B2 (en) * 2017-08-24 2021-03-30 Google Llc Yield improvements for three-dimensionally stacked neural network accelerators
US20190102671A1 (en) * 2017-09-29 2019-04-04 Intel Corporation Inner product convolutional neural network accelerator
US11222256B2 (en) * 2017-10-17 2022-01-11 Xilinx, Inc. Neural network processing system having multiple processors and a neural network accelerator

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104992430A (en) * 2015-04-14 2015-10-21 杭州奥视图像技术有限公司 Fully-automatic three-dimensional liver segmentation method based on convolution nerve network
CN104866855A (en) * 2015-05-07 2015-08-26 华为技术有限公司 Image feature extraction method and apparatus
WO2017106469A1 (en) * 2015-12-15 2017-06-22 The Regents Of The University Of California Systems and methods for analyzing perfusion-weighted medical imaging using deep neural networks
CN106446546A (en) * 2016-09-23 2017-02-22 西安电子科技大学 Meteorological data complement method based on automatic convolutional encoding and decoding algorithm
CN106504232A (en) * 2016-10-14 2017-03-15 北京网医智捷科技有限公司 A kind of pulmonary nodule automatic testing method based on 3D convolutional neural networks

Also Published As

Publication number Publication date
KR102481256B1 (en) 2022-12-23
US11409535B2 (en) 2022-08-09
CN110083390B (en) 2020-08-25
EP3654208A1 (en) 2020-05-20
EP3651030A1 (en) 2020-05-13
CN109729734B (en) 2020-10-27
CN111860815A (en) 2020-10-30
KR102477404B1 (en) 2022-12-13
WO2019041251A1 (en) 2019-03-07
US20200057648A1 (en) 2020-02-20
CN109729734B8 (en) 2020-11-24
CN110222308A (en) 2019-09-10
EP3654209A1 (en) 2020-05-20
US20200057652A1 (en) 2020-02-20
TW201913460A (en) 2019-04-01
JP7065877B2 (en) 2022-05-12
EP3605402A1 (en) 2020-02-05
US11354133B2 (en) 2022-06-07
JP2020530916A (en) 2020-10-29
US20200057647A1 (en) 2020-02-20
CN110245751A (en) 2019-09-17
CN109729734A (en) 2019-05-07
US11347516B2 (en) 2022-05-31
CN109902804A (en) 2019-06-18
CN110222308B (en) 2020-12-29
CN110245751B (en) 2020-10-09
US20200057650A1 (en) 2020-02-20
EP3651031A1 (en) 2020-05-13
CN110245752B (en) 2020-10-09
US11334363B2 (en) 2022-05-17
EP3654210A1 (en) 2020-05-20
KR20200008544A (en) 2020-01-28
EP3605402B1 (en) 2022-08-31
US11561800B2 (en) 2023-01-24
KR102467688B1 (en) 2022-11-15
US20200057651A1 (en) 2020-02-20
US20190065208A1 (en) 2019-02-28
CN110231958B (en) 2020-10-27
US11531553B2 (en) 2022-12-20
CN110083390A (en) 2019-08-02
CN110231958A (en) 2019-09-13
EP3605402A4 (en) 2020-10-21
US11775311B2 (en) 2023-10-03
US20200057649A1 (en) 2020-02-20
KR20200037749A (en) 2020-04-09
CN110245752A (en) 2019-09-17
TWI749249B (en) 2021-12-11
KR20200037748A (en) 2020-04-09

Similar Documents

Publication Publication Date Title
CN109902804B (en) Pooling operation method and device
CN109615061B (en) Convolution operation method and device
JP6888073B2 (en) Chip equipment and related products
JP6888074B2 (en) Chip equipment and related products
CN109615062B (en) Convolution operation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant after: Zhongke Cambrian Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Beijing Zhongke Cambrian Technology Co., Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20201117

Address after: Room 611-194, R & D center building, China (Hefei) international intelligent voice Industrial Park, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Anhui Cambrian Information Technology Co., Ltd

Address before: 100000 room 644, No. 6, No. 6, South Road, Beijing Academy of Sciences

Applicant before: Zhongke Cambrian Technology Co.,Ltd.

GR01 Patent grant