WO2019129302A1 - Integrated circuit chip device and related products - Google Patents
Integrated circuit chip device and related products
- Publication number
- WO2019129302A1 (application PCT/CN2018/125801, filed as CN2018125801W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- processing circuit
- circuit
- connection relationship
- basic
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Definitions
- the present disclosure relates to the field of neural networks, and more particularly to an integrated circuit chip device and related products.
- ANN Artificial Neural Network
- CPU Central Processing Unit
- GPU Graphics Processing Unit
- Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can improve the processing speed and efficiency of a computing device.
- in a first aspect, an integrated circuit chip device is provided, including: a main processing circuit, k branch circuits, and k groups of basic processing circuits, wherein the main processing circuit is connected to each of the k branch circuits;
- each of the k branch circuits corresponds one-to-one with a group of the k groups of basic processing circuits, each group including at least one basic processing circuit;
- each branch circuit includes a compression mapping circuit configured to compress data in the neural network operation;
- the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data to the k branch circuits connected to it;
- the k branch circuits are configured to forward the transmission data between the main processing circuit and the k groups of basic processing circuits, and to control, according to the operation of the transmission data, whether to start the compression mapping circuit to compress the transmission data;
- the k groups of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmission data or the compressed transmission data, and to transmit the operation results to the main processing circuit.
- in a second aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits;
- each basic processing circuit includes a compression mapping circuit configured to compress data in the neural network operation;
- the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data to the plurality of basic processing circuits;
- the plurality of basic processing circuits are configured to control, according to the operation of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform operations in the neural network in parallel according to the transmission data or the compressed transmission data, and to transmit the operation results to the main processing circuit.
- in a third aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits;
- the plurality of basic processing circuits are arranged in an array of m rows and n columns; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k of the basic processing circuits, the k basic processing circuits being: the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column;
- the plurality of basic processing circuits include compression mapping circuits configured to compress data in the neural network operation;
- the main processing circuit is configured to perform each successive operation in the neural network operation and to exchange data with the k basic processing circuits;
- the k basic processing circuits are configured to forward data between the main processing circuit and the remaining basic processing circuits;
- the plurality of basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform operations in the neural network in parallel according to the compressed transmission data, and to transmit the operation results to the main processing circuit.
- in a fourth aspect, an integrated circuit chip device is provided, comprising: a main processing circuit and a plurality of basic processing circuits;
- the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k of the basic processing circuits, the k basic processing circuits being: the n basic processing circuits of the first row and the m basic processing circuits of the first column;
- the k basic processing circuits include compression mapping circuits configured to compress data in the neural network operation;
- the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data to the basic processing circuits connected to it;
- the k basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, and to send the compressed transmission data to the basic processing circuits connected to the k basic processing circuits;
- the plurality of basic processing circuits are configured to perform operations in the neural network in parallel according to the compressed transmission data, and to transmit the operation results to the main processing circuit.
- in a fifth aspect, a neural network computing device is provided, comprising the integrated circuit chip device provided in any one of the first to fourth aspects.
- in a sixth aspect, a combined processing device is provided, including: the neural network computing device provided in the fifth aspect, a universal interconnect interface, and a general-purpose processing device;
- the neural network computing device is connected to the general-purpose processing device via the universal interconnect interface.
- in a seventh aspect, a chip is provided, the chip integrating the device provided in any one of the first to sixth aspects above.
- in an eighth aspect, an electronic device is provided, comprising the chip of the seventh aspect.
- in a ninth aspect, a neural network operation method is provided; the method is applied to an integrated circuit chip device that comprises the integrated circuit chip device according to any one of the first to fourth aspects and is configured to perform an operation of a neural network.
- it can be seen that, with the compression mapping circuit, data blocks are compressed before operations are performed, saving transmission resources and computing resources; this gives the advantages of low power consumption and a reduced amount of computation.
- FIG. 1a is a schematic structural diagram of an integrated circuit chip device.
- FIG. 1b is a schematic structural diagram of another integrated circuit chip device.
- FIG. 1c is a schematic structural diagram of a basic processing circuit.
- FIG. 1d is a schematic diagram of a partial structure of a compression mapping circuit according to an embodiment of the present application.
- FIG. 1e is a schematic structural diagram of a neural network according to an embodiment of the present application.
- FIG. 1f is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- FIG. 1g is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- FIG. 1h is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- FIG. 1i is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- FIG. 1j is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- FIG. 1k is a schematic diagram of a partial structure of another compression mapping circuit according to an embodiment of the present application.
- Figure 2 is a schematic flow diagram of multiplying a matrix by a vector.
- Figure 2a is a schematic diagram of a matrix multiplied by a vector.
- Figure 2b is a schematic flow diagram of multiplying a matrix by a matrix.
- Figure 2c is a schematic diagram of matrix Ai multiplied by vector B.
- Figure 2d is a schematic diagram of matrix A multiplied by matrix B.
- Figure 2e is a schematic diagram of matrix Ai multiplied by matrix B.
- Figure 3a is a schematic diagram of neural network training.
- Figure 3b is a schematic diagram of a convolution operation.
- FIG. 4a is a schematic structural diagram of another integrated circuit chip device.
- FIG. 4b is a schematic structural diagram of another integrated circuit chip device.
- FIG. 4c is a schematic structural diagram of a basic processing circuit.
- Figure 5a is a schematic diagram of the use of a basic processing circuit.
- Figure 5b is a schematic diagram of transmission data of a main processing circuit.
- Figure 5c is a schematic diagram of a matrix multiplied by a vector.
- Figure 5d is a schematic structural view of an integrated circuit chip device.
- Figure 5e is a schematic diagram showing the structure of another integrated circuit chip device.
- Figure 5f is a schematic diagram of a matrix multiplied by a matrix.
- Figure 6a is a schematic diagram of convolution input data.
- Figure 6b is a schematic diagram of a convolution kernel.
- Figure 6c is a schematic diagram of the operation window of a three-dimensional data block of input data.
- Figure 6d is a schematic diagram of another operational window of a three-dimensional data block of input data.
- Figure 6e is a schematic diagram of still another operational window of a three-dimensional data block of input data.
- FIG. 7 is a schematic structural diagram of a neural network chip according to an embodiment of the present application.
- the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data to the plurality of basic processing circuits;
- the k groups of basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmitted data, and to transmit the operation results to the main processing circuit.
- the apparatus further includes k branch circuits; the main processing circuit is connected to each of the k branch circuits, and each of the k branch circuits corresponds to one group of the k groups of basic processing circuits and is configured to forward transmission data between the main processing circuit and that group of basic processing circuits.
- the basic processing circuit includes a compression mapping circuit configured to compress data in the neural network operation; the k groups of basic processing circuits are specifically configured to control, according to the operation of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform the operations in the neural network in parallel according to the transmission data or the compressed transmission data, and to transmit the operation results to the main processing circuit.
- the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the circuits connected to it; and to broadcast the broadcast data block to the circuits connected to it. The basic processing circuit is configured to start the compression mapping circuit according to the operation control to compress the basic data block and the broadcast data block, then to perform an inner product operation to obtain an operation result, and to send the operation result to the main processing circuit. The main processing circuit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction; wherein the data block to be calculated is at least one input neuron to be calculated and/or at least one weight.
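- purely as an illustration of this split/distribute/broadcast scheme (the matrices, the row-wise split, and the number of circuits below are invented for this sketch, not taken from the patent):

```python
# Illustrative sketch (not the patent's implementation) of dividing the data
# blocks of a matrix product A @ B: A is treated as the distribution data
# block and split row-wise into basic data blocks; B is the broadcast block.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # distribution data block
B = [[1, 0], [0, 1]]                   # broadcast data block

num_circuits = 2                        # hypothetical number of basic circuits
basic_blocks = [A[i::num_circuits] for i in range(num_circuits)]
# Each basic processing circuit would receive one basic block plus all of B,
# compute its partial product, and return the result rows to the main circuit.
```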
- the branch circuit includes a compression mapping circuit configured to compress data in the neural network operation; the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data to the k branch circuits connected to it; the k branch circuits are configured to forward the transmission data between the main processing circuit and the k groups of basic processing circuits, and to control, according to the operation of the transmission data, whether to start the compression mapping circuit to compress the transmission data; the k groups of basic processing circuits are configured to perform operations in the neural network in parallel according to the transmission data or the compressed transmission data, and to transmit the operation results to the main processing circuit.
- the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the k branch circuits connected to it; and to broadcast the broadcast data block to the k branch circuits connected to it;
- the k branch circuits are configured to receive the basic data blocks and the broadcast data block, to start the compression mapping circuit to compress the basic data blocks and the broadcast data block, and to forward the compressed basic data blocks and the compressed broadcast data block to the k groups of basic processing circuits;
- the basic processing circuits are configured to perform inner product operations on the compressed basic data blocks and the compressed broadcast data block to obtain operation results, and to send the operation results to the main processing circuit; the main processing circuit is configured to process the operation results to obtain the instruction result of the data block to be calculated and the operation instruction; wherein the distribution data block and the broadcast data block are at least one input neuron to be calculated and/or at least one weight.
- the main processing circuit is specifically configured to broadcast the broadcast data block to the k branch circuits in a single broadcast.
- the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast the plurality of partial broadcast data blocks to the k branch circuits over multiple broadcasts.
- the basic processing circuit is specifically configured to perform an inner product operation on the partial broadcast data block and the basic data block to obtain an inner product result, to accumulate the inner product results to obtain a partial operation result, and to send the partial operation result to the main processing circuit.
- the basic processing circuit is specifically configured to multiplex the partial broadcast data block, performing the inner product operations of that partial broadcast data block with each of n basic data blocks to obtain n inner product results, accumulating each of the n results separately to obtain n partial operation results, and sending the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
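- a small numeric sketch of this multiplexing (the data and variable names are invented for illustration; they are not from the patent):

```python
# Sketch of multiplexing one partial broadcast data block against n basic
# data blocks, accumulating one partial result per basic data block.
partial_broadcast = [1.0, 2.0]                 # one slice of the broadcast block
basic_blocks = [[3.0, 4.0], [5.0, 6.0]]        # n = 2 basic data blocks
partial_results = [0.0, 0.0]                   # running accumulators

for n, basic in enumerate(basic_blocks):
    inner = sum(a * b for a, b in zip(partial_broadcast, basic))
    partial_results[n] += inner                # accumulate per basic block
# partial_results == [11.0, 17.0]; these are sent to the main processing circuit
```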
- the main processing circuit includes: a main register or a main on-chip buffer circuit;
- the branch circuit includes: a basic register or a basic on-chip buffer circuit;
- the basic processing circuit includes: a basic register or a basic on-chip buffer circuit.
- the main processing circuit includes one or any combination of: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a compression mapping circuit, and a data rearrangement circuit.
- the data is one or any combination of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, and an n-dimensional data block.
- if the operation instruction is a multiplication instruction, the main processing circuit determines that the multiplier data block is the broadcast data block and the multiplicand data block is the distribution data block;
- if the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is the broadcast data block and the convolution kernel is the distribution data block.
- the neural network operations involved in the present application include one or any combination of: a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation, and an activation operation.
- FIG. 1a is a schematic structural diagram of an integrated circuit chip device.
- the chip device includes: a main processing circuit, basic processing circuits, and (optionally) branch processing circuits. The main processing circuit is connected to each of the k branch circuits, and each of the k branch circuits corresponds to one group of the k groups of basic processing circuits, each group including at least one basic processing circuit.
- the compression mapping circuit can be disposed in the basic processing circuit or in the branch circuit, as shown by the dashed boxes. The compression mapping circuit is used to compress data, as described later in this application.
- the main processing circuit may include a register and/or an on-chip buffer circuit, and may further include: a control circuit, a vector operator circuit, an arithmetic and logic unit (ALU) circuit, an accumulator circuit, a direct memory access (DMA) circuit, and other circuits; in practical applications, other circuits such as a conversion circuit (for example, a matrix transposition circuit), a data rearrangement circuit, or an activation circuit may also be added to the main processing circuit;
- the main processing circuit may include a compression mapping circuit, which may be used to compress received or transmitted data; in practical applications, data equal to 0 or smaller than a preset threshold (such as 0.1) may be eliminated.
- the preset threshold is customized on the user side or the terminal device side, for example 0.1 or 0.05. This application does not limit the specific form of the compression mapping circuit.
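- as a minimal sketch of this elimination step (the function name, list representation, and the 0.1 threshold are illustrative assumptions, not the patent's implementation):

```python
# Keep only entries whose absolute value exceeds the preset threshold,
# recording the surviving positions so the data can later be aligned.
def compress(values, threshold=0.1):
    kept_positions = [i for i, v in enumerate(values) if abs(v) > threshold]
    kept_values = [values[i] for i in kept_positions]
    return kept_positions, kept_values

positions, data = compress([0.0, 0.5, 0.05, -0.3])
# positions == [1, 3], data == [0.5, -0.3]
```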
- the compression process will be specifically explained below.
- the main processing circuit further includes a data transmitting circuit, a data receiving circuit, or an interface. A data distribution circuit and a data broadcasting circuit may be integrated into the data transmitting circuit, or they may be set separately; in practical applications, the data transmitting circuit and the data receiving circuit may also be integrated to form a single data transceiving circuit.
- broadcast data refers to data that the main processing circuit needs to transmit to every basic processing circuit.
- distribution data refers to data that the main processing circuit needs to send selectively to some of the basic processing circuits; the specific selection may be determined by the main processing circuit according to its load and the calculation manner.
- for the broadcast transmission method, the broadcast data is transmitted to each basic processing circuit in broadcast form. The broadcast data may be sent to each basic processing circuit in a single broadcast or over multiple broadcasts; the specific embodiments of the present application do not limit the number of broadcasts.
- for the distribution transmission method, the distribution data is selectively transmitted to some of the basic processing circuits.
- when distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and the data received by each basic processing circuit may be different; of course, the data received by some basic processing circuits may also be the same.
- when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and every receiving basic processing circuit receives the same data.
- the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of a vector and a constant; or arbitrary operations on each element in a vector.
- the successive operations may specifically be: addition, subtraction, multiplication, or division of a vector and a constant, an activation operation, an accumulation operation, and the like.
- Each of the base processing circuits may include a base register and/or a base on-chip buffer circuit; each of the base processing circuits may further include one or any combination of an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like.
- the inner product operator circuit, the vector operator circuit, and the accumulator circuit may be integrated into one circuit, or they may each be provided as separate circuits.
- the chip device may optionally further comprise one or more branch processing circuits. When the chip device has branch processing circuits, the main processing circuit is connected to the branch processing circuits and the branch processing circuits are connected to the basic processing circuits; the inner product operator circuit of the basic processing circuit is configured to perform inner product operations between data blocks; the control circuit of the main processing circuit controls the data receiving circuit or the data transmitting circuit to send and receive external data; and the control circuit controls the data transmitting circuit to distribute external data to the branch processing circuits;
- the branch processing circuit is configured to transmit and receive data from the main processing circuit or the basic processing circuits.
- the structure shown in FIG. 1a is suitable for the calculation of complex data: because the number of circuits that can be connected to the main processing circuit is limited, branch processing circuits need to be added between the main processing circuit and the basic processing circuits so that more basic processing circuits can be accessed, enabling computation on complex data blocks.
- the connection structure of the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-type structure of FIG. 1a.
- optionally, the connection from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the connection from the basic processing circuits to the main processing circuit is a gather structure. Broadcast, distribution, and gather structures are defined as follows: the number of basic processing circuits is larger than the number of main processing circuits, that is, one main processing circuit corresponds to a plurality of basic processing circuits, so the connection from the main processing circuit to the plurality of basic processing circuits is a broadcast or distribution structure, and conversely, the connection from the plurality of basic processing circuits to the main processing circuit is a gather structure.
- the basic processing circuit receives data distributed or broadcast by the main processing circuit, stores it in the on-chip buffer of the basic processing circuit, performs operations to generate results, and can transmit data to the main processing circuit.
- the data involved in the basic processing circuit may be compressed data; the specific implementation of the compression processing is described later.
- each of the basic processing circuits may include a compression mapping circuit, or compression mapping circuits may be configured in only some of the basic processing circuits; the compression mapping circuit may be configured to compress received or transmitted data. This application does not limit the specific form of the compression mapping circuit.
- the vector operator circuit of the basic processing circuit may perform vector operations on two compressed vectors, and the inner product operator circuit of the basic processing circuit may perform an inner product operation on two compressed vectors; the accumulator circuit may also accumulate the results of the inner product operations.
- the two vectors can be stored in the on-chip buffer and/or registers, and the basic processing circuit can extract the two vectors to perform the operation as needed by the actual computation. The operations include, but are not limited to, inner product operations, multiplication operations, addition operations, or other operations.
- the result of the inner product operation can be accumulated into the on-chip buffer and/or the register; the advantage of this alternative is that it reduces the amount of data transferred between the basic processing circuit and the main processing circuit, improves the efficiency of the operation, and reduces the power consumption of data transmission.
- alternatively, the result of the inner product operation is transmitted directly as a result without accumulation; the advantage of this technical solution is that it reduces the amount of computation inside the basic processing circuit and improves the operational efficiency of the basic processing circuit.
- each of the basic processing circuits may perform inner product operations of multiple sets of two vectors, or may accumulate results of multiple sets of inner product operations separately;
- multiple sets of two vector data can be stored in an on-chip buffer and/or register
- results of the multiple sets of inner product operations may be separately added to the on-chip buffer and/or registers
- the result of each group's inner product operation may be transmitted directly as a result without accumulation;
- each of the basic processing circuits may perform inner product operations of one vector against a plurality of vectors (a "one-to-many" inner product, in which one of the two vectors of each group is shared across the groups), and accumulate the inner product result corresponding to each vector separately. This technical solution allows the same set of weights to be used for multiple calculations on different input data, increasing data multiplexing, reducing the amount of data transmitted inside the basic processing circuit, improving calculation efficiency, and reducing power consumption.
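- a sketch of this "one-to-many" inner product with a shared weight vector (all names and values below are illustrative assumptions, not the patent's implementation):

```python
# One shared weight vector is reused against several input vectors,
# with a separate accumulator per input vector.
shared_weights = [0.5, -1.0, 2.0]
inputs = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]

accumulators = [0.0] * len(inputs)
for i, x in enumerate(inputs):
    accumulators[i] += sum(w * v for w, v in zip(shared_weights, x))
# accumulators == [4.5, 9.0]
```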
- the data source of the vector shared by the groups and the data source of the other vector of each group may differ:
- the vectors shared by the groups are from the broadcast or distribution of the main processing circuit or the branch processing circuit;
- the vectors shared by each group are from an on-chip cache
- the vectors shared by the groups are from registers
- another non-shared vector of each group is from the broadcast or distribution of the main processing circuit or the branch processing circuit;
- another non-shared vector of each group is from a register
- when performing multiple groups of inner product operations, the shared vector may be retained in any number of copies in the on-chip buffer and/or registers of the basic processing circuit;
- one copy of the shared vector may be retained for each group's inner product;
- alternatively, only one copy of the shared vector may be kept;
- results of the multiple sets of inner product operations may be separately added to the on-chip buffer and/or the register;
- the result of the inner product operation of each group may be directly transmitted as a result without being accumulated;
- the device includes a main processing circuit (which can perform vector operations) and multiple basic processing circuits (which can perform inner product operations).
- the advantage of this combination is that the device can not only use the basic processing circuits to perform matrix and vector multiplication operations, but can also use the main processing circuit to perform any other vector operations, so that with a limited hardware configuration the device can complete more calculations faster; this reduces the number of data transmissions outside the device, improves computational efficiency, and reduces power consumption.
- in addition, a compression mapping circuit can be provided in the basic processing circuits and/or the main processing circuit, so that the amount of data to be calculated can be reduced when performing neural network calculations. The chip can dynamically allocate which circuit performs the data compression processing based on the computational load of each circuit (mainly the main processing circuit and the basic processing circuits), which reduces the complexity of data calculation and lowers power consumption without affecting the computational efficiency of the chip. The manner of allocation includes, but is not limited to, load balancing, minimum-load allocation, and the like.
- the apparatus shown in FIG. 1b includes a main processing circuit and basic processing circuits, and optionally branch processing circuits. Specifically, the device shown in FIG. 1b includes a main processing circuit and N basic processing circuits, where the main processing circuit (whose specific structure is shown in FIG. 1c) can be connected to the N basic processing circuits directly or indirectly. If indirectly connected, an optional scheme may include N/4 branch processing circuits as shown in FIG. 1a, each branch processing circuit connected to four basic processing circuits. For the circuits included in the main processing circuit and the N basic processing circuits, reference may be made to the description of FIG. 1a above; details are not repeated here.
- the compression mapping circuit may also be disposed in the branch processing circuits. The number of basic processing circuits connected to each branch processing circuit need not be limited to four; the manufacturer can configure it according to actual needs.
- the main processing circuit and/or the N basic processing circuits may each include a compression mapping circuit: the main processing circuit alone may include one, the N basic processing circuits or a part of them may include one, or both the main processing circuit and the N basic processing circuits (or a part of them) may include one.
- the main processing circuit may dynamically allocate the executor of the data compression step according to the neural network calculation instruction. Specifically, the main processing circuit may determine, according to its own load, whether to perform compression processing on the received data. The load value may be divided into a plurality of intervals, each interval corresponding to one executor of the data compression step. Taking three intervals as an example: in interval 1 the load value is low, and the data compression step can be performed by the main processing circuit alone; in interval 2 the load value lies between interval 1 and interval 3, and the data compression step can be performed by either the main processing circuit or the N basic processing circuits; in interval 3 the load value is high, and the data compression step can be performed by the N basic processing circuits.
- the allocation may be made in an explicit manner: for example, the main processing circuit may be configured with a special indication or instruction, and when a basic processing circuit receives the special indication or instruction, it performs the data compression step; if it does not receive the special indication or instruction, it does not perform the data compression step.
- the allocation may also be made in an implied manner: for example, when a basic processing circuit receives sparse data (that is, data that includes zeros, or includes more than a preset number of values smaller than a preset threshold) and determines that an inner product operation needs to be performed, it compresses the sparse data.
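- a hedged sketch of the interval-based allocation described above (the numeric interval boundaries are invented for illustration; the patent does not specify them):

```python
# Map a load value to the executor of the data compression step,
# following the three-interval scheme described in the text.
def choose_compression_executor(load):
    if load < 30:      # interval 1: low load on the main circuit
        return "main processing circuit"
    elif load < 70:    # interval 2: either executor may compress
        return "main or basic processing circuits"
    else:              # interval 3: high load on the main circuit
        return "basic processing circuits"
```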
- the data in the present application may be the input neurons or weights of a neural network, and may be matrix data, vector data, and so on, which is not limited here. That is, the data or data blocks described below may be the input neurons or weights of a neural network, embodied in the form of a matrix or a vector.
- because a neural network is an algorithm with high computational complexity and intensive memory access, the more weights there are, the more the amounts of computation and memory access increase.
- in particular, when a weight is small (for example, 0, or smaller than a set value), it is necessary to compress the data with such small weights. Data compression processing is most effective when applied in sparse neural networks: it reduces the workload of data calculation, reduces data overhead, and increases the data calculation rate.
- the input data includes, but is not limited to, at least one input neuron and/or at least one weight.
- in the case where the compression mapping circuit compresses both input neurons and weights: the compression mapping circuit 101 may compress the input data to obtain compressed input data; the input data includes at least one input neuron and at least one weight, and the compressed input data includes the compressed input neurons and the compressed weights.
- the input data includes at least one input neuron and at least one weight.
- the compression mapping circuit 101 determines whether the absolute value of each of the at least one input neuron is greater than the first threshold. When the absolute value of an input neuron is less than or equal to the first threshold, the compression mapping circuit 101 deletes that input neuron; when the absolute value is greater than the first threshold, it retains the input neuron. The compression mapping circuit 101 outputs the input neurons remaining after deletion as the compressed input neurons.
- the compression mapping circuit 101 acquires the connection relationship data of the input neurons, which indicates the positions of the input neurons whose absolute values are greater than the first threshold.
- the compression mapping circuit 101 determines whether the absolute value of each of the at least one weight is greater than the second threshold. When the absolute value of a weight is less than or equal to the second threshold, the compression mapping circuit 101 deletes that weight, selects the relevant weights from the remaining weights according to the connection relationship data of the input neurons, and outputs them as the compressed weights.
- in another optional embodiment, the input data includes at least one input neuron and at least one weight. The compression mapping circuit 101 determines whether the absolute value of each of the at least one weight is greater than the second threshold. When the absolute value of a weight is less than or equal to the second threshold, the compression mapping circuit 101 deletes that weight; when the absolute value is greater than the second threshold, it retains the weight. The compression mapping circuit 101 outputs the weights remaining after deletion as the compressed weights.
- the compression mapping circuit 101 obtains the connection relationship data of the weights, which represents the connection relationships between the at least one input neuron and the output neurons. The compression mapping circuit 101 determines whether the absolute value of each of the at least one input neuron is greater than the first threshold; if not, the compression mapping circuit 101 deletes the input neuron, selects the relevant input neurons from the remaining input neurons according to the connection relationship data of the weights, and outputs them as the compressed input neurons.
- the compression mapping circuit 101 stores the compressed input neurons and the compressed weights in the storage circuit in a one-to-one correspondence format.
- specifically, the compression mapping circuit 101 stores them in one-to-one correspondence by taking each compressed input neuron and its corresponding compressed weight as a data set and storing the data set in the storage circuit.
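- a sketch of this scheme under simplifying assumptions (both thresholds set to 0, one weight per input neuron, and illustrative values; none of this is the patent's implementation):

```python
# Filter input neurons by the first threshold, record their positions as
# connection relationship data, select the corresponding weights by the
# second threshold, and store neuron/weight pairs as data sets.
neurons = [1.0, 0.0, 3.0, 5.0]
weights = [1.0, 0.0, 3.0, 4.0]          # one weight per input neuron here
first_threshold, second_threshold = 0.0, 0.0

kept = [i for i, v in enumerate(neurons) if abs(v) > first_threshold]
data_sets = [(neurons[i], weights[i]) for i in kept
             if abs(weights[i]) > second_threshold]
# data_sets == [(1.0, 1.0), (3.0, 3.0), (5.0, 4.0)]
```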
- the compression mapping circuit 101 includes:
- the first sparse processing unit 1011 is configured to perform compression processing on the second input data to obtain third output data and second output data, and transmit the third output data to the first data processing unit 1012.
- the first data processing unit 1012 is configured to receive the first input data and receive the third output data, and output the first output data according to the third output data and the first input data.
- when the first input data includes at least one input neuron and the second input data includes at least one weight: the first output data is the compressed input neurons, the second output data is the compressed weights, and the third output data is the connection relationship data of the weights. When the first input data includes at least one weight and the second input data includes at least one input neuron: the first output data is the compressed weights, the second output data is the compressed input neurons, and the third output data is the connection relationship data of the input neurons.
- specifically, when the second input data is weights, wij denotes the weight between the i-th input neuron and the j-th output neuron. The first sparse processing unit 1011 determines the connection relationship data (that is, the third output data) according to the weights, and deletes the weights whose absolute values are less than or equal to the second threshold to obtain the compressed weights (that is, the second output data). When the second input data is input neurons, the first sparse processing unit 1011 obtains the connection relationship data according to the input neurons, and deletes the input neurons whose absolute values are less than or equal to the first threshold to obtain the compressed input neurons.
- the first threshold may be 0.1, 0.08, 0.05, 0.02, 0.01, 0 or other values.
- the second threshold may be 0.1, 0.08, 0.06, 0.05, 0.02, 0.01, 0 or other values.
- the first threshold and the second threshold may be the same or may be different.
- connection relationship data may be expressed in the form of a step index or a direct index.
- connection relationship data represented in direct index form is a character string composed of 0s and 1s. When the second input data is weights, 0 indicates that the absolute value of the weight is less than or equal to the second threshold, that is, there is no connection between the input neuron and the output neuron corresponding to that weight; 1 indicates that the absolute value of the weight is greater than the second threshold, that is, the input neuron corresponding to that weight is connected to the output neuron.
- connection relationship data in direct index form has two orders of representation: the connection states of each output neuron with all input neurons form a string of 0s and 1s representing the connection relationships of the weights; or the connection states of each input neuron with all output neurons form such a string. When the second input data is input neurons, 0 indicates that the absolute value of the input neuron is less than or equal to the first threshold, and 1 indicates that the absolute value of the input neuron is greater than the first threshold.
- the connection relationship data may also be embodied in the form of a vector or matrix, where 0 indicates that the input neuron or weight at that position is 0 or smaller than the corresponding threshold, and 1 indicates that it is nonzero or greater than the threshold; this is not limited in this application. The connection relationship data may also be referred to as a mask matrix or mask vector.
- the connection relationship data represented by the step index is a character string composed of distance values. When the second input data is weights, the string consists of the distances between each input neuron connected to an output neuron and the previous input neuron connected to that output neuron; when the second input data is input neurons, the string consists of the distances between each input neuron whose absolute value is greater than the first threshold and the previous input neuron whose absolute value is greater than the first threshold.
- FIG. 1e is a schematic diagram of a neural network according to an embodiment of the present application.
- as shown in FIG. 1e, the first input data are the input neurons i1, i2, i3, and i4, and the second input data are weights. For the output neuron o1, the weights are w11, w21, w31, and w41; for the output neuron o2, the weights are w12, w22, w32, and w42. The weights w21, w12, and w42 have the value 0, and their absolute values are less than the threshold 0.01, so the first sparse processing unit 1011 determines that the input neuron i2 is not connected to the output neuron o1, and the input neurons i1 and i4 are not connected to the output neuron o2.
- if the connection relationship data is represented by the connection state of each output neuron with all input neurons, the connection relationship data of the output neuron o1 is "1011" and that of the output neuron o2 is "0110" (that is, the overall connection relationship data is "10110110"); if it is represented by the connection state of each input neuron with all output neurons, the connection relationship data of the input neurons i1, i2, i3, and i4 are "10", "01", "11", and "10" respectively (that is, the overall connection relationship data is "10011110").
- for the output neuron o1, the compression mapping circuit 101 takes i1 and w11, i3 and w31, and i4 and w41 each as a data set and stores these data sets in the storage circuit; for the output neuron o2, it takes i2 and w22 and i3 and w32 each as a data set and stores them in the storage circuit.
- the second output data are w11, w31, and w41 for the output neuron o1, and w22 and w32 for the output neuron o2.
- when the second input data are the input neurons i1, i2, i3, and i4 with the values 1, 0, 3, and 5, the connection relationship data (that is, the third output data) is "1011", and the second output data are 1, 3, and 5.
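- a sketch reproducing the direct-index strings of this example (the nonzero weight values below are chosen arbitrarily; only the zero positions follow the example):

```python
# Build a direct-index string: "1" where the absolute value exceeds the
# threshold, "0" otherwise, in the per-output-neuron order of the example.
second_threshold = 0.01
w_o1 = [1.0, 0.0, 3.0, 4.0]   # w11, w21, w31, w41 (w21 = 0)
w_o2 = [0.0, 2.0, 5.0, 0.0]   # w12, w22, w32, w42 (w12 = w42 = 0)

def direct_index(ws, threshold):
    return "".join("1" if abs(w) > threshold else "0" for w in ws)

assert direct_index(w_o1, second_threshold) == "1011"
assert direct_index(w_o2, second_threshold) == "0110"
# Concatenated per-output-neuron form: "10110110"
```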
- in the step index representation, the first input data includes the input neurons i1, i2, i3, and i4, and the second input data are weights. For the output neuron o1 the weights are w11, w21, w31, and w41, and for the output neuron o2 the weights are w12, w22, w32, and w42, where the weights w21, w12, and w42 have the value 0. The sparse processing unit 1011 determines that the input neurons i1, i3, and i4 are connected to the output neuron o1, and the input neurons i2 and i3 are connected to the output neuron o2.
- the connection relationship data between the output neuron o1 and the input neurons is "021". The first digit "0" indicates that the distance between the first input neuron connected to o1 and the first input neuron overall is 0, that is, the first input neuron connected to o1 is i1; the second digit "2" indicates that the distance between the second input neuron connected to o1 and the first input neuron connected to o1 (that is, i1) is 2, that is, the second input neuron connected to o1 is i3; the third digit "1" indicates that the distance between the third input neuron connected to o1 and the second input neuron connected to o1 is 1, that is, the third input neuron connected to o1 is i4.
- the connection relationship data between the output neuron o2 and the input neurons is "11". The first digit "1" indicates that the distance between the first input neuron connected to o2 and the first input neuron overall (that is, i1) is 1, that is, the first input neuron connected to o2 is i2; the second digit "1" indicates that the distance between the second input neuron connected to o2 and the first input neuron connected to o2 is 1, that is, the second input neuron connected to o2 is i3.
- as before, for the output neuron o1 the compression mapping circuit 101 takes i1 and w11, i3 and w31, and i4 and w41 each as a data set and stores these data sets in the storage circuit; for the output neuron o2 it takes i2 and w22 and i3 and w32 each as a data set and stores them in the storage circuit.
- the second output data are w11, w31, and w41 for the output neuron o1, and w22 and w32 for the output neuron o2.
- when the second input data are the input neurons i1, i2, i3, and i4 with the values 1, 0, 3, and 5, the connection relationship data (that is, the third output data) is "021", and the second output data are 1, 3, and 5.
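- a sketch reproducing the step-index strings "021" and "11" of this example (the weight values are illustrative; only the zero positions follow the example):

```python
# Encode step-index connection data: each digit is the distance from the
# previous connected position (the first digit is measured from position 0).
def step_index(ws, threshold=0.01):
    out, prev = [], 0
    for i, w in enumerate(ws):
        if abs(w) > threshold:
            out.append(i - prev)   # distance to the previous connected neuron
            prev = i
    return "".join(str(d) for d in out)

assert step_index([1.0, 0.0, 3.0, 4.0]) == "021"   # o1: i1, i3, i4 connected
assert step_index([0.0, 2.0, 5.0, 0.0]) == "11"    # o2: i2, i3 connected
```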
- when the first input data are input neurons and the second input data are weights, the third output data is the connection relationship data between the output neurons and the input neurons. After receiving the input neurons, the first data processing unit 1012 culls the input neurons whose absolute values are less than or equal to the first threshold, selects from the remaining input neurons those related to the weights according to the connection relationship data, and outputs them as the first output data.
- for example, the input neurons i1, i2, i3, and i4 have the values 1, 0, 3, and 5 respectively. For the output neuron o1, the third output data (that is, the connection relationship data) is "021" and the second output data are w11, w31, and w41. The first data processing unit 1012 culls the zero-valued input neuron from i1, i2, i3, and i4, obtaining the input neurons i1, i3, and i4. It determines from the third output data "021" that i1, i3, and i4 are all connected to the output neuron, so it outputs the input neurons i1, i3, and i4 as the first output data, that is, it outputs 1, 3, and 5.
- when the first input data are weights and the second input data are input neurons, the third output data is the connection relationship data of the input neurons. After receiving the weights w11, w21, w31, and w41, the first data processing unit 1012 culls the weights whose absolute values are less than or equal to the second threshold, selects from the remaining weights those related to the input neurons according to the connection relationship data, and outputs them as the first output data.
- for example, the weights w11, w21, w31, and w41 have the values 1, 0, 3, and 4 respectively. For the output neuron o1, the third output data (that is, the connection relationship data) is "1011" and the second output data are the input neurons i1, i3, and i4. The first data processing unit 1012 culls the zero-valued weight from w11, w21, w31, and w41, obtaining the weights w11, w31, and w41. It determines from the third output data "1011" that the input neuron i2 is not connected, so it outputs the weights w11, w31, and w41, that is, 1, 3, and 4, as the first output data.
- when the compression mapping circuit 101 compresses both input neurons and weights: the third input data and the fourth input data are respectively at least one weight and at least one input neuron. The compression mapping circuit 101 determines the positions of the input neurons whose absolute values are greater than the first threshold and obtains the connection relationship data of the input neurons; it determines the positions of the weights whose absolute values are greater than the second threshold and obtains the connection relationship data of the weights.
- the compression mapping circuit 101 then obtains new connection relationship data from the connection relationship data of the weights and the connection relationship data of the input neurons; the new connection relationship data represents the relationships between the input neurons whose absolute values are greater than the first threshold and the output neurons, together with the corresponding weight values.
- the compression mapping circuit 101 obtains the compressed input neurons and the compressed weights from the new connection relationship data, the at least one input neuron, and the at least one weight, and stores the compressed input neurons and compressed weights in the storage circuit in one-to-one correspondence, specifically by taking each compressed input neuron and its corresponding compressed weight as a data set and storing the data set in the storage circuit.
- the sparse processing unit 1011 in the compression mapping circuit 101 performs sparsifying compression on the input neurons or weights, reducing the number of weights or input neurons, thereby reducing the number of operations performed by the arithmetic unit and improving operational efficiency.
- the above compression mapping circuit 101 includes:
- the second sparse processing unit 1013 is configured to: after receiving the third input data, obtain the first connection relationship data according to the third input data, and transmit the first connection relationship data to the connection relationship processing unit 1015;
- the third sparse processing unit 1014 is configured to: after receiving the fourth input data, obtain second connection relationship data according to the fourth input data, and transmit the second connection relationship data to the connection relationship processing unit 1015;
- the connection relationship processing unit 1015 is configured to obtain third connection relationship data from the first connection relationship data and the second connection relationship data, and to transmit the third connection relationship data to the second data processing unit 1016;
- the second data processing unit 1016 is configured to, after receiving the third input data, the fourth input data, and the third connection relationship data, compress the third input data and the fourth input data according to the third connection relationship data to obtain fourth output data and fifth output data;
- when the third input data includes at least one input neuron and the fourth input data includes at least one weight: the first connection relationship data is the connection relationship data of the input neurons, the second connection relationship data is the connection relationship data of the weights, the fourth output data is the compression-processed input neurons, and the fifth output data is the compression-processed weights;
- when the third input data includes at least one weight and the fourth input data includes at least one input neuron: the first connection relationship data is the connection relationship data of the weights, the second connection relationship data is the connection relationship data of the input neurons, the fourth output data is the compression-processed weights, and the fifth output data is the compression-processed input neurons.
- the first connection relationship data is a character string indicating the positions of the input neurons (of the at least one input neuron) whose absolute values are greater than the first threshold; alternatively, the first connection relationship data is a character string indicating whether there is a connection between the input neurons and the output neurons.
- the second connection relationship data is a character string indicating the positions of the weights (of the at least one weight) whose absolute values are greater than the second threshold; alternatively, the second connection relationship data is a character string indicating whether there is a connection between the input neurons and the output neurons.
- the first connection relationship data, the second connection relationship data, and the third connection relationship data may each be represented in the form of a step index or a direct index.
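- For illustration only, the following Python sketch shows one plausible reading of the two representations: a direct index marks retained positions with '1', and a step index stores the first retained position followed by the gaps between successive retained positions (an assumption consistent with the "1011"/"021" example given later in this text).

```python
def direct_to_positions(direct):
    """Direct index: '1' marks a retained input neuron/weight position."""
    return [i for i, bit in enumerate(direct) if bit == "1"]

def positions_to_step(positions):
    """Step index: first retained position, then gaps between retained positions."""
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

direct = "1011"                            # i1, i3, i4 retained; i2 culled
positions = direct_to_positions(direct)    # [0, 2, 3]
step = positions_to_step(positions)        # [0, 2, 1] -> "021"
print(positions, step)
```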
- the connection relationship processing unit 1015 performs compression processing on the first connection relationship data and the second connection relationship data to obtain the third connection relationship data, which may be expressed in the form of a direct index or a step index.
- specifically, when the first connection relationship data and the second connection relationship data are both expressed in the form of a direct index, the connection relationship processing unit 1015 performs a bitwise AND operation on the first connection relationship data and the second connection relationship data to obtain the third connection relationship data, which is expressed in the form of a direct index.
- it should be noted that the character strings representing the first connection relationship data and the second connection relationship data are stored in the memory in order of physical address, either from high to low or from low to high.
- when the first connection relationship data and the second connection relationship data are both expressed in the form of a step index, and the character strings representing them are stored in order of physical address from low to high, the connection relationship processing unit 1015 accumulates, for each element in the character string of the first connection relationship data, the elements stored at physical addresses lower than its own; the resulting new elements constitute fourth connection relationship data. In the same manner, the connection relationship processing unit 1015 applies the same processing to the character string of the second connection relationship data to obtain fifth connection relationship data.
- then the connection relationship processing unit 1015 selects the identical elements from the character string of the fourth connection relationship data and the character string of the fifth connection relationship data, and sorts them in ascending order of element value to form a new character string.
- the connection relationship processing unit 1015 subtracts from each element in the new character string its adjacent preceding element (the nearest element whose value is smaller); applying this operation to every element of the new character string yields the third connection relationship data.
- for example, the connection relationship processing unit 1015 accumulates each element in the character string of the first connection relationship data with the elements preceding it to obtain the fourth connection relationship data "01234"; similarly, the connection relationship processing unit 1015 applies the same processing to the character string of the second connection relationship data to obtain the fifth connection relationship data "024".
- the connection relationship processing unit 1015 selects the identical elements from the fourth connection relationship data "01234" and the fifth connection relationship data "024" to obtain the new character string "024".
- the connection relationship processing unit 1015 subtracts from each element in the new character string its adjacent preceding element, that is, 0, (2-0), (4-2), to obtain the third connection relationship data "022".
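- As a hedged illustration of the step-index combination just described, the sketch below accumulates each step string into absolute positions, intersects them, and re-encodes the result as steps. The inputs "01111" and "022" are assumed values, chosen because they reproduce the intermediate data "01234" and "024" and the result "022" given in the text.

```python
from itertools import accumulate

def step_and(first, second):
    """Combine two step-index strings: cumulative sums give absolute
    positions; intersect them; re-encode the common positions as steps."""
    pos1 = list(accumulate(first))            # fourth connection relationship data
    pos2 = list(accumulate(second))           # fifth connection relationship data
    common = sorted(set(pos1) & set(pos2))    # identical elements, ascending
    return [common[0]] + [b - a for a, b in zip(common, common[1:])]

first = [0, 1, 1, 1, 1]    # assumed "01111" -> accumulates to "01234"
second = [0, 2, 2]         # assumed "022"   -> accumulates to "024"
print(step_and(first, second))   # [0, 2, 2] -> "022", the third connection relationship data
```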
- when one of the first connection relationship data and the second connection relationship data is expressed in the form of a step index and the other in the form of a direct index, the connection relationship processing unit 1015 converts the data represented by the step index into a direct index representation, or converts the data represented by the direct index into a step index representation; the connection relationship processing unit 1015 then performs compression processing according to the method above to obtain the third connection relationship data (i.e., the fifth output data).
- optionally, the connection relationship processing unit 1015 may also convert both the first connection relationship data and the second connection relationship data into connection relationship data represented by a step index, and then compress the first connection relationship data and the second connection relationship data according to the method above to obtain the third connection relationship data.
- the third input data may be input neurons or weights, and the fourth input data may be input neurons or weights, but the third input data and the fourth input data are of different types (when one is input neurons, the other is weights).
- the second data processing unit 1016 selects data related to the third connection relationship data from the third input data (ie, input neurons or weights) as the fourth output data according to the third connection relationship data;
- the second data processing unit 1016 selects data related to the third connection relationship data from the fourth input data as the fifth output data according to the third connection relationship data.
- the second data processing unit 1016 stores the compressed input neurons and the corresponding compressed processed weights as one data set, and stores the data set in the storage circuit.
- the third input data includes input neurons i1, i2, i3, and i4, and the fourth input data includes weights w11, w21, w31, and w41, and the third connection relationship data is expressed in a direct index manner.
- the fourth output data output by the second data processing unit 1016 is the input neurons i1 and i3, and the output fifth output data is the weights w11 and w31.
- the second data processing unit 1016 takes the input neuron i1 and the weight w11 as one data set and the input neuron i3 and the weight w31 as another data set, and stores the two data sets in the storage circuit.
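- A minimal sketch of the selection performed by the second data processing unit 1016. The third connection relationship data "1010" used here is a hypothetical value, chosen only because it is consistent with the outputs i1, i3, w11, and w31 named above:

```python
def pair_by_connection(third_conn, neurons, weights):
    """Keep only the positions marked '1' in the third connection relationship
    data, pairing each retained input neuron with its corresponding weight."""
    return [(n, w) for bit, n, w in zip(third_conn, neurons, weights) if bit == "1"]

data_sets = pair_by_connection("1010",
                               ["i1", "i2", "i3", "i4"],
                               ["w11", "w21", "w31", "w41"])
print(data_sets)   # [('i1', 'w11'), ('i3', 'w31')] -- one stored data set per pair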
- in this way, the sparse processing unit in the compression mapping circuit 101 performs sparsifying compression on both the input neurons and the weights, further reducing their number, thereby reducing the amount of computation of the arithmetic unit and improving computational efficiency.
- optionally, before the compression mapping circuit 101 performs compression processing on the input data, the compression mapping circuit 101 is further configured to: determine whether each group of input neurons satisfies a first preset condition, the first preset condition including that the number of input neurons in a group whose absolute values are less than or equal to the third threshold is less than or equal to the fourth threshold; and determine whether each group of the N groups of weights satisfies a second preset condition, the second preset condition including that the number of weights in a group whose absolute values are less than or equal to the fifth threshold is less than or equal to the sixth threshold;
- the third threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0 or another value.
- the fourth threshold is related to the number of input neurons in the set of input neurons.
- optionally, the fourth threshold equals the number of input neurons in a group of input neurons minus 1, or the fourth threshold takes another value.
- the fifth threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0 or other values.
- the sixth threshold is related to the number of weights in the set of weights.
- optionally, the sixth threshold equals the number of weights in a group of weights minus 1, or the sixth threshold takes another value.
- the storage circuit is configured to store the input neuron after the compression process, the weight value after the compression process, and related operation instructions.
- the compression mapping circuit shown in FIG. 1g can compress the input data by using the connection relationship data of the input data in the case where the connection relationship data of the input data is known.
- the input data includes at least one input neuron or at least one weight.
- the compression mapping circuit 601 includes:
- the input data buffer unit 6011 is configured to buffer the first input data, where the first input data includes at least one input neuron or at least one weight.
- connection relationship buffer unit 6012 is configured to cache the connection relationship data of the first input data, that is, the connection relationship data of the input neurons or the connection relationship data of the weights.
- the connection relationship data of the input neurons is a character string indicating whether the absolute value of each input neuron is less than or equal to the first threshold; the connection relationship data of the weights is a character string indicating whether the absolute value of each weight is less than or equal to the first threshold, that is, a character string indicating whether there is a connection between the input neuron and the output neuron corresponding to the weight.
- the connection relationship data of the input neuron and the connection relationship data of the weight may be expressed in the form of a direct index or a step index.
- the fourth sparse processing unit 6013 is configured to compress the first input data according to the connection relationship data of the first input data to obtain the compression-processed first input data, and store the compression-processed first input data in the first input buffer unit 605.
- optionally, the fourth sparse processing unit 6013 compresses one input neuron and one piece of connection relationship data per clock cycle, that is, it selects one valid input neuron from S1 input neurons per clock cycle, where S1 is an integer greater than 1.
- alternatively, the fourth sparse processing unit 6013 compresses a plurality of input neurons and a plurality of pieces of connection relationship data per clock cycle, that is, it selects S2 valid input neurons from the S1 input neurons per clock cycle, where S2 is an integer greater than 0 and less than or equal to S1.
- for example, assume the input neurons are i1, i2, i3, and i4, the connection relationship data expressed in the form of a direct index is "1011", and the fourth sparse processing unit 6013 can select one connected (i.e., valid) input neuron from 4 input neurons in one clock cycle.
- the fourth sparse processing unit 6013 acquires the input neurons i1, i2, i3, and i4 and the connection relationship data "1011" from the input data buffer unit 6011 and the connection relationship buffer unit 6012, respectively, and selects the connected input neurons i1, i3, and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "1011".
- since the fourth sparse processing unit 6013 can select one connected (i.e., valid) input neuron from 4 input neurons in one clock cycle, it sequentially outputs the input neurons i1, i3, and i4 over three clock cycles, as shown in Figure 1h.
- the fourth sparse processing unit 6013 stores the input neurons i1, i3, and i4 in the first input buffer unit.
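- The following sketch (an illustration, not the claimed hardware) models the cycle-by-cycle selection just described: neurons marked '1' in a direct-index string are emitted per_cycle at a time.

```python
def select_per_cycle(neurons, conn, per_cycle=1):
    """Yield, cycle by cycle, the connected (valid) neurons chosen according
    to a direct-index string; per_cycle models how many neurons the sparse
    processing unit can select in one clock cycle."""
    valid = [n for n, bit in zip(neurons, conn) if bit == "1"]
    for i in range(0, len(valid), per_cycle):
        yield valid[i:i + per_cycle]       # output of one clock cycle

for cycle, out in enumerate(select_per_cycle(["i1", "i2", "i3", "i4"], "1011"), 1):
    print(f"clock cycle {cycle}: {out}")   # i1, then i3, then i4 over three cycles
```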
- for another example, assume the input neurons are i1, i2, i3, and i4, there are two groups of connection relationship data represented by a direct index, namely "1011" and "0101", and the fourth sparse processing unit 6013 can select two connected (i.e., valid) input neurons from 4 input neurons in one clock cycle.
- the fourth sparse processing unit 6013 selects the connected input neurons i1, i3, and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "1011", and selects the connected input neurons i2 and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "0101".
- since the fourth sparse processing unit 6013 can select two connected (i.e., valid) input neurons from 4 input neurons in one clock cycle, for the connection relationship data "1011" it selects the input neurons i1 and i3 in the first clock cycle and stores them in the first input buffer unit 606, and selects the input neuron i4 in the second clock cycle and stores it in the first input buffer unit 606; for the connection relationship data "0101", the fourth sparse processing unit 6013 selects the input neurons i2 and i4 in one clock cycle, as shown in Fig. 1i.
- the fourth sparse processing unit 6013 stores the input neurons i2 and i4 in the first input buffer unit.
- for another example, assume the input data are the input neurons i1, i2, i3, and i4, the connection relationship data expressed in the form of a step index is "021", and the fourth sparse processing unit 6013 can select one connected (i.e., valid) input neuron from 4 input neurons in one clock cycle.
- the fourth sparse processing unit 6013 acquires the input neurons i1, i2, i3, and i4 and the connection relationship data "021" from the input data buffer unit 6011 and the connection relationship buffer unit 6012, respectively, and selects the connected input neurons i1, i3, and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "021". Since the fourth sparse processing unit 6013 can select one connected (i.e., valid) input neuron from 4 input neurons in one clock cycle, it sequentially outputs the input neurons i1, i3, and i4 over three clock cycles, as shown in Figure 1j. The fourth sparse processing unit 6013 stores the input neurons i1, i3, and i4 in the first input buffer unit.
- for another example, assume the input data are the input neurons i1, i2, i3, and i4, there are two groups of connection relationship data represented by a step index, namely "021" and "22", and the fourth sparse processing unit 6013 can select two connected (i.e., valid) input neurons from the 4 input neurons in one clock cycle.
- the fourth sparse processing unit 6013 selects the connected input neurons i1, i3, and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "021", and selects the connected input neurons i2 and i4 from the input neurons i1, i2, i3, and i4 according to the connection relationship data "22".
- since the fourth sparse processing unit 6013 can select two connected (i.e., valid) input neurons from 4 input neurons in one clock cycle, for the connection relationship data "021" it selects the input neurons i1 and i3 in the first clock cycle and stores them in the first input buffer unit 606, and selects the input neuron i4 in the second clock cycle and stores it in the first input buffer unit 606; for the connection relationship data "22", the fourth sparse processing unit 6013 selects the input neurons i2 and i4 in one clock cycle and outputs them, as shown in Figure 1k.
- the fourth sparse processing unit 6013 stores the input neurons i2 and i4 in the first input buffer unit.
- when the first input data buffered by the input data buffer unit 6011 includes at least one weight and the data cached by the connection relationship buffer unit 6012 is the connection relationship data of the weights, the fourth sparse processing unit 6013 sets the value of the weight between any input neuron and output neuron that have no connection relationship to 0 according to the connection relationship data of the weights, and stores the weights with value 0 together with the at least one weight in the second input buffer unit.
- for example, a weight has the form wij, which represents the weight between the i-th input neuron and the j-th output neuron. Assume the input neurons include i1, i2, i3, and i4, the output neurons include o1, the first input data (weights) are w11, w31, and w41, and the connection relationship data of the first input data (i.e., the weights), expressed in the form of a direct index, is "1011".
- the fourth sparse processing unit 6013 determines from the connection relationship data that there is no connection between the input neuron i2 and the output neuron o1, sets the value of the weight w21 between the input neuron i2 and the output neuron o1 to 0, and stores w11, w21 (= 0), w31, and w41 in the second input buffer unit.
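- A small sketch of the zero-filling behavior described above, assuming weights stored in packed order and a direct-index connection string; the numeric values are hypothetical:

```python
def expand_weights(conn, packed):
    """Re-insert a 0 for every '0' in the weight connection relationship data,
    so the stored weights line up one-to-one with the input neurons."""
    it = iter(packed)
    return [next(it) if bit == "1" else 0.0 for bit in conn]

# w11, w31, w41 packed; "1011" says i2 has no connection to o1:
print(expand_weights("1011", [0.3, 0.7, 0.2]))   # [0.3, 0.0, 0.7, 0.2]
```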
- the first input buffer unit is configured to cache the input neurons after the compression processing.
- the second input buffer unit is configured to buffer the weight of the compression process read from the storage circuit.
- optionally, the fourth sparse processing unit 6013 compresses one weight and one piece of connection relationship data per clock cycle, that is, it selects one valid weight from S3 weights per clock cycle, where S3 is an integer greater than 1.
- alternatively, the fourth sparse processing unit 6013 compresses a plurality of weights and a plurality of pieces of connection relationship data per clock cycle, that is, it selects S4 valid weights from the S3 weights per clock cycle, where S4 is an integer greater than 0 and less than or equal to S3.
- the first input buffer unit is configured to cache the weight of the compression process.
- the second input buffer unit is configured to buffer the compressed input input neurons read from the storage circuit.
- optionally, the compression mapping circuit 601 is further configured to: group the at least one input neuron to obtain M groups of input neurons, M being an integer greater than or equal to 1; determine whether each group of the M groups of input neurons satisfies a first preset condition, the first preset condition including that the number of input neurons in a group whose absolute values are less than or equal to the third threshold is less than or equal to the fourth threshold; when any group of the M groups of input neurons does not satisfy the first preset condition, delete that group of input neurons; group the at least one weight to obtain N groups of weights, N being an integer greater than or equal to 1; determine whether each group of the N groups of weights satisfies a second preset condition, the second preset condition including that the number of weights in a group whose absolute values are less than or equal to the fifth threshold is less than or equal to the sixth threshold; and when any group of the N groups of weights does not satisfy the second preset condition, delete that group of weights.
- the first threshold, the second threshold, the third threshold, the fourth threshold, the fifth threshold, and the sixth threshold may all be stored in the storage circuit or in the first output buffer unit; alternatively, some of these thresholds are stored in the storage circuit and the rest are stored in the first output buffer unit.
- the first input buffer unit, the second input buffer unit, and the output buffer unit may be functional units in the compression mapping circuit or in the main processing circuit, or may be functional units shared with other processing circuits, which is not limited here.
- it should be noted that the connection relationship data of the input neurons and the connection relationship data of the weights may be composed of character strings/matrices of 0s and 1s, where 0 indicates that the absolute value of the input neuron/weight is less than or equal to the first threshold and 1 indicates that the absolute value of the input neuron/weight is greater than the first threshold, independently of the output neurons.
- optionally, the connection relationship data of the weights and/or the connection relationship data of the neurons in the present application may also be expressed in the following formats: List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), Hybrid (HYB), and so on; this application does not elaborate on these.
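- For orientation only, the sketch below builds two of the named formats (COO and CSR) for a small hypothetical weight matrix in plain Python; it is not part of the claimed circuit.

```python
dense = [[0.0, 0.3, 0.0],
         [0.7, 0.0, 0.0],
         [0.0, 0.0, 0.2]]

# Coordinate list (COO): one (row, col, value) triple per nonzero.
coo = [(r, c, v) for r, row in enumerate(dense)
                 for c, v in enumerate(row) if v != 0.0]

# Compressed Sparse Row (CSR): values + column indices, plus row pointers.
values, col_idx, row_ptr = [], [], [0]
for row in dense:
    for c, v in enumerate(row):
        if v != 0.0:
            values.append(v)
            col_idx.append(c)
    row_ptr.append(len(values))

print(coo)                       # [(0, 1, 0.3), (1, 0, 0.7), (2, 2, 0.2)]
print(values, col_idx, row_ptr)  # [0.3, 0.7, 0.2] [1, 0, 2] [0, 1, 2, 3]
```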
- the input neurons and output neurons mentioned in the embodiments of the present application do not refer to the neurons in the input layer and the output layer of the entire neural network, but to the neurons of any two adjacent layers in the network: the neurons in the lower layer of the network feedforward operation are the input neurons, and the neurons in the upper layer of the network feedforward operation are the output neurons.
- that is, for a layer K and the layer K+1, the layer K is called the input layer and its neurons are the above input neurons, and the layer K+1 is called the output layer and its neurons are the above output neurons; except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
- the following provides a method of performing calculations using the apparatus shown in FIG. 1a; the method may be a neural network calculation method, such as the forward operation of a neural network or the training of a neural network. In practical applications, depending on the input data, the operation may be a matrix-multiply-matrix operation, a convolution operation, an activation operation, a transformation operation, and the like, all of which can be implemented using the apparatus shown in FIG. 1a.
- the control circuit of the main processing circuit transmits the data to the basic processing circuit through the branch processing circuit; wherein the branch processing circuit can first compress the data through the compression mapping circuit and then forward the data to the basic processing circuit.
- the compression mapping circuit of the branch processing circuit compresses the data and then transmits the compressed data to the basic processing circuit; the advantages are that the amount of transmitted data is reduced, the total number of bits transmitted is reduced, and the basic processing circuit performs data operations more efficiently with lower power consumption.
- optionally, the branch processing circuit may receive the data, compress it through the compression mapping circuit, and then perform the calculation; for example, the branch processing circuit receives sparse data transmitted by the main processing circuit, the compression mapping circuit compresses the data, and the compressed data is then sent to the inner product operator circuit, vector operator circuit, or accumulator circuit of the basic processing circuit for calculation, thereby improving operation efficiency and reducing power consumption.
- the main processing circuit transmits the data to be calculated to all or some of the basic processing circuits; taking matrix-multiply-vector calculation as an example, the control circuit of the main processing circuit can split the matrix data by treating each column as one piece of basic data; for example, an m*n matrix can be split into n vectors of m rows each, and the control circuit of the main processing circuit distributes the split n vectors of m rows to the plurality of basic processing circuits.
- the control circuitry of the main processing circuit can broadcast the vector as a whole to each of the underlying processing circuits.
- for example, if a vector to be broadcast has 1000 rows, it can be divided into 2 vectors: the first 500 rows form the first vector and the last 500 rows form the second vector; the control circuit then broadcasts the 2 vectors to the plurality of basic processing circuits in 2 broadcasts.
- the manner of data transmission may be broadcast or distribution, or any other possible transmission method
- after receiving the data, the basic processing circuit performs an operation to obtain an operation result
- the basic processing circuit transmits the operation result back to the main processing circuit
- the operation result may be an intermediate operation result or a final operation result.
- the matrix multiplication vector can be an inner product of each row in the matrix with the vector, and the results are placed into a vector in the order of the corresponding rows.
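- A minimal software sketch of this matrix-multiply-vector scheme, assuming a round-robin distribution of rows over K basic processing circuits (the distribution policy is an assumption; the text only requires that rows be distributed):

```python
K = 2
S = [[1, 2], [3, 4], [5, 6]]     # matrix S, one inner product per row
P = [10, 1]                      # vector P, broadcast to every circuit

rows_of = {i: [] for i in range(K)}            # distribution step
for r, row in enumerate(S):
    rows_of[r % K].append((r, row))

result = [0] * len(S)
for i in range(K):                             # each basic processing circuit
    for r, row in rows_of[i]:
        result[r] = sum(a * b for a, b in zip(row, P))   # row-wise inner product

print(result)                    # [12, 34, 56] == S @ P, in row order
```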
- the neural network computing device has K basic processing circuits:
- FIG. 2 provides a method for implementing a matrix multiplication vector, which may specifically include:
- Step S201, the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received distribution data in the on-chip buffer and/or register of the basic processing circuit;
- optionally, when the device includes a branch circuit, the branch circuit includes a compression mapping circuit.
- specifically, the control circuit of the main processing circuit compresses each row of data of the input matrix S (M rows and L columns) through the branch processing circuit, distributes the compressed rows to one of the K basic processing circuits, and the basic processing circuit stores the received distribution data in its on-chip cache and/or register.
- specifically, the branch processing circuit can receive the input matrix S1 (M1 rows and L1 columns) distributed by the main processing circuit, where M1 is less than or equal to M and L1 is less than or equal to L; that is, S1 is part of S, namely the distribution data block described above. Further, the compression mapping circuit of the branch processing circuit compresses each row of data in the input matrix S1 (M1 rows, L1 columns) to obtain the compression-processed matrix S2 (M2 rows, L2 columns), and then forwards the compression-processed matrix S2 to the basic processing circuit, where M is greater than or equal to M1 and greater than or equal to M2, and L is greater than or equal to L1 and greater than or equal to L2.
- for example, the compression mapping circuit culls the data equal to a specified value (such as 0) and/or the data smaller than a preset threshold (such as 0.1) in the input matrix S2 and the matrix P2; in a specific implementation, the culling may be performed based on mask matrices corresponding to the matrix S2 and the matrix P2, for example by culling the data in S2/P2 at the positions where the corresponding data in the mask matrix is 0.
- the matrix S and the matrix P herein may also be correspondingly understood as input neurons (also referred to as input neuron matrices) and weights (also referred to as weight matrices) in the foregoing embodiments.
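- A hedged sketch of the mask-based culling described above, with hypothetical matrix contents and threshold:

```python
THRESHOLD = 0.1                      # preset threshold from the text's example
S2 = [[0.0, 0.4], [0.05, 0.9]]       # hypothetical data to be compressed

# Mask: 0 where the value is the specified value (0) or below the threshold.
mask = [[1 if abs(v) >= THRESHOLD else 0 for v in row] for row in S2]
culled = [[v for v, m in zip(row, mrow) if m == 1]
          for row, mrow in zip(S2, mask)]

print(mask)     # [[0, 1], [0, 1]]
print(culled)   # [[0.4], [0.9]] -- only the retained data is transmitted
```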
- control circuit of the main processing circuit respectively distributes one row of data of the S matrix to the K basic processing circuits
- the control circuitry of the main processing circuit distributes one or more rows of data in the S matrix to each of the underlying processing circuits.
- the set of rows of S distributed to the i-th basic processing circuit is called Ai, comprising Mi rows in total; Figure 2c shows the calculation to be performed on the i-th basic processing circuit.
- the received distribution data, such as the matrix Ai, may be stored in the register and/or on-chip buffer of the i-th basic processing circuit.
- Step S202 the control circuit of the main processing circuit transmits each part of the vector P to the K basic processing circuits in a broadcast manner;
- optionally, when the device includes a branch circuit, the branch circuit includes a compression mapping circuit.
- specifically, the control circuit of the main processing circuit compresses each part of the input vector P (of length L) through the corresponding branch processing circuit and then transmits it to the K basic processing circuits in a broadcast manner;
- the branch processing circuit can receive the input vector P1 (length L1) distributed by the main processing circuit, wherein L1 is less than or equal to L.
- P1 belongs to a part of P, which is the broadcast data block described above.
- the compression mapping circuit of the branch processing circuit compresses the data in the input vector P1 (length L1) to obtain the compression-processed vector P2 (length L2).
- the compressed vector P2 is then forwarded to the base processing circuit.
- L2 is less than or equal to L1 and less than or equal to L.
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the vector P obtained this time to complete the inner product operations corresponding to each row of the matrix Ai.
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the vector P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai in stages; the advantage is that the amount of vector-P data transmitted in a single transmission inside the basic processing circuit is reduced, the required capacity of the basic processing circuit's buffer and/or register can be lowered, execution efficiency is improved, transmission power consumption is reduced, and cost is reduced.
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the vector P obtained each time to complete the inner product operations corresponding to each row of the matrix Ai; the advantage is that the amount of data transmitted from the main processing circuit to the basic processing circuit is reduced, the amount of data transmitted inside the basic processing circuit is also reduced, execution efficiency is improved, and transmission power consumption is reduced.
- Step S203, the inner product operator circuit of each of the K basic processing circuits calculates the inner product of the data of the matrix S and the vector P; for example, the i-th basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;
- optionally, the basic processing circuit may first compress the data of the matrix S and the vector P using the compression mapping circuit in the basic processing circuit, and then calculate the inner product of the compression-processed data of the matrix S and the vector P using the inner product operator circuit.
- for example, the compression mapping circuit compresses the input matrix S (M1 rows, L1 columns) to obtain the compression-processed matrix S (M rows, L columns); that is, the data in the input matrix S and the vector P that equals a specified value (such as 0) and/or is smaller than a preset threshold (such as 0.1) is culled. In a specific implementation, the culling may be performed based on mask matrices corresponding to the matrix S and the vector P, for example by culling the data in S/P at the positions where the corresponding data in the mask matrix is 0.
- the matrix S and the matrix P herein may also be correspondingly understood as input neurons (also referred to as input neuron matrices) and weights (also referred to as weight matrices) and the like in the foregoing embodiments.
- Step S204 The accumulator circuit of the K basic processing circuits accumulates the result of the inner product operation to obtain an accumulated result, and transmits the accumulated result to the main processing circuit in a fixed point type.
- in an alternative, each basic processing circuit may transmit the partial sum obtained from each inner product operation (a partial sum is a part of the accumulated result; for example, if the accumulated result is F1*G1+F2*G2+F3*G3+F4*G4+F5*G5, a partial sum may be the value of F1*G1+F2*G2+F3*G3) back to the main processing circuit for accumulation; the advantage is that the amount of computation inside the basic processing circuit is reduced and the computational efficiency of the basic processing circuit is improved.
- in an alternative, the partial sums obtained from the inner product operations performed by each basic processing circuit may be stored in the register and/or on-chip buffer of the basic processing circuit and transferred back to the main processing circuit after the accumulation is completed; the advantage is that the amount of data transmitted between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, and data transmission power consumption is reduced.
- in an alternative, the partial sums may in some cases be stored in the register and/or on-chip buffer of the basic processing circuit for accumulation, and in some cases be transmitted to the main processing circuit for accumulation, with the result transferred back to the main processing circuit after the accumulation is completed; the advantages are that the amount of data transmitted between the basic processing circuit and the main processing circuit is reduced, operation efficiency is improved, data transmission power consumption is reduced, the amount of computation inside the basic processing circuit is reduced, and the computational efficiency of the basic processing circuit is improved.
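- The three accumulation options of step S204 can be illustrated in a few lines; the operands F and G below are hypothetical:

```python
F = [1, 2, 3, 4, 5]
G = [5, 4, 3, 2, 1]

# Option 1: every product is sent back; the main processing circuit accumulates.
main_acc = sum(f * g for f, g in zip(F, G))

# Option 2: the basic processing circuit accumulates locally in its
# register/on-chip buffer and sends a single final value back.
local_acc = 0
for f, g in zip(F, G):
    local_acc += f * g

# Option 3: partial local accumulation (e.g. F1*G1+F2*G2+F3*G3),
# with the remainder accumulated by the main processing circuit.
partial = sum(f * g for f, g in zip(F[:3], G[:3]))
mixed = partial + sum(f * g for f, g in zip(F[3:], G[3:]))

assert main_acc == local_acc == mixed == 35   # all three agree
```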
- the neural network computing device has K basic processing circuits:
- Step S201b the control circuit of the main processing circuit distributes each row of data in the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received data in an on-chip buffer and/or a register;
- optionally, the branch processing circuit is provided with a compression mapping circuit; the control circuit of the main processing circuit compresses each row of data in the matrix S through the branch processing circuit and then distributes it to the K basic processing circuits, and the basic processing circuit saves the received data in its on-chip buffer and/or register;
- the control circuit of the main processing circuit compresses each row of the input matrix S (M rows and L columns) through the branch processing circuit and distributes it to one of the K basic processing circuits.
- the branch processing circuit can receive the input matrix S1 (M1 row L1 column) distributed by the main processing circuit, wherein M1 is less than or equal to M, and L1 is less than or equal to L.
- further, the compression mapping circuit of the branch processing circuit compresses each row of data in the input matrix S1 (M1 rows, L1 columns) to obtain the compression-processed matrix S2 (M2 rows, L2 columns).
- the compressed processed matrix S2 is then forwarded to the corresponding base processing circuit.
- M is greater than or equal to M1 and greater than or equal to M2.
- L is greater than or equal to L1 and greater than or equal to L2.
- for example, the compression mapping circuit culls the data equal to a specified value (such as 0) and/or the data smaller than a preset threshold (such as 0.1) in the input matrix S2 and the matrix P2; in a specific implementation, the culling may be performed based on mask matrices corresponding to the matrix S2 and the matrix P2, for example by culling the data in S2/P2 at the positions where the corresponding data in the mask matrix is 0.
- the matrix S and the matrix P herein may also be correspondingly understood as input neurons (also referred to as input neuron matrices) and weights (also referred to as weight matrices) in the foregoing embodiments.
- control circuit of the main processing circuit respectively distributes one row of the S matrix to the M basic processing circuits
- control circuitry of the main processing circuit distributes one or more rows of data in the S matrix to each of the underlying processing circuits.
- Mi rows of S are distributed to the i-th basic processing circuit, and the set of these Mi rows is called Ai; Figure 2e shows the calculation to be performed on the i-th basic processing circuit.
- each basic processing circuit, for example the i-th basic processing circuit, stores the received matrix Ai distributed by the main processing circuit in its register and/or on-chip buffer; the advantage is that subsequent data transmission is reduced, calculation efficiency is improved, and power consumption is reduced.
- Step S202b the control circuit of the main processing circuit transmits each part of the matrix P to each basic processing circuit in a broadcast manner;
- optionally, the branch processing circuit is provided with a compression mapping circuit; the control circuit of the main processing circuit compresses each part of the matrix P through the branch processing circuit and then transmits it to each basic processing circuit in a broadcast manner;
- the branch processing circuit can receive the input vector P1 (length L1) distributed by the main processing circuit, wherein L1 is less than or equal to L.
- P1 belongs to a part of P, which is the broadcast data block described above.
- further, the compression mapping circuit of the branch processing circuit compresses the data in the input vector P1 (length L1) to obtain the compression-processed vector P2 (length L2).
- the compressed vector P2 is then forwarded to the base processing circuit.
- L2 is less than or equal to L1 and less than or equal to L.
- in an alternative, each part of the matrix P can be broadcast only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operations corresponding to each row of the matrix Ai. Multiplexing in this embodiment means that the basic processing circuit uses the same data repeatedly in calculation; for example, multiplexing the data of the matrix P means that the data of the matrix P is used many times.
- in an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the registers or on-chip buffers of the respective basic processing circuits multiple times, and the i-th basic processing circuit does not multiplex the data of the matrix P obtained each time, completing the inner product operations corresponding to each row of the matrix Ai in stages;
- in an alternative, the control circuit of the main processing circuit may broadcast the parts of the matrix P to the registers or on-chip buffers of the respective basic processing circuits multiple times, and the i-th basic processing circuit partially multiplexes the data of the matrix P obtained each time to complete the inner product operations corresponding to each row of the matrix Ai;
- each of the basic processing circuits calculates an inner product of the data of the matrix Ai and the data of the matrix P;
- Step S203b the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit.
- optionally, when a compression mapping circuit is disposed in the basic processing circuit, the basic processing circuit may first compress the matrix S and the matrix P using the compression mapping circuit, and then calculate the inner product of the compression-processed matrix S and matrix P using the inner product operator circuit. For example, the compression mapping circuit compresses the input matrix S (M1 rows, L1 columns) and the input matrix P (L1 rows, N1 columns) to obtain the compression-processed matrix S (M rows, L columns) and matrix P (L rows, N columns); the operator of the basic processing circuit then performs inner product operations on the compression-processed matrix S and matrix P to obtain the result of the inner product operation.
- that is, the data in the input matrix S and the matrix P that equals a specified value (such as 0) and/or is smaller than a preset threshold (such as 0.1) is culled; in a specific implementation, the culling may be performed based on mask matrices corresponding to the matrix S and the matrix P, for example by culling the data in S/P at the positions where the corresponding data in the mask matrix is 0.
- the matrix S and the matrix P herein may also be correspondingly understood as input neurons (also referred to as input neuron matrices) and weights (also referred to as weight matrices) in the foregoing embodiments.
- the basic processing circuit may accumulate the portion obtained by performing the inner product operation each time and transmit back to the main processing circuit;
- the portion obtained by the inner product operation performed by each basic processing circuit may be stored in a register and/or an on-chip buffer of the basic processing circuit, and then accumulated and returned to the main processing circuit;
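- A minimal sketch of the matrix-multiply-matrix flow of steps S201b-S203b, with the matrix P broadcast to every circuit and the rows of S distributed round-robin (the distribution policy is an assumption; the text only requires that rows be distributed):

```python
K = 2
S = [[1, 0], [0, 2], [3, 1]]          # M x L, rows distributed as sets Ai
P = [[1, 2, 3], [4, 5, 6]]            # L x N, broadcast and multiplexed

rows_of = {i: [] for i in range(K)}
for r, row in enumerate(S):
    rows_of[r % K].append((r, row))

out = [[0] * len(P[0]) for _ in S]
for i in range(K):                    # each circuit reuses P for all its rows
    for r, row in rows_of[i]:
        for c in range(len(P[0])):
            out[r][c] = sum(row[l] * P[l][c] for l in range(len(row)))

print(out)   # [[1, 2, 3], [8, 10, 12], [7, 11, 15]] == S @ P
```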
- according to the usage method of the device, the weight matrix of the fully connected layer can be used as the matrix S and the input vector as the vector P, performing the matrix-multiply-vector operation shown in FIG. 2; alternatively, the weight matrix of the fully connected layer can be used as the matrix S and the input data of the fully connected layer as the matrix P, performing the matrix-multiply-matrix operation of the device.
- Step S301, the control circuit of the main processing circuit distributes the weights of each convolution kernel in the convolution layer weights to one of the K basic processing circuits, where they are stored in the on-chip buffer and/or register of the basic processing circuit;
- optionally, when the branch processing circuit includes a compression mapping circuit, the control circuit of the main processing circuit compresses the weights of each convolution kernel in the convolution layer weights through the branch processing circuit and then distributes them to one of the K basic processing circuits, where they are stored in the on-chip buffer and/or register of the basic processing circuit. Specifically, the branch processing circuit may use its compression mapping circuit to compress the weights of each convolution kernel in the convolution layer weights, obtaining the compression-processed weights of each convolution kernel, and then forward them to the basic processing circuit for operation.
- the control circuit of the main processing circuit respectively distributes the weights of one convolution kernel to the M basic processing circuits;
- control circuitry of the main processing circuit distributes the weights of one or more convolution kernels to each of the base processing circuits.
- a total of Mi convolution kernels are distributed to the i-th base processing circuit, and the set of these convolution kernel weights is called Ai.
- each of the basic processing circuits such as the i-th basic processing circuit:
- the received convolution kernel weight Ai distributed by the main processing circuit is stored in its register and/or on-chip buffer;
- Step S302 the control circuit of the main processing circuit transmits each part of the input data P to each basic processing circuit in a broadcast manner;
- optionally, when the branch processing circuit includes a compression mapping circuit, the control circuit of the main processing circuit compresses each part of the input data P through the corresponding branch processing circuit and then forwards it to each basic processing circuit in a broadcast manner, which is not repeated here.
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the input data P only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the input data P obtained this time to complete the inner product operation corresponding to each convolution kernel in Ai;
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the input data P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit does not multiplex the data of the input data P obtained each time, completing the inner product operations corresponding to each convolution kernel in Ai in stages;
- in an alternative, the control circuit of the main processing circuit may broadcast each part of the input data P to the register or on-chip buffer of each basic processing circuit multiple times, and the i-th basic processing circuit partially multiplexes the data of the input data P obtained each time to complete the inner product operations corresponding to each convolution kernel in Ai;
- each basic processing circuit calculates the inner product of the convolution kernels and the data of the input data P; for example, the i-th basic processing circuit calculates the inner product of each convolution kernel in Ai and the data of the input data P;
- optionally, the compression mapping circuit in the basic processing circuit may first compress the convolution kernels and the input data P, and the inner product operator circuit then calculates the inner product of the compression-processed convolution kernels and the compression-processed input data P; for example, the i-th basic processing circuit calculates the inner product of each compression-processed convolution kernel in Ai and the compression-processed data of the input data P.
- Step S304 the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits it back to the main processing circuit:
- in an alternative, the basic processing circuit may transmit the partial sums obtained from each inner product operation back to the main processing circuit for accumulation;
- in an alternative, the basic processing circuit may store the partial sums obtained from each inner product operation in its register and/or on-chip buffer and transfer them back to the main processing circuit after the accumulation is completed;
- in an alternative, the basic processing circuit may in some cases store the partial sums obtained from each inner product operation in its register and/or on-chip buffer for accumulation, and in some cases transmit them to the main processing circuit for accumulation, transferring the result back to the main processing circuit after the accumulation is completed;
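- For illustration, the sketch below models the convolution flow of steps S301-S304 with 1-D input data and 1x3 kernels; shapes and values are hypothetical, and the per-circuit assignment of kernels is an assumption:

```python
kernels = [[1, 0, -1], [0, 1, 0]]     # kernel weights distributed as sets Ai
P = [3, 1, 4, 1, 5]                   # input data P, broadcast to every circuit

def conv1d(kernel, data):
    """Inner products of one kernel with each sliding window of the input."""
    k = len(kernel)
    return [sum(kernel[j] * data[i + j] for j in range(k))
            for i in range(len(data) - k + 1)]

for ker in kernels:                   # one circuit's work per kernel here
    print(conv1d(ker, P))             # [-1, 0, -1] and [1, 4, 1]
```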
- the vector updater circuit of the main processing circuit is used to implement the weight update function in the neural network training process.
- the weight update refers to a method of updating the weight using the gradient of the weight.
- in an alternative, the vector operator circuit of the main processing circuit adds or subtracts the two vectors, the weights and the weight gradients, to obtain an operation result, and the operation result is the updated weights.
- in an alternative, the vector operator circuit of the main processing circuit multiplies or divides the weights and the weight gradients by a number to obtain intermediate weights and intermediate weight gradient values, and then adds or subtracts the intermediate weights and the intermediate weight gradient values to obtain an operation result, which is the updated weights.
- in an alternative, a set of momentum values may first be calculated using the weight gradients, and then the momentum values and the weights are added or subtracted to obtain the updated weights.
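- A hedged sketch of the update rules just described: a plain gradient step (vector add/subtract), a scaled step (multiply by a number, i.e. a learning rate), and a momentum variant. The learning rate and momentum coefficient are hypothetical values:

```python
def sgd_step(w, grad, lr=0.1):
    """Scaled gradient step: multiply by a number, then subtract."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]

def momentum_step(w, grad, velocity, lr=0.1, mu=0.9):
    """Momentum variant: accumulate momentum from gradients, then update weights."""
    velocity = [mu * vi + gi for vi, gi in zip(velocity, grad)]
    w = [wi - lr * vi for wi, vi in zip(w, velocity)]
    return w, velocity

w, v = [0.5, -0.2], [0.0, 0.0]
w, v = momentum_step(w, [0.1, -0.3], v)
print(w, v)   # [0.49, -0.17] [0.1, -0.3]
```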
- the application also provides a chip, the chip comprising a computing device, the computing device comprising:
- the data involved in the main processing circuit may be compressed data, and in an optional embodiment, the compressed data includes at least one input neuron or at least one weight.
- each of the at least one input neuron is greater than the first threshold, or each of the at least one weight is greater than the second threshold.
- the first threshold and the second threshold are customized by the user side, and they may be the same or different.
- in an alternative, the main processing circuit includes a compression mapping circuit; in an alternative, the main processing circuit includes an arithmetic unit that performs data compression processing, such as a vector arithmetic unit; specifically, the main processing circuit includes a data input interface that receives input data;
- the computing device further includes: a branch processing circuit, wherein the data involved in the branch processing circuit may be compressed processed data.
- the compression-processed data includes at least one input neuron or at least one weight, each of the at least one input neuron being greater than the first threshold or each of the at least one weight being greater than the second threshold.
- the first threshold and the second threshold are customized by the user side, and they may be the same or different.
- the branch processing circuit comprises a compression mapping circuit
- the branch processing circuit includes an arithmetic unit that performs data compression processing, such as a vector operation unit, etc.; specifically, a data input interface that receives input data;
- the source of the received data may be: the outside of the neural network operation circuit device, or some or all of the basic processing circuits of the neural network operation circuit device;
- there may be multiple data input interfaces; specifically, the circuit may include a data output interface for outputting data;
- the destination of the output data may be: the outside of the neural network computing device, or some or all of the basic processing circuits of the neural network computing circuit device;
- there may be multiple data output interfaces;
- the branch processing circuit includes an on-chip buffer and/or a register
- the branch processing circuit includes an operation unit, and can perform a data operation
- the branch processing circuit includes an arithmetic operation unit
- in an alternative, the branch processing circuit includes a vector operation unit and can operate on a group of data at the same time; specifically, the arithmetic operation and/or vector operation can be any type of operation, including but not limited to: addition, subtraction, multiplication, and division of two numbers; addition, subtraction, multiplication, and division of a number and a constant; exponential, power, logarithm, and various nonlinear operations on a number; comparison and logical operations on two numbers; addition, subtraction, multiplication, and division of two vectors; addition, subtraction, multiplication, and division of each element in a vector by a constant; exponential, power, logarithm, and various nonlinear operations on each element in a vector; comparison and logical operations on each pair of corresponding elements of two vectors; and the like.
- the main processing circuit includes a data rearranging unit for transmitting data to the basic processing circuit in a certain order, or rearranging the data in place in a certain order;
- the order of data arrangement includes: performing a dimension-order transformation on a multi-dimensional data block; the order of data arrangement may further include: segmenting a data block and sending the segments to different basic processing circuits.
- the computing device also includes a plurality of basic processing circuits; each basic processing circuit is configured to calculate the inner product of two vectors. The calculation method is that the basic processing circuit receives two sets of numbers, multiplies the corresponding elements of the two sets, and accumulates the multiplication results; the inner product result is transmitted out, and depending on the position of the basic processing circuit, it may be transmitted to other basic processing circuits or directly to the main processing circuit.
- the data involved in the basic processing circuit may be compressed processed data.
- the compression-processed data includes at least one input neuron or at least one weight, each of the at least one input neuron being greater than the first threshold or each of the at least one weight being greater than the second threshold.
- the first threshold and the second threshold are customized by the user side, and they may be the same or different.
- the basic processing circuit comprises a compression mapping circuit
- the basic processing circuit includes a vector operation unit that performs data compression processing
- in an alternative, the basic processing circuit includes a storage unit composed of an on-chip buffer and/or a register;
- in an alternative, the basic processing circuit includes one or more data input interfaces for receiving data;
- in an alternative, two data input interfaces are included, and one or more data can be obtained from each of the two data input interfaces at a time;
- the basic processing circuit may receive the input data from the data input interface and save it in a register and/or an on-chip buffer;
- the source of the data input interface receiving data may be: other basic processing circuits and/or main processing circuits.
- the neural network operation circuit device has a plurality of basic processing circuits
- in an alternative, the basic processing circuit includes one or more data output interfaces for transmitting output data;
- one or more data can be transmitted through a data output interface;
- the data transmitted through the data output interface may be one of, or any combination of: data received from the data input interface, data stored in the on-chip buffer and/or register, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.
- in an alternative, three data output interfaces are included, two of which correspond to the two data input interfaces, each outputting the data received from the corresponding data input interface, and the third data output interface is responsible for outputting the operation results;
- the data output interface may transmit data in a certain direction; the data sources and data directions here determine the connection relationships of the basic processing circuits in the device.
- in an alternative, the basic processing circuit includes an arithmetic operation circuit, which includes one of, or any combination of: one or more multiplier circuits, one or more accumulator circuits, and one or more circuits that perform inner product operations on two sets of data.
- in an alternative, a multiplication of two numbers can be performed, and the result can be stored in the on-chip buffer and/or register, or directly accumulated into the register and/or on-chip buffer;
- in an alternative, an inner product operation of two sets of data can be performed, and the result can be stored in the on-chip buffer and/or register, or directly accumulated into the register and/or on-chip buffer;
- in an alternative, a data accumulation operation can be performed, accumulating the data into the on-chip buffer and/or register;
- the data accumulated by the accumulator circuit may be one of, or any combination of: data received from the data input interface, data stored in the on-chip buffer and/or register, a multiplier operation result, an accumulator operation result, or an inner product operator operation result.
- the "data input interface" and "data output interface" used in the above description of the basic processing circuit refer to the data input and output interfaces of each basic processing circuit, rather than the data input and output interfaces of the entire device.
- an integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits;
- the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality of basic processing circuits, the k basic processing circuits being: the n basic processing circuits in the first row and the m basic processing circuits in the first column;
- Some or all of the plurality of basic processing circuits include: a compression mapping circuit for performing compression processing of each data in the neural network operation;
- the main processing circuit is configured to perform each successive operation in the neural network operation and to transmit data with the k basic processing circuits;
- the k basic processing circuits are used for data forwarding between the main processing circuit and the plurality of basic processing circuits;
- the part or all of the basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform the operations in the neural network in parallel according to the compressed transmission data, and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit.
- the plurality of basic processing circuits each include a compression mapping circuit
- the plurality of basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform the operations in the neural network in parallel according to the compressed transmission data, and to transmit the operation result to the main processing circuit through the k basic processing circuits.
- the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the k basic processing circuits; and to broadcast the broadcast data block to the k basic processing circuits;
- the plurality of basic processing circuits are configured to start the compression mapping circuit to compress the basic data block and the broadcast data block according to the received basic data block, broadcast data block, and operation instruction, to perform an inner product operation on the compressed basic data block and the compressed broadcast data block to obtain an operation result, and to transmit the operation result to the main processing circuit through the k basic processing circuits;
- the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be calculated and the operation instruction; wherein the distribution data block and the broadcast data block include at least one input neuron or at least one weight.
- the k basic processing circuits of the plurality of basic processing circuits each include a compression mapping circuit
- the k basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, and to transmit the compressed transmission data to the basic processing circuits connected to the k basic processing circuits;
- the plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the compression-processed transmission data, and to transmit the operation result to the main processing circuit through the basic processing circuits connected to the main processing circuit.
- the main processing circuit is configured to acquire a data block to be calculated and an operation instruction, and to divide the data block to be calculated into a distribution data block and a broadcast data block according to the operation instruction; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the k basic processing circuits; and to broadcast the broadcast data block to the k basic processing circuits;
- the k basic processing circuits are configured to start the compression mapping circuit to compress the basic data block and the broadcast data block according to the received basic data block, broadcast data block, and operation instruction, and then to transmit them to the basic processing circuits connected to the k basic processing circuits;
- the plurality of basic processing circuits are configured to perform an inner product operation on the compressed basic data block and the compressed broadcast data block to obtain an operation result, and to send the operation result to the main processing circuit;
- the main processing circuit is configured to process the operation result to obtain the instruction result of the data block to be calculated and the operation instruction; wherein the distribution data block and the broadcast data block include at least one input neuron or at least one weight.
- FIG. 4a shows an integrated circuit chip device according to the present disclosure.
- the integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits, the plurality of basic processing circuits being arranged in an array (an m*n array), wherein the values of m and n are integers greater than or equal to 1, and at least one of m and n is greater than or equal to 2.
- each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality of basic processing circuits, the k basic processing circuits being: the n basic processing circuits in the first row, the n basic processing circuits in the m-th row, and the m basic processing circuits in the first column.
- the main processing circuit and/or the plurality of basic processing circuits may include compression mapping circuits; alternatively, only a part of the plurality of basic processing circuits may include compression mapping circuits. For example, the k basic processing circuits may be configured with compression mapping circuits, so that the n basic processing circuits of the first row can each be responsible for the data compression step for the data of the m basic processing circuits of their column. This arrangement can improve operation efficiency and reduce power consumption: since the n basic processing circuits of the first row receive the data transmitted by the main processing circuit first, compressing the received data there reduces the amount of calculation of the subsequent basic processing circuits and the amount of data transmitted to them. Similarly, configuring compression mapping circuits in the m basic processing circuits of the first column has the advantages of a small amount of calculation and low power consumption.
- the main processing circuit can adopt a dynamic data transmission strategy. For example, the main processing circuit broadcasts data to the m basic processing circuits of the first column and distributes data to the n basic processing circuits of the first row. The advantage is that different kinds of data are transferred into a basic processing circuit through different data input ports, so the basic processing circuit does not need to be told which kind of data it has received; it only needs to determine from which receiving port the data arrived to know what kind of data it is.
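The port-based identification can be pictured with the following minimal sketch (the class and port names are assumptions for illustration):

```python
class BasicCircuit:
    """The receiving port alone tells the circuit which kind of data arrived."""
    def __init__(self):
        self.s_data, self.p_data = [], []

    def on_receive(self, port, data):
        if port == "input0":            # horizontal port: distribution data
            self.s_data.append(data)
        elif port == "input1":          # vertical port: broadcast data
            self.p_data.append(data)

c = BasicCircuit()
c.on_receive("input0", [1, 2])          # fragment of the distributed matrix
c.on_receive("input1", [3, 4])          # fragment of the broadcast data
print(c.s_data, c.p_data)               # [[1, 2]] [[3, 4]]
```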
- the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data with the basic processing circuits connected to it; the continuous operations include but are not limited to: accumulation operations, ALU operations, activation operations, and the like.
- the plurality of basic processing circuits are configured to perform an operation in the neural network in a parallel manner according to the transmitted data, and transmit the operation result to the main processing circuit through a basic processing circuit connected to the main processing circuit.
- the above-described parallel execution of operations in the neural network includes, but is not limited to, inner product operations, matrix or vector multiplication operations, and the like.
- the plurality of basic processing circuits may first perform compression processing on the transmitted data, and then perform operations in the neural network in a parallel manner according to the compressed processed data.
- the main processing circuit may include: a data transmitting circuit, a data receiving circuit or an interface, and the data transmitting circuit may integrate the data distributing circuit and the data broadcasting circuit.
- the data distributing circuit and the data broadcasting circuit may also be separately set.
- broadcast data is data that needs to be sent to every basic processing circuit.
- distribution data needs to be selectively sent to a part of the basic processing circuits. Specifically, taking the convolution operation as an example, the convolution input data needs to be sent to all basic processing circuits, so the convolution input data is broadcast data, while the convolution kernels need to be selectively sent to a part of the basic processing circuits, so the convolution kernels are distribution data.
- the manner in which the distribution data is selected and sent to particular basic processing circuits can be determined by the main processing circuit according to the load and other allocation methods.
- the broadcast data is transmitted to each of the basic processing circuits in a broadcast form.
- the broadcast data is sent to each of the basic processing circuits by means of one broadcast, and the broadcast data may also be sent to each of the basic processing circuits by means of multiple broadcasts.
- the specific embodiment of the present disclosure does not limit the above.
- the number of broadcasts is not limited; for the distribution transmission method, the distribution data is selectively transmitted to a part of the basic processing circuits.
- the accumulator circuits of the n basic processing circuits of the m-th row can perform the accumulation operation of the inner product operation, because each basic processing circuit of the m-th row can receive the products of all the basic processing circuits of its column.
- performing the accumulation operation of the inner product operation with the n basic processing circuits of the m-th row allows computing resources to be allocated effectively and saves power consumption.
- this technical solution is especially suitable when m is large.
- the main processing circuit can allocate which circuits perform compression, specifically in an explicit or an implicit manner. In the explicit manner, the main processing circuit can be configured with a special instruction or indication: when a basic processing circuit receives the special indication or instruction, it determines that data compression is to be performed; if the basic processing circuit does not receive the special indication or instruction, it determines that data compression is not to be performed. In the implicit manner, for example, when the basic processing circuit receives data whose type is sparse (i.e., containing zeros, or containing more than a preset number of values smaller than a preset threshold) and determines that an inner product operation needs to be performed, the sparse data will be compressed.
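A minimal sketch of the implicit trigger, assuming the sparsity test is a simple ratio of near-zero values (the thresholds are illustrative, not prescribed by the disclosure):

```python
def should_compress(block, value_threshold=0.1, count_threshold=0.5):
    """Treat the received block as sparse data when the fraction of zero or
    near-zero values exceeds a preset ratio."""
    near_zero = sum(1 for x in block if abs(x) < value_threshold)
    return near_zero / len(block) > count_threshold

print(should_compress([0, 0, 0.05, 1.2]))  # True: 3 of 4 values are near zero
```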
- the special instruction or indication may be configured with a decrementing sequence. The value of the decrementing sequence is decremented by 1 each time it passes through a basic processing circuit; the basic processing circuit reads the value of the decrementing sequence, and if the value is greater than zero, it performs data compression, while if the value is equal to or less than zero, it does not perform data compression.
- this arrangement configures compression according to the basic processing circuits allocated in the array. For example, for the m basic processing circuits of the i-th column, if the main processing circuit needs the first five basic processing circuits to perform compression, it issues a special instruction containing a decrementing sequence whose initial value is 5. The value is decremented by 1 at each basic processing circuit; at the 5th basic processing circuit the value of the decrementing sequence is 1, and at the 6th basic processing circuit the decrementing sequence is 0, at which point the 6th basic processing circuit does not perform data compression. In this way, the main processing circuit can dynamically configure which basic processing circuits perform data compression.
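A minimal sketch of the decrementing-sequence mechanism, assuming the sequence value is simply handed from circuit to circuit down one column (class and function names are illustrative):

```python
class BasicCircuit:
    def __init__(self):
        self.do_compress = False

def configure_column(column_circuits, initial=5):
    """Pass a decrementing counter down a column: a circuit compresses
    while the counter is greater than zero; once it reaches zero, the
    remaining circuits skip compression."""
    counter = initial
    for circuit in column_circuits:
        circuit.do_compress = counter > 0
        counter -= 1

column = [BasicCircuit() for _ in range(8)]
configure_column(column, initial=5)
print([c.do_compress for c in column])  # first five True, the rest False
```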
- An embodiment of the present disclosure provides an integrated circuit chip device including a main processing circuit (which may also be referred to as a main unit) and a plurality of basic processing circuits (also referred to as a base unit); the structure of the embodiment is as shown in FIG. 4b.
- the dotted line frame is the internal structure of the neural network computing device;
- the gray filled arrow indicates the data transmission path between the main processing circuit and the basic processing circuit array, and
- the hollow arrows indicate the data transmission paths between adjacent basic processing circuits within the basic processing circuit array.
- the row and column dimensions of the basic processing circuit array may differ, that is, the values of m and n may be different (they may, of course, also be the same); the present disclosure does not limit the specific values.
- the circuit structure of the basic processing circuit is shown in Figure 4c. The dotted line in the figure indicates the boundary of the basic processing circuit, and the thick arrows crossing the dotted frame indicate the data input and output channels (an arrow pointing into the dotted frame is an input channel, and an arrow pointing out of it is an output channel). The rectangular boxes inside the dotted frame indicate the memory unit circuits (registers and/or on-chip buffers), holding input data 1, input data 2, the multiplication or inner product result, and the accumulation data; the diamond boxes represent the operator circuits, including the multiplier or inner product operator and the adder.
- the neural network computing device includes a main processing circuit and 16 basic processing circuits (16 basic processing circuits are for illustrative purposes only, and other values may be used in practical applications);
- each basic processing circuit has two data input interfaces and two data output interfaces; in the subsequent description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in Figure 4b) is referred to as input 0, and the vertical input interface (the vertical arrow pointing to the unit in Figure 4b) is referred to as input 1; each horizontal data output interface (the horizontal arrow pointing away from the unit in Figure 4b) is referred to as output 0, and each vertical data output interface (the vertical arrow pointing away from the unit in Figure 4b) is referred to as output 1.
- each basic processing circuit can be respectively connected to different units, including a main processing circuit and other basic processing circuits;
- the input 0 of the four basic processing circuits 0, 4, 8, 12 (numbered as shown in Figure 4b) is connected to the data output interface of the main processing circuit;
- the input 1 of the four basic processing circuits of the basic processing circuits 0, 1, 2, 3 is connected to the data output interface of the main processing circuit;
- the output 1 of the four basic processing circuits of the basic processing circuits 12, 13, 14, 15 is connected to the data input interface of the main processing circuit;
- the connections from the output interfaces of the basic processing circuits to the input interfaces of other basic processing circuits are shown in Figure 4b and will not be enumerated one by one;
- if the output interface S1 of a unit S is connected to the input interface P1 of a unit P, the unit P can receive, from its P1 interface, the data that the unit S sends to its S1 interface.
- the embodiment includes a main processing circuit connected to an external device (i.e., having both an input interface and an output interface); a part of the data output interfaces of the main processing circuit is connected to the data input interfaces of a part of the basic processing circuits, and a part of the data input interfaces of the main processing circuit is connected to the data output interfaces of a part of the basic processing circuits.
- the data involved in the usage method provided by the present disclosure may be compressed data.
- for details on how to implement the compression processing of the data, refer to the related descriptions in the foregoing embodiments, for example, FIG. 1e to FIG. 1k; details are not repeated here.
- the control circuitry of the main processing circuitry can distribute the data to the underlying processing circuitry for operation.
- (the compression mapping circuit of the basic processing circuit first compresses the data and then performs the operation; this reduces the amount of data to be calculated, makes the basic processing circuit more efficient at performing data operations, and lowers the power consumption)
- the basic processing circuit can receive the data and then compress the data by the compression mapping circuit and then perform calculations. For example, the basic processing circuit receives the sparse data transmitted by the main processing circuit.
- the compression mapping circuit compresses the data, and then the inner product operator circuit, the vector operator circuit or the accumulator circuit of the basic processing circuit performs the operation on the compressed data to improve the operation efficiency and reduce the power consumption.
- the main processing circuit receives input data to be calculated from outside the device
- the main processing circuit performs arithmetic processing on the data by using the various operation circuits of the unit, such as a vector operation circuit, an inner product operator circuit, an accumulator circuit, and the like;
- the main processing circuit sends data through its data output interfaces to the basic processing circuit array (as shown in FIG. 5b);
- one method of sending data is to send the same data to every basic processing circuit, that is, a broadcast mode (the broadcast may be completed in one or multiple broadcasts);
- another method of sending data is to send different data to different basic processing circuits, that is, a distribution mode;
- the basic processing circuit array calculates the data
- the basic processing circuit performs the operation after receiving the input data.
- the basic processing circuit may determine, according to the operation instruction of the data, whether to start the compression mapping unit in the basic processing circuit to compress the data, and then perform the calculation on the compression-processed data.
- after receiving data, the basic processing circuit transmits the data out from its data output interface (to the other basic processing circuits that do not receive data directly from the main processing circuit; optionally, the transmitted data may also be compression-processed data);
- the basic processing circuit transmits the operation result from the data output interface; (intermediate calculation result or final calculation result)
- the main processing circuit receives the output data returned from the basic processing circuit array
- the main processing circuit continues to process (e.g., accumulate or activate) the data received from the basic processing circuit array;
- after the main processing circuit finishes processing, the processing result is transmitted from the data output interface to the outside of the device.
- the matrix-multiply-vector operation computes the inner product of each row of the matrix with the vector and places the results into a vector in the order of the corresponding rows.
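A minimal sketch of this definition (plain Python, illustrative only):

```python
def matrix_mul_vector(S, P):
    """result[i] is the inner product of row i of S with the vector P."""
    return [sum(s * p for s, p in zip(row, P)) for row in S]

S = [[1, 2], [3, 4]]
P = [5, 6]
print(matrix_mul_vector(S, P))  # [17, 39]
```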
- this method uses all or a portion of the basic processing circuits of the neural network computing device; assume that K basic processing circuits are used;
- the main processing circuit transmits data in part or all of the rows of the matrix S to each of the K basic processing circuits;
- the control circuit of the main processing circuit sends one number or part of the numbers of several rows of the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st numbers of rows 3, 4, and 5 are sent the first time, the 2nd numbers of rows 3, 4, and 5 the second time, and the 3rd numbers of rows 3, 4, and 5 the third time ...; or the first two numbers of rows 3, 4, and 5 are sent the first time, the 3rd and 4th numbers of rows 3, 4, and 5 the second time, and the 5th and 6th numbers of rows 3, 4, and 5 the third time ...).
- the control circuit of the main processing circuit sequentially transmits the data of the vector P to the 0th basic processing circuit;
- after receiving the data of the vector P, the 0th basic processing circuit sends the data to the next basic processing circuit connected to it, that is, the basic processing circuit 1;
- some basic processing circuits cannot obtain all the data required for calculation directly from the main processing circuit. For example, the basic processing circuit 1 in Figure 2d has only one data input interface connected to the main processing circuit, so it can only obtain the data of the matrix S directly from the main processing circuit, while the data of the vector P has to be forwarded to it by the basic processing circuit 0. Likewise, after receiving the data, the basic processing circuit 1 continues to output the data of the vector P to the basic processing circuit 2.
- each of the k basic processing circuits, after receiving data, determines according to the operation instruction of the data (i.e., the operation control) whether to start its corresponding compression mapping circuit to compress the data, and then performs the calculation with the compression-processed data; optionally, the compressed data can also be transmitted to other basic processing circuits.
- optionally, after receiving the input matrix S or matrix P, the basic processing circuit starts the compression mapping circuit to cull the data in the input matrix S and matrix P that equals a specified value (such as 0) and/or whose value is less than a preset threshold (such as 0.1). The culling can be performed according to the mask matrices corresponding to the matrix S and the matrix P; for example, the data of the matrix S/P at the positions where the corresponding mask matrix is 0 is removed.
- the matrix S and the matrix P here may also be understood, correspondingly, as the input neurons (also referred to as an input neuron matrix) and the weights (also referred to as a weight matrix) in the foregoing embodiments.
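A minimal sketch of the mask-based culling, assuming a mask value of 0 marks data to remove (names are illustrative):

```python
def cull_by_mask(values, mask):
    """Remove the entries whose mask value is 0, i.e. the positions holding
    the specified value or falling below the preset threshold."""
    return [v for v, m in zip(values, mask) if m != 0]

row  = [0.0, 1.5, 0.05, 2.0]
mask = [0,   1,   0,    1]      # 0 marks data to cull
print(cull_by_mask(row, mask))  # [1.5, 2.0]
```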
- Each of the basic processing circuits performs operations on the received data, including but not limited to: inner product operations, multiplication operations, addition operations, and the like;
- the basic processing circuit calculates the multiplication of one or more groups of two data each time, and then accumulates the result into the register and/or on-chip buffer;
- the basic processing circuit calculates the inner product of one or more groups of two vectors each time, and then accumulates the result into the register and/or on-chip buffer;
- the result is transmitted from the data output interface (ie, transmitted to other basic processing circuits connected thereto);
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- after receiving a calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or to the main processing circuit connected to it;
- the main processing circuit receives the result of the inner product operation of each basic processing circuit, and processes the result to obtain a final result (the processing may be an accumulation operation or an activation operation, etc.).
- a plurality of basic processing circuits used in the method are arranged in the manner shown in FIG. 5d or FIG. 5e as follows;
- the control circuit of the main processing circuit divides the M rows of the matrix S into K groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in that group of data is denoted as Ai);
- before performing the operation on the i-th group, the i-th basic processing circuit may determine whether it is necessary to start the compression mapping circuit to compress Ai, and then perform the operation on the compression-processed Ai; whether the compression mapping circuit compresses Ai before the operation is not limited in this application.
- the method of grouping the M rows of data is any grouping method that does not allocate a row repeatedly;
- for example, the following allocation mode is adopted: the j-th row is allocated to the (j % K)-th basic processing circuit (% is the remainder operation), as illustrated in the sketch below;
- alternatively, a part of the rows may be distributed evenly first, and the remaining rows may be allocated in an arbitrary manner.
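The j % K allocation named above can be sketched as follows (illustrative only):

```python
def assign_rows(M, K):
    """Allocate row j of the matrix S to basic processing circuit j % K --
    a grouping that never allocates a row twice."""
    groups = [[] for _ in range(K)]
    for j in range(M):
        groups[j % K].append(j)
    return groups

print(assign_rows(10, 4))  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```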
- the control circuit of the main processing circuit sequentially transmits the data in part or all of the rows in the matrix S to the corresponding basic processing circuit;
- the control circuit of the main processing circuit sends, each time, one or more data of one row of the i-th group of data Mi for which the i-th basic processing circuit is responsible;
- alternatively, the control circuit of the main processing circuit sends, each time, one or more data of each of some or all rows of the i-th group of data Mi to the i-th basic processing circuit;
- the control circuit of the main processing circuit sequentially transmits the data in the vector P to the first basic processing circuit
- control circuitry of the main processing circuit can transmit one or more data in the vector P each time;
- the i-th basic processing circuit receives the data of the vector P and sends it to the i+1th basic processing circuit connected thereto.
- the data of the transmitted vector P may be the compressed data.
- Each basic processing circuit receives one or more data from a certain row or rows of the matrix S and one or more data from the vector P, and performs operations (including but not limited to multiplication or addition);
- the basic processing circuit calculates the multiplication of one or more groups of two data each time, and then accumulates the result into the register and/or on-chip buffer;
- the basic processing circuit calculates the inner product of one or more groups of two vectors each time, and then accumulates the result into the register and/or on-chip buffer;
- the data received by the basic processing circuit may also be an intermediate result, stored in a register or an on-chip buffer;
- the basic processing circuit transmits the local calculation result to the next basic processing circuit or main processing circuit connected thereto;
- optionally, only the output interface of the last basic processing circuit of each column is connected to the main processing circuit. In this case, only the last basic processing circuit can directly transmit its local calculation result to the main processing circuit, while the calculation results of the other basic processing circuits are passed to the next basic processing circuit, and each next basic processing circuit passes them on, until everything has been transmitted to the last basic processing circuit. The last basic processing circuit accumulates its local calculation result with the results received from the other basic processing circuits of the column to obtain an intermediate result, and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit may send the results of the other basic processing circuits of the column together with its local processing result directly to the main processing circuit.
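A minimal sketch of the column accumulation chain just described, assuming each circuit simply adds its local result to the running value received from above (illustrative only):

```python
def column_accumulate(local_results):
    """local_results are ordered from the first basic processing circuit of
    a column to the last; each circuit adds its local inner product result
    to the value received from above and passes the sum downward."""
    carried = 0
    for local in local_results:
        carried += local      # accumulate, then pass to the next circuit
    return carried            # the last circuit sends this to the main circuit

print(column_accumulate([3, 1, 4, 1]))  # 9
```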
- alternatively, each basic processing circuit has an output interface connected to the main processing circuit; in this case, each basic processing circuit directly transmits its local calculation result to the main processing circuit;
- after receiving a calculation result transmitted by other basic processing circuits, the basic processing circuit transmits it to the next basic processing circuit or main processing circuit connected to it.
- the main processing circuit receives the M inner product operation results as the operation result of the matrix-multiply-vector operation.
- the control circuit of the main processing circuit sends the data in some or all of the rows of the matrix S to those basic processing circuits that are directly connected to the main processing circuit through the horizontal data input interface (e.g., the gray-filled vertical data paths at the top of Figure 4b);
- the control circuit of the main processing circuit sends one number or part of the numbers of a certain row of the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st number of row 3 is sent the first time, the 2nd number of row 3 the second time, and the 3rd number of row 3 the third time ...; or the first two numbers of row 3 are sent the first time, the 3rd and 4th numbers of row 3 the second time, and the 5th and 6th numbers of row 3 the third time ...);
- the control circuit of the main processing circuit sends one number or part of the numbers of several rows of the matrix S to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st numbers of rows 3, 4, and 5 are sent the first time, the 2nd numbers of rows 3, 4, and 5 the second time, and the 3rd numbers of rows 3, 4, and 5 the third time ...; or the first two numbers of rows 3, 4, and 5 are sent the first time, the 3rd and 4th numbers of rows 3, 4, and 5 the second time, and the 5th and 6th numbers of rows 3, 4, and 5 the third time ...);
- the control circuit of the main processing circuit sends the data in some or all of the columns of the matrix P to those basic processing circuits that are directly connected to the main processing circuit through the vertical data input interface (e.g., the gray-filled horizontal data paths on the left side of the basic processing circuit array in Figure 4b);
- the control circuit of the main processing circuit sends one number or part of the numbers of a certain column of the matrix P to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st number of column 3 is sent the first time, the 2nd number of column 3 the second time, and the 3rd number of column 3 the third time ...; or the first two numbers of column 3 are sent the first time, the 3rd and 4th numbers of column 3 the second time, and the 5th and 6th numbers of column 3 the third time ...);
- the control circuit of the main processing circuit sends one number or part of the numbers of several columns of the matrix P to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st numbers of columns 3, 4, and 5 are sent the first time, the 2nd numbers of columns 3, 4, and 5 the second time, and the 3rd numbers of columns 3, 4, and 5 the third time ...; or the first two numbers of columns 3, 4, and 5 are sent the first time, the 3rd and 4th numbers of columns 3, 4, and 5 the second time, and the 5th and 6th numbers of columns 3, 4, and 5 the third time ...);
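The chunked transmission pattern in the examples above (one or a few numbers of a row or column per transfer) can be sketched as follows (illustrative only):

```python
def send_in_chunks(numbers, chunk_size):
    """Yield successive pieces of one row or column; e.g. with
    chunk_size=2: numbers 1-2 first, then 3-4, then 5-6, ..."""
    for start in range(0, len(numbers), chunk_size):
        yield numbers[start:start + chunk_size]

row3 = [10, 20, 30, 40, 50, 60]
print(list(send_in_chunks(row3, 2)))  # [[10, 20], [30, 40], [50, 60]]
```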
- after receiving the data of the matrix S, the basic processing circuit transmits the data through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 4b); after receiving the data of the matrix P, the basic processing circuit transmits the data through its vertical data output interface to the next basic processing circuit connected to it (e.g., the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 4b);
- each of the basic processing circuits includes a compression mapping circuit
- the basic processing circuit may determine, according to the operation control of the data, whether to start the compression mapping circuit to compress the data; further, the compressed data may be transmitted to its next basic processing circuit through its horizontal or vertical data output interface;
- optionally, after receiving the input matrix S or matrix P, the basic processing circuit starts the compression mapping circuit to cull the data in the input matrix S and matrix P that equals a specified value (such as 0) and/or whose value is less than a preset threshold (such as 0.1); the culling can be performed according to the mask matrices corresponding to the matrix S and the matrix P, for example, removing the data of the matrix S/P at the positions where the corresponding mask matrix is 0.
- the matrix S and the matrix P here may also be understood, correspondingly, as the input neurons (also referred to as an input neuron matrix) and the weights (also referred to as a weight matrix) in the foregoing embodiments.
- when the compression mapping circuit is included in each of the basic processing circuits of the first column and the first row, after each basic processing circuit of the first column or the first row of the device receives data (specifically, the data of the matrix S or the matrix P), it may determine, according to the operation control corresponding to the data, whether it is necessary to enable the compression mapping circuit in that basic processing circuit to compress the data; further, the compression-processed data may be transmitted to the next basic processing circuit through its horizontal or vertical data output interface. Optionally, each basic processing circuit of the first column or the first row of the device may directly start its internal compression mapping circuit to compress the data after receiving it, and then perform subsequent operations, such as sending the data to other basic processing circuits or operating on it.
- Each of the basic processing circuits operates on the received data.
- the received data may be compressed processed data.
- the basic processing circuit calculates the multiplication of one or more groups of two data each time, and then accumulates the result into the register and/or on-chip buffer;
- the basic processing circuit calculates the inner product of one or more groups of two vectors each time, and then accumulates the result into the register and/or on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in Figure 4b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass the operation result downward through the vertical output interface).
- after receiving a calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or to the main processing circuit connected to it;
- the main processing circuit receives the inner product operation results from the basic processing circuits and obtains the output result.
- the method uses a basic processing circuit array arranged in the manner shown in Figure 4b, assuming that there are h rows, w columns;
- the control circuit of the main processing circuit divides the rows of data of the matrix S into h groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of rows in that group of data is denoted as Hi);
- the method of grouping the h-line data is any grouping method that does not repeatedly allocate;
- for example, the control circuit of the main processing circuit allocates the j-th row to the (j % h)-th basic processing circuit row;
- a part of the lines may be equally distributed first, and the remaining lines may be allocated in an arbitrary manner.
- the control circuit of the main processing circuit divides the W columns of data of the matrix P into w groups, and the i-th basic processing circuit is responsible for the operation of the i-th group (the set of columns in that group of data is denoted as Wi);
- the method of grouping the W columns of data here is any grouping method that does not allocate a column repeatedly;
- for example, the control circuit of the main processing circuit allocates the j-th column to the (j % w)-th basic processing circuit column;
- alternatively, a part of the columns may be distributed evenly first, and the remaining columns may be allocated in an arbitrary manner.
- the control circuit of the main processing circuit transmits data in part or all of the rows of the matrix S to the first basic processing circuit of each row in the basic processing circuit array;
- the control circuit of the main processing circuit transmits, each time, one or more data of one row of the i-th group of data Hi for which it is responsible to the first basic processing circuit of the i-th row of the basic processing circuit array;
- alternatively, the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all rows of the i-th group of data Hi to the first basic processing circuit of the i-th row of the basic processing circuit array.
- the control circuit of the main processing circuit transmits data in part or all of the columns of the matrix P to the first basic processing circuit of each column in the basic processing circuit array;
- the control circuit of the main processing circuit transmits, each time, one or more data of one column of the i-th group of data Wi for which it is responsible to the first basic processing circuit of the i-th column of the basic processing circuit array;
- alternatively, the control circuit of the main processing circuit transmits, each time, one or more data of each of some or all columns of the i-th group of data Wi to the first basic processing circuit of the i-th column of the basic processing circuit array.
- after receiving the data of the matrix S, the basic processing circuit transmits the data through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 4b); after receiving the data of the matrix P, the basic processing circuit transmits the data through its vertical data output interface to the next basic processing circuit connected to it (e.g., the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 4b);
- each of the basic processing circuits includes a compression mapping circuit
- the basic processing circuit may determine, according to the operation control of the data, whether to start the compression mapping circuit to compress the data; further, the compressed data may be transmitted to the next basic processing circuit through its horizontal or vertical data output interface; for the compression processing of the data, refer to the related description in the foregoing embodiments, which is not repeated here.
- when the compression mapping circuit is included in each of the basic processing circuits of the first column and the first row, after each basic processing circuit of the first column or the first row of the device receives data (specifically, the data of the matrix S or the matrix P), the data can be compressed; further, the compressed data can be transmitted to its next basic processing circuit through its horizontal or vertical data output interface.
- Each of the basic processing circuits performs operations on the received data.
- the received data may be compressed processed data;
- the basic processing circuit calculates the multiplication of one or more groups of two data each time, and then accumulates the result into the register and/or on-chip buffer;
- the basic processing circuit calculates the inner product of one or more groups of two vectors each time, and then accumulates the result into the register and/or on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation
- if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass the operation result downward through the vertical output interface).
- after receiving a calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or to the main processing circuit connected to it;
- the main processing circuit receives the inner product operation results from the basic processing circuits and obtains the output result.
- to perform the operation of a fully connected layer, the weight matrix of the fully connected layer is used as the matrix S and the input vector as the vector P, and the operation is performed according to the matrix-multiply-vector method of the device;
- alternatively, the weight matrix of the fully connected layer is used as the matrix S and the input vector as the matrix P, or the weight matrix of the fully connected layer is used as the matrix P and the input vector as the matrix S, and the operation is performed according to the matrix-multiply-matrix method of the device;
- the convolution operation is performed using the circuit device:
- the convolution operation is described below.
- one block in the figures below represents one data element;
- the input data is represented by Figure 6a (N samples, each sample has C channels, and the feature map of each channel has a height H and a width W).
- the weights, that is, the convolution kernels, are represented by Figure 6b (there are M convolution kernels; each convolution kernel has C channels, whose height and width are KH and KW, respectively).
- the rules of the convolution operation are the same for the N samples of the input data; the following describes the convolution process on one sample. On one sample, each of the M convolution kernels performs the same operation: each convolution kernel operation obtains one plane feature map, and the M convolution kernels finally compute M plane feature maps (for one sample, the convolution output is M feature maps).
- FIG. 6c shows a convolution kernel performing the inner product operation at the lower right corner position of one sample of the input data;
- FIG. 6d shows the convolution position sliding one step to the left;
- FIG. 6e shows the convolution position sliding one step upward.
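A minimal NumPy sketch of the convolution just described for a single sample: M kernels of shape (C, KH, KW) slide over an input of shape (C, H, W) and produce M plane feature maps (stride 1 and no padding are assumptions made for illustration):

```python
import numpy as np

def conv_single_sample(x, kernels):
    """x: one sample of shape (C, H, W); kernels: shape (M, C, KH, KW).
    Returns M plane feature maps of shape (M, H-KH+1, W-KW+1)."""
    C, H, W = x.shape
    M, _, KH, KW = kernels.shape
    out = np.zeros((M, H - KH + 1, W - KW + 1))
    for m in range(M):
        for i in range(H - KH + 1):
            for j in range(W - KW + 1):
                # inner product of kernel m with one sliding window
                out[m, i, j] = np.sum(x[:, i:i+KH, j:j+KW] * kernels[m])
    return out

x = np.random.rand(3, 5, 5)            # C=3, H=W=5
k = np.random.rand(2, 3, 2, 2)         # M=2, KH=KW=2
print(conv_single_sample(x, k).shape)  # (2, 4, 4): M feature maps
```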
- the control circuit of the main processing circuit sends the data of some or all of the convolution kernels of the weights to those basic processing circuits that are directly connected to the main processing circuit through the horizontal data input interface (e.g., the gray-filled vertical data paths at the top of Figure 4b);
- the control circuit of the main processing circuit sends one number or part of the numbers of a certain convolution kernel to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st number of row 3 is sent the first time, the 2nd number of row 3 the second time, and the 3rd number of row 3 the third time ...; or the first two numbers of row 3 are sent the first time ...);
- alternatively, the control circuit of the main processing circuit sends the data of several convolution kernels of the weights to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st numbers of rows 3, 4, and 5 are sent the first time, the 2nd numbers of rows 3, 4, and 5 the second time, and the 3rd numbers of rows 3, 4, and 5 the third time ...; or the first two numbers of rows 3, 4, and 5 are sent the first time, the 3rd and 4th numbers of rows 3, 4, and 5 the second time, and the 5th and 6th numbers of rows 3, 4, and 5 the third time ...);
- the control circuit of the main processing circuit divides the input data according to the convolution positions, and sends the data of some or all of the convolution positions in the input data to those basic processing circuits that are directly connected to the main processing circuit through the vertical data input interface (e.g., the gray-filled horizontal data paths on the left side of the basic processing circuit array in Figure 4b);
- the control circuit of the main processing circuit sends one number or part of the numbers of the data of a certain convolution position in the input data to a certain basic processing circuit each time (for example, for a certain basic processing circuit, the 1st number of column 3 is sent the first time, the 2nd number of column 3 the second time, and the 3rd number of column 3 the third time ...; or the first two numbers of column 3 are sent the first time ...);
- after receiving the data of the weights, the basic processing circuit transmits the data through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the basic processing circuit array in Figure 4b); after receiving the data of the input data, the basic processing circuit transmits the data through its vertical data output interface to the next basic processing circuit connected to it (e.g., the white-filled vertical data paths in the middle of the basic processing circuit array in Figure 4b). Optionally, after receiving the data (specifically, part or all of the data of a convolution kernel), the basic processing circuit may determine, according to the operation control of the data, whether to start the compression mapping circuit to compress the data; further, the compressed data may be transmitted to the next basic processing circuit through its horizontal or vertical data output interface; for details, refer to the related description in the foregoing embodiments.
- Each of the basic processing circuits operates on the received data, and the received data may be compressed processed data;
- the base processing circuit calculates a multiplication of one or more sets of two data at a time, and then accumulates the result on a register and/or an on-chip buffer;
- the base processing circuit calculates the inner product of one or more sets of two vectors at a time, and then accumulates the result on the register and/or the on-chip buffer;
- the result can be transmitted from the data output interface
- the result of the calculation may be the final result or an intermediate result of the inner product operation; in particular, if the basic processing circuit has an output interface directly connected to the main processing circuit, the result is transmitted from that interface; if not, the result is output in the direction of the basic processing circuit that can output directly to the main processing circuit (for example, in Figure 4b, the bottom row of basic processing circuits outputs its results directly to the main processing circuit, while the other basic processing circuits pass the operation result downward through the vertical output interface).
- after receiving a calculation result from other basic processing circuits, the basic processing circuit transmits the data to other basic processing circuits or to the main processing circuit connected to it;
- the main processing circuit receives the inner product operation results from the basic processing circuits and obtains the output result.
- the present application discloses a neural network computing device that includes functional units for performing all or part of the embodiments provided in the method embodiments described above.
- the present application discloses a chip (as in Figure 7) for performing all or part of the embodiments provided in the method embodiments described above.
- the present application discloses an electronic device that includes functional units for performing all or a portion of the embodiments of the method as described above.
- Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, still cameras, camcorders, projectors, watches, headphones, mobile storage, wearables, vehicles, household appliances, and/or medical equipment.
- the vehicle includes an airplane, a ship, and/or a vehicle;
- the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove, and/or a range hood;
- the medical device includes a nuclear magnetic resonance instrument, a B-mode ultrasound scanner, and/or an electrocardiograph.
Abstract
An integrated circuit chip device and related products. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits; the main processing circuit, or at least one of the plurality of basic processing circuits, includes a compression mapping circuit (101), and the compression mapping circuit (101) is used to perform compression processing on each piece of data in a neural network operation. The integrated circuit chip device and related products have the advantages of a small amount of calculation and low power consumption.
Description
Related applications:
This application claims priority to Chinese application No. 201711499267.X, filed on December 30, 2017, entitled "Integrated circuit chip device and related products";
This application claims priority to Chinese application No. 201711499268.4, filed on December 30, 2017, entitled "Integrated circuit chip device and related products";
This application claims priority to Chinese application No. 201711499265.0, filed on December 30, 2017, entitled "Integrated circuit chip device and related products";
This application claims priority to Chinese application No. 201711499266.5, filed on December 30, 2017, entitled "Integrated circuit chip device and related products".
The present disclosure relates to the field of neural networks, and in particular to an integrated circuit chip device and related products.
The Artificial Neural Network (ANN) has been a research hotspot in the field of artificial intelligence since the 1980s. It abstracts the neuron network of the human brain from the perspective of information processing, establishes a simple model, and forms different networks according to different connection modes. In engineering and academia it is often referred to directly as a neural network or a neural-like network. A neural network is a computational model composed of a large number of interconnected nodes (or neurons). Existing neural network operations are implemented on a Central Processing Unit (CPU) or a Graphics Processing Unit (GPU); such operations involve a large amount of calculation and high power consumption.
Summary
Embodiments of the present disclosure provide an integrated circuit chip device and related products, which can increase the processing speed of a computing device and improve its efficiency.
In a first aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes: a main processing circuit, k branch circuits, and k groups of basic processing circuits. The main processing circuit is connected to the k branch circuits respectively; each of the k branch circuits corresponds one-to-one to one group of the k groups of basic processing circuits, and a group of basic processing circuits includes at least one basic processing circuit;
the branch circuit includes: a compression mapping circuit for performing compression processing on each piece of data in the neural network operation;
the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data with the k branch circuits connected to it;
the k branch circuits are configured to forward the transmission data between the main processing circuit and the k groups of basic circuits, and to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data;
the k basic processing circuits are configured to perform the operations in the neural network in parallel according to the transmission data or the compression-processed transmission data, and to transmit the operation results to the main processing circuit.
In a second aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits;
the basic processing circuit includes: a compression mapping circuit; the compression mapping circuit is used to perform compression processing on each piece of data in the neural network operation;
the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data to the plurality of basic processing circuits;
the plurality of basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data; to perform the operations in the neural network in parallel according to the transmission data or the compression-processed transmission data; and to transmit the operation results to the main processing circuit.
In a third aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits;
the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality of basic processing circuits, the k basic circuits being: the n basic processing circuits in the first row, the n basic processing circuits in the m-th row, and the m basic processing circuits in the first column;
the plurality of basic processing circuits include: a compression mapping circuit for performing compression processing on each piece of data in the neural network operation;
the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data with the k basic processing circuits;
the k basic processing circuits are used for data forwarding between the main processing circuit and the plurality of basic processing circuits;
the plurality of basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, to perform the operations in the neural network in parallel according to the compression-processed transmission data, and to transmit the operation results to the main processing circuit.
In a fourth aspect, an integrated circuit chip device is provided. The integrated circuit chip device includes: a main processing circuit and a plurality of basic processing circuits;
the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to the adjacent basic processing circuits, and the main processing circuit is connected to k basic processing circuits of the plurality of basic processing circuits, the k basic circuits being: the n basic processing circuits in the first row and the m basic processing circuits in the first column;
the k basic processing circuits include: a compression mapping circuit for performing compression processing on each piece of data in the neural network operation;
the main processing circuit is configured to perform each continuous operation in the neural network operation and to transmit data with the basic processing circuits connected to it;
the k basic processing circuits are configured to determine, according to the operation control of the transmission data, whether to start the compression mapping circuit to compress the transmission data, and to send the compression-processed transmission data to the basic processing circuits connected to the k basic processing circuits;
the plurality of basic processing circuits are configured to perform the operations in the neural network in parallel according to the compression-processed transmission data, and to transmit the operation results to the main processing circuit.
In a fifth aspect, a neural network operation device is provided; the neural network operation device includes the integrated circuit chip device provided in any one of the first to fourth aspects.
In a sixth aspect, a combined processing device is provided; the combined processing device includes: the neural network operation device provided in the fifth aspect, a universal interconnection interface, and a general-purpose processing device;
the neural network operation device is connected to the general-purpose processing device through the universal interconnection interface.
In a seventh aspect, a chip is provided; the chip integrates the device provided in any one of the first to sixth aspects.
In an eighth aspect, an electronic device is provided; the electronic device includes the chip of the seventh aspect.
In a ninth aspect, a neural network operation method is provided; the method is applied in an integrated circuit chip device, the integrated circuit chip device including the integrated circuit chip device of any one of the first to fourth aspects, and the integrated circuit chip device is used to perform the operation of a neural network.
It can be seen that, according to the embodiments of the present disclosure, the compression mapping circuit is provided to compress data blocks before the operation is performed, which saves transmission resources and computing resources; therefore, the solution has the advantages of low power consumption and a small amount of calculation.
FIG. 1a is a schematic structural diagram of an integrated circuit chip device.
FIG. 1b is a schematic structural diagram of another integrated circuit chip device.
FIG. 1c is a schematic structural diagram of a basic processing circuit.
FIG. 1d is a partial structural diagram of a compression mapping circuit provided by an embodiment of the present application.
FIG. 1e is a schematic diagram of a neural network structure provided by an embodiment of the present application.
FIGS. 1f to 1k are partial structural diagrams of further compression mapping circuits provided by embodiments of the present application.
FIG. 2 is a schematic flowchart of a matrix-times-vector operation.
FIG. 2a is a schematic diagram of a matrix multiplied by a vector.
FIG. 2b is a schematic flowchart of a matrix-times-matrix operation.
FIG. 2c is a schematic diagram of a matrix Ai multiplied by a vector B.
FIG. 2d is a schematic diagram of a matrix A multiplied by a matrix B.
FIG. 2e is a schematic diagram of a matrix Ai multiplied by a matrix B.
FIG. 3a is a schematic diagram of neural network training.
FIG. 3b is a schematic diagram of a convolution operation.
FIG. 4a is a schematic structural diagram of another integrated circuit chip device.
FIG. 4b is a schematic structural diagram of another integrated circuit chip device.
FIG. 4c is a schematic structural diagram of a basic processing circuit.
FIG. 5a is a schematic diagram of a method of using a basic processing circuit.
FIG. 5b is a schematic diagram of the main processing circuit transmitting data.
FIG. 5c is a schematic diagram of a matrix multiplied by a vector.
FIG. 5d is a schematic structural diagram of an integrated circuit chip device.
FIG. 5e is a schematic structural diagram of another integrated circuit chip device.
FIG. 5f is a schematic diagram of a matrix multiplied by a matrix.
FIG. 6a is a schematic diagram of convolution input data.
FIG. 6b is a schematic diagram of convolution kernels.
FIG. 6c is a schematic diagram of an operation window of one three-dimensional data block of the input data.
FIG. 6d is a schematic diagram of another operation window of the three-dimensional data block.
FIG. 6e is a schematic diagram of yet another operation window of the three-dimensional data block.
FIG. 7 is a schematic structural diagram of a neural network chip provided by an embodiment of the present application.
To enable those skilled in the art to better understand the solution of the present disclosure, the technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
In the device of the present application, the main processing circuit is configured to perform the successive operations of the neural network operation and to transmit data to the plurality of basic processing circuits; the k groups of basic processing circuits are configured to perform the operations of the neural network in parallel on the transmitted data and to return the operation results to the main processing circuit.
In an optional embodiment, the device further includes k branch circuits; the main processing circuit is connected to each of them, each branch circuit corresponds to one group of the k groups of basic processing circuits and forwards the transmitted data between the main processing circuit and that group.
In an optional embodiment, the basic processing circuits include compression mapping circuits for compressing the individual data of the neural network operation; the k groups of basic processing circuits control, according to the operation of the transmitted data, whether to start the compression mapping circuits, then perform the neural network operations in parallel on the transmitted or compressed data and return the results to the main processing circuit.
In an optional embodiment, the main processing circuit obtains a data block to be computed and an operation instruction, divides the data block into a distribution data block and a broadcast data block according to the instruction, splits the distribution data block into a plurality of basic data blocks, distributes the basic data blocks to the circuits connected to it, and broadcasts the broadcast data block to those circuits. The basic processing circuits start the compression mapping circuits according to the operation control, compress the basic data blocks and the broadcast data block, perform inner-product operations on them to obtain operation results and send the results to the main processing circuit; the main processing circuit processes the results into the instruction result of the data block and the operation instruction. The data block to be computed is at least one input neuron and/or at least one weight to be computed.
In an optional embodiment, the branch circuits include compression mapping circuits for compressing the individual data of the neural network operation; the main processing circuit performs the successive operations and transmits data to the connected k branch circuits; the k branch circuits forward the transmitted data between the main processing circuit and the k groups of basic circuits and control, according to the operation of the transmitted data, whether to start the compression mapping circuits; the k groups of basic processing circuits perform the neural network operations in parallel on the transmitted or compressed data and return the results to the main processing circuit.
In an optional embodiment, the main processing circuit obtains a data block to be computed and an operation instruction, divides the data block into a distribution data block and a broadcast data block according to the instruction, splits the distribution data block into a plurality of basic data blocks, distributes them to the connected k branch circuits and broadcasts the broadcast data block to them; the k branch circuits receive the basic data blocks and the broadcast data block, start the compression mapping circuits to compress them, and forward the compressed basic data blocks and compressed broadcast data block to the k groups of basic processing circuits; the basic processing circuits perform inner-product operations on the compressed blocks and send the results to the main processing circuit, which processes them into the instruction result. The distribution data block and the broadcast data block are at least one input neuron or at least one weight.
In an optional embodiment, the main processing circuit broadcasts the broadcast data block to the k branch circuits in a single broadcast.
In an optional embodiment, the main processing circuit divides the broadcast data block into a plurality of partial broadcast data blocks and broadcasts them to the k branch circuits in multiple broadcasts.
In an optional embodiment, a basic processing circuit performs one inner-product operation on a partial broadcast data block and a basic data block, accumulates the inner-product results into a partial operation result, and sends it to the main processing circuit.
In an optional embodiment, a basic processing circuit reuses a partial broadcast data block n times, performing inner-product operations of that block with n basic data blocks to obtain n partial processing results, accumulating them respectively into n partial operation results and sending them to the main processing circuit, where n is an integer greater than or equal to 2.
In an optional embodiment, the main processing circuit includes a main register or a main on-chip cache circuit;
or the branch circuits include basic registers or basic on-chip cache circuits;
or the basic processing circuits include basic registers or basic on-chip cache circuits.
In an optional embodiment, the main processing circuit includes one or any combination of a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transposition circuit, a direct memory access circuit, a compression mapping circuit and a data rearrangement circuit.
In an optional embodiment, the data are one or any combination of vectors, matrices, three-dimensional data blocks, four-dimensional data blocks and n-dimensional data blocks.
In an optional embodiment, if the operation instruction is a multiplication instruction, the main processing circuit determines the multiplier data block as the broadcast data block and the multiplicand data block as the distribution data block;
if the operation instruction is a convolution instruction, the main processing circuit determines the input data block as the broadcast data block and the convolution kernels as the distribution data block.
In optional embodiments, the neural network operations of the present application include one or any combination of convolution operations, matrix-times-matrix operations, matrix-times-vector operations, bias operations, fully connected operations, GEMM operations, GEMV operations and activation operations.
Referring to FIG. 1a, a schematic structural diagram of an integrated circuit chip device: the chip device includes a main processing circuit, basic processing circuits and, optionally, branch processing circuits. The integrated circuit chip device includes a main processing circuit, k branch circuits (in FIG. 1a, k = 4; in practice k may take other values such as 8 or 16) and k groups of basic processing circuits; the main processing circuit is connected to each of the k branch circuits, each branch circuit corresponds to one group of basic processing circuits, and each group includes at least one basic processing circuit. In practice the compression mapping circuit may be arranged in the basic processing circuits or in the branch circuits, as shown by the dashed boxes in the figure; it compresses data as described below in this application.
The main processing circuit (see FIG. 1d) may include registers and/or an on-chip cache circuit, and may further include a control circuit, a vector arithmetic unit circuit, an arithmetic and logic unit (ALU) circuit, an accumulator circuit, a direct memory access (DMA) circuit and the like; in practice other circuits may also be added, such as a conversion circuit (e.g., a matrix transposition circuit), a data rearrangement circuit or an activation circuit.
Optionally, the main processing circuit may include a compression mapping circuit for compressing received or transmitted data, for example by removing data that are 0 or smaller than a preset threshold (e.g., 0.1). The preset threshold is user-defined or terminal-defined, e.g., 0.1 or 0.05. The present application does not limit the concrete form of the compression mapping circuit; the compression processing is elaborated below.
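To make this pruning rule concrete, here is a minimal Python sketch under the assumptions stated in the text (the function name and the 0.1 threshold are illustrative, not from the patent):

```python
import numpy as np

def prune_below_threshold(data: np.ndarray, threshold: float = 0.1):
    """Drop entries that are 0 or whose absolute value is at or below
    `threshold`. Returns the kept values and a 0/1 mask marking which
    positions survived -- the mask plays the role of the "connection
    relationship data" described later in this document."""
    mask = (np.abs(data) > threshold).astype(np.uint8)
    kept = data[mask.astype(bool)]
    return kept, mask

values, mask = prune_below_threshold(np.array([0.0, 0.5, 0.05, -0.3]))
print(values)  # [ 0.5 -0.3]
print(mask)    # [0 1 0 1]
```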
The main processing circuit further includes a data transmitting circuit and a data receiving circuit or interface; a data distribution circuit and a data broadcast circuit may be integrated in the data transmitting circuit or arranged separately, and the transmitting and receiving circuits may likewise be integrated into one data transceiving circuit. Broadcast data are data that the main processing circuit must send to every basic processing circuit; distribution data are data that the main processing circuit selectively sends to some basic processing circuits, the selection being determined by the main processing circuit according to the load and the computation mode. In the broadcast mode, the broadcast data are sent to every basic processing circuit (in practice in one broadcast or in multiple broadcasts; the specific implementations of the present application do not limit the number of broadcasts); in the distribution mode, the distribution data are selectively sent to some basic processing circuits.
When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits; the data may be identical or different. Specifically, under distribution, the receiving basic processing circuits may receive different data, though some of them may also receive the same data.
Specifically, when broadcasting data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits, and every receiving basic processing circuit may receive the same data.
Optionally, the vector arithmetic unit circuit of the main processing circuit can perform vector operations, including but not limited to addition, subtraction, multiplication and division of two vectors, or of a vector and a constant, or any operation on each element of a vector. The successive operations may specifically be vector-and-constant addition, subtraction, multiplication or division, activation operations, accumulation operations, and so on.
Each basic processing circuit may include a basic register and/or a basic on-chip cache circuit, and may further include one or any combination of an inner-product arithmetic unit circuit, a vector arithmetic unit circuit and an accumulator circuit; each of these may be an integrated circuit or a separately arranged circuit.
The chip device may optionally include one or more branch processing circuits. When present, the main processing circuit is connected to the branch processing circuits and each branch processing circuit is connected to basic processing circuits; the inner-product arithmetic unit circuit of a basic processing circuit performs inner-product operations between data blocks, the control circuit of the main processing circuit controls the data receiving/transmitting circuits to exchange external data and to distribute them to the branch processing circuits, and the branch processing circuits transceive data of the main or basic processing circuits. The structure of FIG. 1a is suited to the computation of complex data: since the number of units that the main processing circuit can connect is limited, branch processing circuits must be inserted between the main processing circuit and the basic processing circuits to give access to more basic processing circuits and thereby support computation on complex data blocks. The connection structure of the branch and basic processing circuits is arbitrary and not limited to the H-shape of FIG. 1a. Optionally, the direction from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the direction from the basic processing circuits back to the main processing circuit a gather structure, defined as follows: for distribution or broadcast, the number of basic processing circuits is greater than the number of main processing circuits, i.e., one main processing circuit corresponds to several basic processing circuits, so the direction from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, while the reverse direction from the basic processing circuits to the main processing circuit may be a gather structure.
A basic processing circuit receives data distributed or broadcast by the main processing circuit, saves them in its on-chip cache, performs operations to produce results, and can send data to the main processing circuit.
The data involved in the basic processing circuits may be compressed data; the corresponding implementations of the compression are described later.
Optionally, every basic processing circuit may include a compression mapping circuit, or compression mapping circuits may be configured in only some of the basic processing circuits, for compressing received or transmitted data; the present application does not limit its concrete form.
Optionally, the vector arithmetic unit circuit of a basic processing circuit may perform vector operations on two compressed vectors; in practice, the inner-product arithmetic unit circuit may compute the inner product of two compressed vectors, and the accumulator circuit may accumulate the inner-product results.
In one optional scheme, the two vectors are stored in the on-chip cache and/or registers, and the basic processing circuit fetches them as needed for the operation, which includes but is not limited to inner-product, multiplication, addition or other operations.
In one optional scheme, the inner-product results are accumulated into the on-chip cache and/or registers; the advantage is that data transmission between the basic and main processing circuits is reduced, operation efficiency is improved, and data transmission power consumption is lowered.
In one optional scheme, the inner-product results are transmitted directly as results without accumulation; the advantage is that the internal workload of the basic processing circuit is reduced and its operation efficiency improved.
In one optional scheme, each basic processing circuit can perform inner-product operations on several pairs of two vectors and accumulate the results of the several inner products separately;
in one optional scheme, the vector data of the several pairs can be stored in the on-chip cache and/or registers;
in one optional scheme, the results of the several inner products can be accumulated separately into the on-chip cache and/or registers;
in one optional scheme, the results of the several inner products can be transmitted directly as results without accumulation;
In one optional scheme, each basic processing circuit can compute the inner products of one vector with several vectors ("one-to-many" inner products, i.e., one of the two vectors of every pair is shared), and accumulate the inner-product result corresponding to each vector separately. This allows one set of weights to be used repeatedly on different input data, increasing data reuse, reducing the internal data transmission of the basic processing circuit, improving computation efficiency and lowering power consumption.
Specifically, in the data used for the inner products, the shared vector and the other vector of each group (the vector that differs between groups) may come from different sources:
in one optional scheme, the shared vector comes from a broadcast or distribution of the main processing circuit or of a branch processing circuit;
in one optional scheme, the shared vector comes from the on-chip cache;
in one optional scheme, the shared vector comes from the registers;
in one optional scheme, the non-shared vector of each group comes from a broadcast or distribution of the main processing circuit or of a branch processing circuit;
in one optional scheme, the non-shared vector of each group comes from the on-chip cache;
in one optional scheme, the non-shared vector of each group comes from the registers;
In one optional scheme, when several groups of inner products are computed, any number of copies of the shared vector may be kept in the on-chip cache and/or registers of the basic processing circuit;
in one optional scheme, one copy of the shared vector may be kept per inner-product group;
in one optional scheme, only a single copy of the shared vector may be kept.
Specifically, the results of the several inner products may be accumulated separately into the on-chip cache and/or registers;
specifically, the results of the several inner products may also be transmitted directly as results without accumulation.
Referring to the structure of FIG. 1a, it contains one main processing circuit (which can perform vector operations) and many basic processing circuits (which can perform inner-product operations). The benefit of this combination is that the device can not only use the basic processing circuits for matrix and vector multiplications but also use the main processing circuit for any other vector operation, so that with a limited hardware configuration more operations are completed faster, the number of data transfers with the outside of the device is reduced, computation efficiency improves and power consumption drops. Moreover, compression mapping circuits can be arranged in the basic processing circuits and/or the main processing circuit, so that the amount of data is reduced during neural network computation, and the chip can dynamically decide, according to the workload (load) of each circuit (mainly of the main and basic processing circuits), which circuit performs the data compression; this reduces the complexity of the computation and the power consumption, and the dynamic allocation of the compression does not affect the chip's computation efficiency. Allocation methods include but are not limited to load balancing, minimum-load allocation, and so on.
Referring to the device of FIG. 1b, which includes a main processing circuit and basic processing circuits and optionally branch processing circuits: the device includes a main processing circuit (whose concrete structure is shown in FIG. 1c) and N basic processing circuits, which may be connected directly or indirectly. In the indirect case, an optional scheme as in FIG. 1a contains N/4 branch processing circuits, each connected to 4 basic processing circuits; the circuits contained in the main and the N basic processing circuits are as described for FIG. 1a and are not repeated here. Note that the basic processing circuits may also be arranged inside the branch processing circuits, and the number of basic processing circuits connected to each branch processing circuit is not limited to 4 but can be configured by the manufacturer. The main processing circuit and/or the N basic processing circuits may all include compression mapping circuits: the main processing circuit alone may include one, the N basic processing circuits or a subset of them may, or both may. The main processing circuit can dynamically allocate the operating entity of the data compression step according to the neural network computing instruction; specifically, it can decide according to its own load whether to compress the received data. More concretely, the load value can be divided into several intervals, each mapped to an executing entity of the compression step: for example, with 3 intervals, interval 1 (low load) means the main processing circuit alone performs the compression step, interval 2 (load between intervals 1 and 3) means the main processing circuit or the N basic processing circuits jointly perform it, and interval 3 (high load) means the N basic processing circuits perform it. The allocation may be explicit: the main processing circuit can configure a special indication or instruction, and a basic processing circuit performs the compression step when it receives it and does not when it does not. It may also be implicit: for example, a basic processing circuit compresses sparse data (data containing zeros, or containing more than a preset number of values below a preset threshold) when it receives such data and determines that an inner-product operation is to be performed.
Embodiments of the data compression processing of the present application are described below. Note that the data in the present application may be input neurons or weights of a neural network, specifically matrix data or vector data, without limitation; i.e., the data or data blocks described below may be the input neurons or weights of the neural network, embodied as matrices or vectors.
Since a neural network is an algorithm with high computation and high memory access, the more weights there are, the larger the amount of computation and memory traffic. In particular, for small weights (e.g., 0, or weights below a set value), compression is required to increase the computation rate and reduce overhead. In practice, data compression is most effective when applied in sparse neural networks: it reduces the data computation workload, reduces data overhead and raises the data computation rate.
Taking input data as an example, the data compression embodiments of the compression mapping circuit are elaborated. The input data include but are not limited to at least one input neuron and/or at least one weight.
First embodiment: the compression mapping circuit compresses both the input neurons and the weights.
After receiving the input data (specifically, the data block to be computed sent by the main processing circuit), the compression mapping circuit 101 can compress them to obtain compressed input data; the input data include at least one input neuron and at least one weight, and the compressed input data include compressed input neurons and compressed weights.
The compression mapping circuit 101 determines whether the absolute value of each of the at least one input neuron is greater than a first threshold. If an input neuron's absolute value is less than or equal to the first threshold, that input neuron is deleted; if greater, it is retained; the circuit outputs the neurons remaining after deletion as the compressed input neurons. The compression mapping circuit 101 obtains connection relationship data of the input neurons, which indicate the positions of the input neurons whose absolute values exceed the first threshold. The circuit likewise determines whether the absolute value of each of the at least one weight is greater than a second threshold; weights at or below the second threshold are deleted, and according to the connection relationship data of the input neurons the relevant weights are selected from the remaining weights and output as the compressed weights.
In a feasible embodiment, the compression mapping circuit 101 first determines whether the absolute value of each weight exceeds the second threshold, deletes weights at or below it, retains the others and outputs the remaining weights as the compressed weights; it obtains connection relationship data of the weights, i.e., data representing the connection relationships between the at least one input neuron and the output neurons. It then determines whether the absolute value of each input neuron exceeds the first threshold, deletes those at or below it, and according to the connection relationship data of the weights selects the relevant input neurons from the remaining ones and outputs them as the compressed input neurons.
Further, the compression mapping circuit 101 stores the compressed input neurons and the compressed weights in the storage circuit in a one-to-one correspondence format.
Specifically, the one-to-one correspondence format means that each compressed input neuron and its corresponding compressed weight are taken as one data set, and the data set is stored in the storage circuit.
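A minimal Python sketch of this one-to-one storage format follows (names are hypothetical; storage is modeled as a plain list of pairs rather than a hardware storage circuit):

```python
def pack_pairs(neurons, weights, mask):
    """Store each surviving (input neuron, weight) pair as one data set,
    mirroring the one-to-one correspondence format described above;
    `mask` is the 0/1 connection relationship data."""
    return [(n, w) for n, w, keep in zip(neurons, weights, mask) if keep]

storage = pack_pairs([1, 0, 3, 5], [0.5, 0.0, 0.2, 0.7], [1, 0, 1, 1])
print(storage)  # [(1, 0.5), (3, 0.2), (5, 0.7)]
```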
Specifically, as shown in FIG. 1d, the compression mapping circuit 101 includes:
a first sparse processing unit 1011, configured to compress second input data to obtain third output data and second output data, and to transmit the third output data to a first data processing unit 1012;
the first data processing unit 1012, configured to receive first input data and the third output data, and to output first output data according to the third output data and the first input data.
When the first input data include at least one input neuron and the second input data include at least one weight: the first output data are the compressed input neurons, the second output data are the compressed weights, and the third output data are the connection relationship data of the weights. When the first input data include at least one weight and the second input data include at least one input neuron: the first output data are the compressed weights, the second output data are the compressed input neurons, and the third output data are the connection relationship data of the input neurons.
Specifically, when the second input data are weights of the form wij (the weight between the i-th input neuron and the j-th output neuron), the first sparse processing unit 1011 determines the connection relationship data (the third output data) from the weights and deletes the weights whose absolute values are at or below the second threshold, obtaining the compressed weights (the second output data); when the second input data are input neurons, the unit derives connection relationship data from them and deletes the input neurons whose absolute values are at or below the first threshold, obtaining the compressed input neurons.
Optionally, the first threshold may be 0.1, 0.08, 0.05, 0.02, 0.01, 0 or another value; the second threshold may be 0.1, 0.08, 0.06, 0.05, 0.02, 0.01, 0 or another value. The first and second thresholds may or may not coincide.
The connection relationship data may be expressed in the form of a stride index or a direct index.
Specifically, connection relationship data expressed as a direct index are a string of 0s and 1s. For weights, 0 means the absolute value of the weight is at or below the second threshold, i.e., there is no connection between the input neuron and output neuron corresponding to that weight; 1 means the absolute value exceeds the second threshold, i.e., there is a connection. There are two orderings for the direct-index form: the connection states of each output neuron with all input neurons form one 0/1 string representing the connection relationships of the weights, or the connection states of each input neuron with all output neurons form one such string. For input neurons, 0 means the absolute value of the input neuron is at or below the first threshold and 1 means it exceeds the first threshold.
It should be understood that the connection relationship data can also be embodied as a vector/matrix or similar, where 0 indicates that the input neuron/weight at that position is 0 or below the first threshold, and correspondingly 1 indicates that it is nonzero or above the first threshold; the present application is not limited in this respect. Optionally, the connection relationship data of the data may also be called a mask matrix/mask vector.
When the second input data are weights, connection relationship data expressed as a stride index are a string of the distances between each input neuron connected to an output neuron and the previous input neuron connected to that output neuron; when the second input data are input neurons, the stride-index form is a string of the distances between each input neuron whose absolute value exceeds the first threshold and the previous such input neuron.
For example, assume both the first and second thresholds are 0.01, and refer to FIG. 1e, a schematic diagram of a neural network provided by an embodiment of the present application. As shown in diagram a of FIG. 1e, the first input data are the input neurons i1, i2, i3 and i4, and the second input data are the weights: for output neuron o1 the weights are w11, w21, w31 and w41; for output neuron o2 they are w12, w22, w32 and w42, where w21, w12 and w42 are 0 and their absolute values are below the threshold 0.01. The first sparse processing unit 1011 determines that i2 is not connected to o1, that i1 and i4 are not connected to o2, that i1, i3 and i4 are connected to o1, and that i2 and i3 are connected to o2. Expressing the connection relationship data by the connection state of each output neuron with all input neurons, the connection relationship data of o1 are "1011" and those of o2 are "0110" (i.e., the overall connection relationship data are "10110110"); expressing them by each input neuron with all output neurons, the data of i1 are "10", of i2 "01", of i3 "11" and of i4 "10" (i.e., "10011110" overall).
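The following sketch reproduces the "1011"/"0110" strings of this example; the function name is illustrative, and weights[i][j] stands for w(i+1)(j+1):

```python
import numpy as np

def direct_index(weights: np.ndarray, threshold: float = 0.01):
    """Direct-index connection data: one 0/1 string per output neuron
    (its connection state with all input neurons), and one per input
    neuron (its connection state with all output neurons)."""
    conn = (np.abs(weights) > threshold).astype(int)
    per_output = ["".join(map(str, conn[:, j])) for j in range(conn.shape[1])]
    per_input = ["".join(map(str, conn[i, :])) for i in range(conn.shape[0])]
    return per_output, per_input

# w21, w12 and w42 are zero, as in the example of FIG. 1e.
W = np.array([[0.9, 0.0],
              [0.0, 0.6],
              [0.7, 0.8],
              [0.5, 0.0]])
print(direct_index(W))  # (['1011', '0110'], ['10', '01', '11', '10'])
```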
For output neuron o1, the compression mapping circuit 101 takes i1 with w11, i3 with w31 and i4 with w41 each as one data set and stores these data sets in the storage circuit; for output neuron o2, it takes i2 with w22 and i3 with w32 each as one data set and stores them in the storage circuit.
For o1 the second output data are w11, w31 and w41; for o2 the second output data are w22 and w32.
If the second input data are the input neurons i1, i2, i3 and i4 with values 1, 0, 3 and 5, then the connection relationship data (the third output data) are "1011" and the second output data are 1, 3, 5.
As shown in diagram b of FIG. 1e, the first input data include the input neurons i1, i2, i3 and i4 and the second input data are the weights: for o1 the weights are w11, w21, w31 and w41; for o2 they are w12, w22, w32 and w42, where w21, w12 and w42 are 0. The sparse processing unit 1011 determines that i1, i3 and i4 are connected to o1 and that i2 and i3 are connected to o2. The connection relationship data between o1 and the input neurons are "021": the first digit "0" means the distance between the first input neuron connected to o1 and the first input neuron is 0, i.e., the first connected input neuron is i1; the second digit "2" means the distance between the second input neuron connected to o1 and the first one (i.e., i1) is 2, i.e., the second connected input neuron is i3; the third digit "1" means the distance between the third and the second input neurons connected to o1 is 1, i.e., the third connected input neuron is i4.
The connection relationship data between o2 and the input neurons are "11": the first digit "1" means the distance between the first input neuron connected to o2 and the first input neuron (i.e., i1) is 1, i.e., the first connected input neuron is i2; the second digit "1" means the distance between the second and the first input neurons connected to o2 is 1, i.e., the second connected input neuron is i3.
For o1, the compression mapping circuit 101 takes i1 with w11, i3 with w31 and i4 with w41 each as one data set and stores them in the storage circuit; for o2, it takes i2 with w22 and i3 with w32 each as one data set and stores them in the storage circuit.
For o1 the second output data are w11, w31 and w41; for o2 the second output data are w22 and w32.
If the second input data are the input neurons i1, i2, i3 and i4 with values 1, 0, 3 and 5, then the connection relationship data (the third output data) are "021" and the second output data are 1, 3, 5.
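A small sketch of the conversion from direct-index to stride-index form, checked against the two examples above (the function name is illustrative):

```python
def to_stride_index(direct: str) -> list:
    """Stride-index form: for each connected input neuron, the distance
    to the previous connected one (the first distance is counted from
    input neuron i1)."""
    positions = [i for i, bit in enumerate(direct) if bit == "1"]
    strides, prev = [], 0
    for p in positions:
        strides.append(p - prev)
        prev = p
    return strides

print(to_stride_index("1011"))  # [0, 2, 1] -> "021": i1, i3, i4 connect to o1
print(to_stride_index("0110"))  # [1, 1]   -> "11":  i2, i3 connect to o2
```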
When the first input data are input neurons, the second input data are weights, and the third output data are the connection relationship data between the output neuron and the input neurons. Having received the input neurons, the first data processing unit 1012 removes those input neurons whose absolute values are at or below the corresponding threshold, and according to the connection relationship data selects from the remaining input neurons those related to the weights, outputting them as the first output data.
For example, assume the first threshold is 0 and the input neurons i1, i2, i3 and i4 have values 1, 0, 3 and 5; for output neuron o1, the third output data (i.e., the connection relationship data) are "021" and the second output data are w11, w31 and w41. The first data processing unit 1012 removes the zero-valued input neuron from i1–i4, obtaining i1, i3 and i4, and determines from "021" that i1, i3 and i4 are all connected to the output neuron; it therefore outputs i1, i3 and i4 as the first output data, i.e., outputs 1, 3, 5.
When the first input data are weights and the second input data are input neurons, the third output data are the connection relationship data of the input neurons. Having received the weights w11, w21, w31 and w41, the first data processing unit 1012 removes those whose absolute values are below the corresponding threshold and, according to the connection relationship data, selects from the remaining weights those related to the input neurons, outputting them as the first output data.
For example, assume the second threshold is 0 and the weights w11, w21, w31 and w41 have values 1, 0, 3 and 4; for output neuron o1, the third output data (i.e., the connection relationship data) are "1011" and the second output data are i1, i3 and i4. The first data processing unit 1012 removes the zero-valued weight, obtaining w11, w31 and w41, and determines from "1011" that the input neuron i2 among i1–i4 has the value 0; it therefore outputs the weights 1, 3 and 4 as the first output data.
In a feasible embodiment, third input data and fourth input data are at least one weight and at least one input neuron respectively. The compression mapping circuit 101 determines the positions of the input neurons whose absolute values exceed the first threshold and obtains the connection relationship data of the input neurons; it determines the positions of the weights whose absolute values exceed the second threshold and obtains the connection relationship data of the weights. From the connection relationship data of the weights and of the input neurons it derives a new connection relationship data, which represent the relationships between the input neurons whose absolute values exceed the first threshold and the output neurons, together with the values of the corresponding weights. From the new connection relationship data, the at least one input neuron and the at least one weight, the compression mapping circuit 101 obtains the compressed input neurons and the compressed weights.
Further, the compression mapping circuit 101 stores the compressed input neurons and the compressed weights in the storage circuit in the one-to-one correspondence format.
Specifically, each compressed input neuron and its corresponding compressed weight form one data set, which is stored in the storage circuit.
For a compression mapping circuit 101 comprising the first sparse processing unit 1011 and the first data processing unit 1012, the sparsifying compression of the input neurons or weights reduces their number, and hence the number of operations performed by the arithmetic unit, improving operation efficiency.
Specifically, as shown in FIG. 1f, the compression mapping circuit 101 includes:
a second sparse processing unit 1013, configured to derive, upon receiving third input data, first connection relationship data from them and to transmit the first connection relationship data to a connection relationship processing unit 1015;
a third sparse processing unit 1014, configured to derive, upon receiving fourth input data, second connection relationship data from them and to transmit the second connection relationship data to the connection relationship processing unit 1015;
the connection relationship processing unit 1015, configured to derive third connection relationship data from the first and second connection relationship data and to transmit the third connection relationship data to a second data processing unit 1016;
the second data processing unit 1016, configured to compress, upon receiving the third input data, the fourth input data and the third connection relationship data, the third and fourth input data according to the third connection relationship data, obtaining fourth output data and fifth output data.
When the third input data include at least one input neuron and the fourth input data include at least one weight: the first connection relationship data are those of the input neurons, the second those of the weights, the fourth output data the compressed input neurons, and the fifth output data the compressed weights. When the third input data include at least one weight and the fourth input data include at least one input neuron: the first connection relationship data are those of the weights, the second those of the input neurons, the fourth output data the compressed weights, and the fifth output data the compressed input neurons.
When the third input data include at least one input neuron, the first connection relationship data are a string indicating the positions of the input neurons whose absolute values exceed the first threshold; when they include at least one weight, the first connection relationship data are a string indicating whether there are connections between input neurons and output neurons.
The same holds for the second connection relationship data with respect to the fourth input data.
Note that the first, second and third connection relationship data may all be expressed in stride-index or direct-index form, as described above.
In other words, the connection relationship processing unit 1015 compresses the first and second connection relationship data to obtain the third connection relationship data, which may be expressed as a direct index or a stride index.
Specifically, when the first and second connection relationship data are both expressed as direct indices, the unit 1015 performs an AND operation on them to obtain the third connection relationship data, which are then in direct-index form.
Note that the strings representing the first and second connection relationship data are stored in memory in order of physical address, either from high to low or from low to high.
When the first and second connection relationship data are both expressed as stride indices and their strings are stored in order of physical address from low to high, the unit 1015 accumulates each element of the string of the first connection relationship data with the elements stored at physical addresses lower than that element's; the resulting new elements form fourth connection relationship data. It applies the same processing to the string of the second connection relationship data, obtaining fifth connection relationship data. It then selects the elements common to the fourth and fifth strings, sorts them in ascending order of element value into a new string, and subtracts from each element of the new string its adjacent smaller-valued element, yielding a new element each time; applying this to every element of the new string produces the third connection relationship data.
For example, with stride indices, let the string of the first connection relationship data be "01111" and that of the second be "022". The unit 1015 adds each element of the first string to its preceding neighbor, obtaining the fourth connection relationship data "01234"; the same processing on the second string gives the fifth connection relationship data "024". It selects the elements common to "01234" and "024", obtaining the new string "024", and subtracts each element's preceding neighbor: 0, (2-0), (4-2), yielding the third connection relationship data "022".
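A sketch of exactly this cumulative-sum / intersect / re-difference procedure, checked against the example just given (the function name is illustrative):

```python
import numpy as np

def merge_stride_indices(a, b):
    """Combine two stride-index strings as described above: cumulative-sum
    each into absolute positions, intersect, then re-difference."""
    pos_a = np.cumsum(a)                 # [0,1,1,1,1] -> [0,1,2,3,4]
    pos_b = np.cumsum(b)                 # [0,2,2]     -> [0,2,4]
    common = sorted(set(pos_a) & set(pos_b))
    return [common[0]] + [c - p for p, c in zip(common, common[1:])]

print(merge_stride_indices([0, 1, 1, 1, 1], [0, 2, 2]))  # [0, 2, 2]
```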
When either of the first and second connection relationship data is expressed as a stride index and the other as a direct index, the unit 1015 converts the stride-index one into direct-index form, or the direct-index one into stride-index form, and then compresses as described above to obtain the third connection relationship data (i.e., the fifth output data).
Optionally, when both the first and second connection relationship data are expressed as direct indices, the unit 1015 may convert both into stride-index form and then compress them as described above to obtain the third connection relationship data.
Specifically, the third input data may be input neurons or weights and the fourth input data may be input neurons or weights, the two being of different kinds. The second data processing unit 1016 selects from the third input data (i.e., the input neurons or weights) the data related to the third connection relationship data as the fourth output data, and from the fourth input data the data related to it as the fifth output data.
Further, the second data processing unit 1016 takes each compressed input neuron together with its corresponding compressed weight as one data set and stores the data set in the storage circuit.
For example, if the third input data include the input neurons i1, i2, i3 and i4, the fourth input data include the weights w11, w21, w31 and w41, and the third connection relationship data in direct-index form are "1010", then the fourth output data output by the second data processing unit 1016 are the input neurons i1 and i3, and the fifth output data are the weights w11 and w31; the unit stores (i1, w11) and (i3, w31) as two data sets in the storage circuit.
For a compression mapping circuit 101 comprising the second sparse processing unit 1013, the third sparse processing unit 1014, the connection relationship processing unit 1015 and the second data processing unit 1016, both the input neurons and the weights are sparsified, further reducing their number and hence the workload of the arithmetic unit, improving operation efficiency.
Optionally, before compressing the input data, the compression mapping circuit 101 is further configured to (a minimal sketch follows this passage):
group the at least one input neuron to obtain M groups of input neurons, M being an integer greater than or equal to 1;
judge whether each of the M groups satisfies a first preset condition, namely that the number of input neurons in the group whose absolute values are at or below a third threshold is at or below a fourth threshold;
delete any group of input neurons that does not satisfy the first preset condition;
group the at least one weight to obtain N groups of weights, N being an integer greater than or equal to 1;
judge whether each of the N groups satisfies a second preset condition, namely that the number of weights in the group whose absolute values are at or below a fifth threshold is at or below a sixth threshold;
delete any group of weights that does not satisfy the second preset condition.
Optionally, the third threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0 or another value. The fourth threshold is related to the number of input neurons in a group; optionally, fourth threshold = (number of input neurons in the group) − 1, or another value. The fifth threshold may be 0.5, 0.2, 0.1, 0.05, 0.025, 0.01, 0 or another value. The sixth threshold is related to the number of weights in a group; optionally, sixth threshold = (number of weights in the group) − 1, or another value.
Note that the third and fifth thresholds may be equal or not, as may the fourth and sixth. Optionally, the storage circuit may store the compressed input neurons, the compressed weights and the related operation instructions.
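Here is the sketch of the group-level pruning (names and the example values are illustrative, not from the patent):

```python
def prune_groups(groups, value_thr, count_thr):
    """Keep a group only if the number of its elements with absolute
    value <= value_thr does not exceed count_thr (the first/second
    preset condition described above)."""
    return [g for g in groups
            if sum(1 for x in g if abs(x) <= value_thr) <= count_thr]

groups = [[0.9, 0.8, 0.0], [0.0, 0.01, 0.02]]
print(prune_groups(groups, value_thr=0.05, count_thr=1))  # [[0.9, 0.8, 0.0]]
```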
In an optional embodiment, when the connection relationship data of the input data are already known, the compression mapping circuit shown in FIG. 1g can use them to compress the input data, which include at least one input neuron or at least one weight. Specifically, as shown in FIG. 1g, the compression mapping circuit 601 includes:
an input data cache unit 6011, configured to cache first input data, which include at least one input neuron or at least one weight;
a connection relationship cache unit 6012, configured to cache the connection relationship data of the first input data, i.e., the connection relationship data of the input neurons or of the weights.
Here, the connection relationship data of the input neurons are a string indicating whether the absolute values of the input neurons are at or below the first threshold; the connection relationship data of the weights are a string indicating whether the absolute values of the weights are at or below the first threshold, or a string indicating whether there is a connection between the input and output neurons corresponding to each weight. Both may be expressed in direct-index or stride-index form.
For the descriptions of direct indices and stride indices, refer to the relevant description of the embodiment of FIG. 1b above.
A fourth sparse processing unit 6013 is configured to compress the first input data according to their connection relationship data to obtain compressed first input data, and to store the compressed first input data in the first input cache unit.
When the first input data are at least one input neuron, the fourth sparse processing unit 6013 compresses one input neuron and one connection relation per clock cycle, i.e., selects one valid input neuron out of S1 input neurons per clock cycle, S1 being an integer greater than 1.
In a feasible embodiment, the unit 6013 compresses multiple input neurons and multiple connection relationship data per clock cycle, i.e., selects S2 valid input data out of S1 per clock cycle, S2 being an integer greater than 0 and at most S1.
For example, as shown in FIG. 1h, the input neurons are i1, i2, i3 and i4, the connection relationship data in direct-index form are "1011", and the unit 6013 can select one connected (i.e., valid) input neuron out of 4 per clock cycle. After fetching the input neurons i1–i4 and the connection relationship data "1011" from the input data cache unit 6011 and the connection relationship cache unit 6012 respectively, the unit 6013 selects the connected input neurons i1, i3 and i4 according to "1011". Since it can select one connected input neuron per cycle, it outputs i1, i3 and i4 successively over three clock cycles, as shown in FIG. 1h, and stores them in the first input cache unit.
As another example, as shown in FIG. 1i, the input neurons are i1–i4 and there are two groups of connection relationship data in direct-index form, "1011" and "0101"; the unit 6013 can select 2 connected (i.e., valid) input neurons out of 4 per clock cycle. According to "1011" it selects the connected input neurons i1, i3 and i4; according to "0101" it selects i2 and i4. Since it selects 2 per cycle, for "1011" it selects i1 and i3 from i1, i3 and i4 in the first cycle and stores them in the first input cache unit, and selects i4 in the second cycle and stores it in the first input cache unit; for "0101" it selects i2 and i4 in a single cycle, as shown in FIG. 1i, and stores them in the first input cache unit.
As another example, as shown in FIG. 1j, the input data are the input neurons i1–i4 and the connection relationship data in stride-index form are "021"; the unit 6013 can select one connected (i.e., valid) input neuron out of 4 per clock cycle. After fetching the input neurons i1–i4 and the connection relationship data "021" from the two cache units, the unit 6013 selects the connected input neurons i1, i3 and i4 according to "021". Since it can select one connected input neuron per cycle, it outputs i1, i3 and i4 successively over three clock cycles, as shown in FIG. 1j, and stores them in the first input cache unit.
As another example, as shown in FIG. 1k, the input data are the input neurons i1–i4 and there are two groups of connection relationship data in stride-index form, "021" and "22"; the unit 6013 can select 2 connected (i.e., valid) input neurons out of 4 per clock cycle. According to "021" it selects the connected input neurons i1, i3 and i4; according to "22" it selects i2 and i4. Since it selects 2 per cycle, for "021" it selects i1 and i3 in the first cycle and stores them in the first input cache unit, then selects i4 in the second cycle and stores it in the first input cache unit; for "22" it selects and outputs i2 and i4 in a single cycle, as shown in FIG. 1k, and stores them in the first input cache unit.
In a feasible embodiment, when the first input data cached by the input data cache unit 6011 include at least one weight, the data cached by the connection relationship cache unit 6012 are the connection relationship data of the weights, and the absolute values of all the weights exceed the first threshold, the fourth sparse processing unit 6013 sets to 0, according to the connection relationship data of the weights, the weights between unconnected pairs of input and output neurons, and stores the zero-valued weights together with the at least one weight in the second input cache unit.
For example, weights have the form wij, denoting the weight between the i-th input neuron and the j-th output neuron. Suppose the input neurons include i1, i2, i3 and i4 and the output neurons include o1; the first input data (weights) are w11, w31 and w41, and the connection relationship data of the first input data (i.e., of the weights) in direct-index form are "1011". The unit 6013 determines from these data that the input neuron i2 is not connected to the output neuron o1, sets the weight w21 between i2 and o1 to 0, and stores w11, w21 (= 0), w31, w41 in the second input cache unit.
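A small sketch of this zero re-insertion, checked against the example above (the function name is illustrative; symbolic weight values stand in for real numbers):

```python
def expand_weights(kept_weights, direct_index):
    """Re-insert zeros for pruned connections, as in the example above:
    positions whose connection bit is 0 get weight 0."""
    it = iter(kept_weights)
    return [next(it) if bit == "1" else 0.0 for bit in direct_index]

print(expand_weights([0.9, 0.7, 0.5], "1011"))  # [0.9, 0.0, 0.7, 0.5]
```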
The first input cache unit is configured to cache the compressed input neurons; the second input cache unit is configured to cache the compressed weights read from the storage circuit.
In a feasible embodiment, when the first input data are at least one weight, the fourth sparse processing unit 6013 compresses one weight and one connection relation per clock cycle, i.e., selects one valid weight out of S3 weights per clock cycle, S3 being an integer greater than 1.
Optionally, the unit 6013 compresses multiple weights and multiple connection relationship data per clock cycle, i.e., selects S4 valid weights out of S3 per clock cycle, S4 being an integer greater than 0 and at most S3.
In that case the first input cache unit caches the compressed weights, and the second input cache unit caches the compressed input neurons read from the storage circuit.
Note that the related descriptions can be found in the foregoing embodiments and are not repeated here.
Optionally, before compressing the first input data, the compression mapping circuit 601 is further configured to: group the at least one input neuron to obtain M groups of input neurons, M being an integer greater than or equal to 1; judge whether each of the M groups satisfies the first preset condition, namely that the number of input neurons in the group whose absolute values are at or below the third threshold is at or below the fourth threshold; delete any group of input neurons that does not satisfy it; group the at least one weight to obtain N groups of weights, N being an integer greater than or equal to 1; judge whether each of the N groups satisfies the second preset condition, namely that the number of weights in the group whose absolute values are at or below the fifth threshold is at or below the sixth threshold; and delete any group of weights that does not satisfy it.
Note that the related descriptions can be found in the foregoing embodiments and are not repeated here. The first to sixth thresholds may all be stored in the storage circuit or in the first output cache unit; alternatively, some of them may be stored in the storage circuit and some in the first output cache unit. The first input cache unit, the second input cache unit and the output cache unit may each be functional units of the compression mapping circuit or of the main processing circuit, or functional units shared with other processing circuits; the present application is not limited in this respect.
In an optional embodiment, the connection relationship data of the input neurons and of the weights consist of strings/matrices of 0s and 1s, where 0 indicates that the absolute value of the input neuron/weight is at or below the first threshold and 1 that it exceeds the first threshold, independently of the output neurons. In this embodiment the connection relationship data (i.e., the connection relationship data of the neurons/weights) may also be called a mask matrix.
Besides direct indices and stride indices, the connection relationship data of the weights and/or neurons in the present application may also be represented as: List of Lists (LIL), Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELL Pack (ELL), Hybrid (HYB), etc.; the present application does not detail these.
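As an illustration of one of the listed formats, here is a hand-rolled CSR conversion, a minimal sketch rather than any particular library's API:

```python
import numpy as np

def to_csr(dense: np.ndarray):
    """CSR: the nonzero values, their column indices, and row pointers
    marking where each row's nonzeros start in the value array."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

print(to_csr(np.array([[1, 0, 2], [0, 0, 3]])))
# ([1, 2, 3], [0, 2, 2], [0, 2, 3])
```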
Furthermore, the input neurons and output neurons mentioned in the embodiments of the present application do not refer to the neurons of the input layer and output layer of the whole network: for any two adjacent layers of the network, the neurons in the lower layer of the network feed-forward operation are the input neurons and those in the upper layer are the output neurons. Taking a convolutional neural network with L layers as an example, for K = 1, 2, …, L−1, the K-th layer is called the input layer, its neurons being the input neurons, and the (K+1)-th layer is called the output layer, its neurons being the output neurons; i.e., except for the top layer, every layer can serve as an input layer, its next layer being the corresponding output layer.
A computation method using the device shown in FIG. 1a is provided below. It may specifically be a neural network computation, e.g., the forward operation or the training of a neural network; in practice the forward operation may, depending on the input data, perform matrix-times-matrix, convolution, activation, transformation and other operations, all of which can be realized with the device of FIG. 1a.
The control circuit of the main processing circuit transmits data to the basic processing circuits through the branch processing circuits, where the branch processing circuits may first compress the data via their compression mapping circuits and then forward them to the basic processing circuits for operation. For example, the compression circuit of a branch processing circuit compresses the data and then transmits the compressed data to the basic processing circuit; the advantage is that the amount of transmitted data and the total number of transmitted bits are reduced, and the basic processing circuit performs the data operations more efficiently with lower power consumption.
If the data received by a branch processing circuit are sparse, the branch processing circuit can have its compression mapping circuit compress them before the computation proceeds: e.g., on receiving sparse data transmitted by the main processing circuit, the compression mapping circuit compresses them and then sends them to the inner-product arithmetic unit circuit, vector arithmetic unit circuit or accumulator circuit of the basic processing circuit to operate on the compressed data, improving operation efficiency and lowering power consumption.
The main processing circuit transmits the data to be computed to all or a part of the basic processing circuits. Taking matrix-times-vector as an example, the control circuit of the main processing circuit can split the matrix by columns, each column serving as one basic datum: e.g., an m*n matrix can be split into n vectors of m rows, and the control circuit distributes the n split vectors to the basic processing circuits. For the vector, the control circuit can broadcast it as a whole to each basic processing circuit. If m is large, the control circuit can first split the m*n matrix into x*n vectors; taking x = 2, the matrix is split into 2n vectors, each of m/2 rows, i.e., each of the n m-row vectors is split evenly into 2 vectors. Taking the first one as an example, if the first of the n m-row vectors has 1000 rows, the first 500 rows form the first vector and the last 500 rows the second, and the control circuit broadcasts the two vectors to the basic processing circuits in two broadcasts.
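A minimal sketch of this column split and two-stage broadcast (the function name and the example shapes are illustrative):

```python
import numpy as np

def split_matrix_for_broadcast(S: np.ndarray, x: int = 2):
    """Cut an m*n matrix into n column vectors of m rows, then cut each
    column into x segments sent in x separate broadcasts (e.g. rows
    0..499 and 500..999 for m = 1000, x = 2, as in the text above)."""
    m, n = S.shape
    columns = [S[:, j] for j in range(n)]
    seg = m // x
    return [[col[k * seg:(k + 1) * seg] for col in columns] for k in range(x)]

batches = split_matrix_for_broadcast(np.arange(12.0).reshape(6, 2), x=2)
# batches[0]: first 3 rows of each column; batches[1]: last 3 rows.
```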
The data transmission may be a broadcast or a distribution, or any other possible transmission mode;
after receiving the data, the basic processing circuits perform the operations to obtain the operation results;
the basic processing circuits transmit the operation results back to the main processing circuit;
the operation results may be intermediate operation results or final operation results.
Completing a matrix-times-vector operation using the device shown in FIG. 1a:
(a matrix-times-vector operation may compute the inner product of each row of the matrix with the vector and arrange these results into a vector in the order of the corresponding rows.)
The following describes the multiplication of a matrix S of size M rows and L columns with a vector P of length L, as shown in FIG. 2a (each row of S has the same length as P, and their data correspond one-to-one by position). The neural network computing device has K basic processing circuits:
Referring to FIG. 2, a method of implementing matrix-times-vector is provided, which may specifically include:
step S201: the control circuit of the main processing circuit distributes the data of each row of the matrix S to one of the K basic processing circuits, and the basic processing circuit saves the received distribution data in its on-chip cache and/or registers;
In one optional scheme, when the device includes branch circuits containing compression mapping circuits, the control circuit of the main processing circuit distributes the data of each row of the input matrix S (M rows, L columns) through the branch processing circuits, which compress them before forwarding them to one of the K basic processing circuits; the basic processing circuit saves the received distribution data in its on-chip cache and/or registers.
Specifically, a branch processing circuit may receive from the main processing circuit an input matrix S1 (M1 rows, L1 columns), with M1 ≤ M and L1 ≤ L; S1 is a part of S, namely the distribution data block described above. Further, the compression mapping circuit of the branch processing circuit compresses each row of S1 to obtain a compressed matrix S2 (M2 rows, L2 columns), which it forwards to the basic processing circuit, where M ≥ M1 and M ≥ M2, and L ≥ L1 and L ≥ L2.
For example, the compression mapping circuit removes from the matrices S2 and P2 the data equal to a specified value (e.g., 0) and/or smaller than a preset threshold (e.g., 0.1); in implementation this can be done according to the mask matrices of S2 and P2, e.g., removing the data of S2/P2 at the positions where the mask matrix is 0; refer to the foregoing data compression embodiments, not repeated here. It should be understood that the matrices S and P here may also be understood as the input neurons (input neuron matrix) and weights (weight matrix) of the foregoing embodiments.
In one optional scheme, if the number of rows M of S is ≤ K, the control circuit of the main processing circuit distributes one row of S to each of the K basic processing circuits;
in one optional scheme, if M > K, the control circuit distributes the data of one or more rows of S to each basic processing circuit.
The set of rows of S distributed to the i-th basic processing circuit is Ai, with Mi rows in total; FIG. 2c shows the computation to be performed by the i-th basic processing circuit.
In one optional scheme, in each basic processing circuit, e.g., the i-th, the received distribution data, e.g., the matrix Ai, can be saved in the registers and/or on-chip cache of the i-th basic processing circuit; the benefit is less subsequent transmission of distribution data, higher computation efficiency and lower power consumption.
step S202: the control circuit of the main processing circuit transmits the parts of the vector P to the K basic processing circuits by broadcast;
In one optional scheme, when the device includes branch circuits containing compression mapping circuits, the control circuit broadcasts the parts of the input vector P (of length L) through the corresponding branch processing circuits, which compress them before transmitting them to the K basic processing circuits.
Specifically, a branch processing circuit may receive from the main processing circuit an input vector P1 (length L1), with L1 ≤ L; P1 is a part of P, namely the broadcast data block described above. Further, the compression mapping circuit of the branch processing circuit compresses the data of P1 to obtain a compressed vector P2 (L2 entries), which it forwards to the basic processing circuit, where L2 ≤ L1 and L2 ≤ L.
In one optional scheme, the control circuit can broadcast each part of P only once into the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit then fully reuses this one copy of P's data, completing the inner products corresponding to every row of the matrix Ai. The benefit is less repeated transmission of P from the main processing circuit to the basic processing circuits, higher execution efficiency and lower transmission power.
In one optional scheme, the control circuit broadcasts the parts of P to the basic processing circuits' registers or on-chip caches multiple times, and the i-th basic processing circuit does not reuse the data obtained in each broadcast, completing the inner products corresponding to the rows of Ai in stages; the benefits are a smaller single transmission of P inside the basic processing circuit, a lower required capacity of the basic processing circuit's cache and/or registers, higher execution efficiency, lower transmission power and lower cost.
In one optional scheme, the control circuit broadcasts the parts of P multiple times, and the i-th basic processing circuit partially reuses the data obtained in each broadcast, completing the inner products corresponding to the rows of Ai; the benefit is reduced data transmission both from the main processing circuit to the basic processing circuits and inside the basic processing circuits, higher execution efficiency and lower transmission power.
step S203: the inner-product arithmetic unit circuits of the K basic processing circuits compute the inner products of the data of the matrix S and the vector P; e.g., the i-th basic processing circuit computes the inner product of the data of Ai and the data of P;
In one optional scheme, when the compression mapping circuit of the device is arranged in the basic processing circuits, a basic processing circuit that has received the matrix S and vector P sent by the main processing circuit can first compress them with its compression mapping circuit and then compute, with its inner-product arithmetic unit circuit, the inner products of the compressed S and P.
Specifically, the compression mapping circuit compresses the input matrix S into a compressed matrix; e.g., the data of S and P equal to a specified value (e.g., 0) and/or below a preset threshold (e.g., 0.1) are removed, in implementation according to the respective mask matrices of S and P (e.g., removing the data of S/P at the positions where the mask matrix is 0); refer to the foregoing data compression embodiments, not repeated here. The matrices S and P may again be understood as the input neurons (input neuron matrix) and weights (weight matrix) of the foregoing embodiments.
step S204: the accumulator circuits of the K basic processing circuits accumulate the inner-product results into accumulation results and transmit the accumulation results back to the main processing circuit in fixed-point form.
In one optional scheme, the partial sum obtained in each inner-product operation of a basic processing circuit (a partial sum is a part of the accumulation result: e.g., if the accumulation result is F1*G1 + F2*G2 + F3*G3 + F4*G4 + F5*G5, a partial sum may be the value of F1*G1 + F2*G2 + F3*G3) can be sent back to the main processing circuit for accumulation; the benefit is less computation inside the basic processing circuit and higher operation efficiency there.
In one optional scheme, the partial sums obtained in each inner-product operation can also be kept in the registers and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation finishes; the benefit is less data transmission between the basic and main processing circuits, higher operation efficiency and lower data transmission power.
In one optional scheme, the partial sums can also be kept and accumulated in the registers and/or on-chip cache of the basic processing circuit in some cases, and transmitted to the main processing circuit for accumulation in other cases, being transmitted back after the accumulation finishes; the benefits combine the above: less data transmission between the circuits, higher operation efficiency, lower data transmission power, less computation inside the basic processing circuit and higher efficiency there.
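A small sketch of the partial-sum scheme just described, using the F/G example values symbolically replaced by numbers (chunk size and names are illustrative):

```python
def inner_product_partial_sums(F, G, chunk: int = 3):
    """Return the inner product of F and G as a list of partial sums
    (e.g. F1*G1 + F2*G2 + F3*G3 first) plus the finished total, which
    the main processing circuit could also accumulate itself."""
    partials = []
    for start in range(0, len(F), chunk):
        partials.append(sum(f * g for f, g in
                            zip(F[start:start + chunk], G[start:start + chunk])))
    return partials, sum(partials)

print(inner_product_partial_sums([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))
# ([6, 9], 15)
```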
Referring to FIG. 2b, completing a matrix-times-matrix operation using the device shown in FIG. 1a:
The following describes the multiplication of a matrix S of size M rows, L columns with a matrix P of size L rows, N columns (each row of S has the same length as each column of P, as shown in FIG. 2d). The neural network computing device has K basic processing circuits:
step S201b: the control circuit of the main processing circuit distributes the data of each row of S to one of the K basic processing circuits, and the basic processing circuit saves the received data in its on-chip cache and/or registers;
In one optional scheme, the branch processing circuits contain compression mapping circuits, and the control circuit distributes the data of each row of S through a branch processing circuit, which compresses them before forwarding them to one of the K basic processing circuits; the basic processing circuit saves the received data in its on-chip cache and/or registers.
Specifically, the control circuit distributes the data of each row of the input matrix S (M rows, L columns) through the branch processing circuits. Correspondingly, a branch processing circuit may receive an input matrix S1 (M1 rows, L1 columns) with M1 ≤ M and L1 ≤ L; its compression mapping circuit compresses each row of S1 to obtain a compressed matrix S2 (M2 rows, L2 columns), which it forwards to the corresponding basic processing circuit, where M ≥ M1 and M ≥ M2, and L ≥ L1 and L ≥ L2.
For example, the compression mapping circuit removes from S2 and P2 the data equal to a specified value (e.g., 0) and/or below a preset threshold (e.g., 0.1), in implementation according to their respective mask matrices (e.g., removing the data of S2/P2 at the positions where the mask matrix is 0); refer to the foregoing data compression embodiments, not repeated here. The matrices S and P may again be understood as the input neurons (input neuron matrix) and weights (weight matrix) of the foregoing embodiments.
In one optional scheme, if the number of rows M of S is ≤ K, the control circuit distributes one row of S to each of M basic processing circuits;
in one optional scheme, if M > K, the control circuit distributes the data of one or more rows of S to each basic processing circuit.
Mi rows of S are distributed to the i-th basic processing circuit; the set of these Mi rows is called Ai, and FIG. 2e shows the computation to be performed by the i-th basic processing circuit.
In one optional scheme, in each basic processing circuit, e.g., the i-th: the matrix Ai received from the main processing circuit is saved in the registers and/or on-chip cache of the i-th basic processing circuit; the benefit is less subsequent data transmission, higher computation efficiency and lower power consumption.
step S202b: the control circuit of the main processing circuit transmits the parts of the matrix P to the basic processing circuits by broadcast;
In one optional scheme, the branch processing circuits contain compression mapping circuits, and the control circuit broadcasts the parts of P through the branch processing circuits, which compress them before transmitting them to the basic processing circuits.
Specifically, a branch processing circuit may receive an input vector P1 (length L1) with L1 ≤ L; P1 is a part of P, namely the broadcast data block described above. Its compression mapping circuit compresses the data of P1 to obtain a compressed vector P2 (L2 entries), which it forwards to the basic processing circuit, where L2 ≤ L1 and L2 ≤ L.
In one optional scheme, the parts of P can be broadcast only once into the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit fully reuses this one copy of the data of P, completing the inner products corresponding to every row of Ai. Reuse in this embodiment specifically means repeated use by the basic processing circuit in its computation, e.g., using the data of P multiple times.
In one optional scheme, the control circuit broadcasts the parts of P to the basic processing circuits' registers or on-chip caches multiple times, and the i-th basic processing circuit does not reuse the data obtained in each broadcast, completing the inner products corresponding to the rows of Ai in stages;
in one optional scheme, the control circuit broadcasts the parts of P multiple times, and the i-th basic processing circuit partially reuses the data obtained in each broadcast, completing the inner products corresponding to the rows of Ai;
In one optional scheme, each basic processing circuit, e.g., the i-th, computes the inner products of the data of Ai and the data of P;
step S203b: the accumulator circuit of each basic processing circuit accumulates the inner-product results and transmits them back to the main processing circuit.
In one optional scheme, when the basic processing circuits contain compression mapping circuits, the inner-product result may be the result of the basic processing circuit first compressing S and P and then computing, with its inner-product arithmetic unit circuit, the inner products of the compressed data.
Specifically, the compression mapping circuit compresses the input matrices S (M1 rows, L1 columns) and P (L1 rows, N1 columns) into compressed matrices S (M rows, L columns) and P (L rows, N columns); the arithmetic unit of the basic processing unit then performs the inner-product operation on the compressed S and P to obtain the inner-product result. For example, the data of S and P equal to a specified value (e.g., 0) and/or below a preset threshold (e.g., 0.1) are removed according to their respective mask matrices (e.g., removing the data of S/P at the positions where the mask matrix is 0); refer to the foregoing embodiments. The matrices S and P may again be understood as the input neurons (input neuron matrix) and weights (weight matrix).
In one optional scheme, the basic processing circuit can send the partial sum obtained in each inner-product operation back to the main processing circuit for accumulation;
in one optional scheme, the partial sums can also be kept in the registers and/or on-chip cache of the basic processing circuit and transmitted back after the accumulation finishes;
in one optional scheme, the partial sums can also partly be kept and accumulated in the registers and/or on-chip cache of the basic processing circuit, and partly be transmitted to the main processing circuit for accumulation, being transmitted back after the accumulation finishes.
Referring to FIG. 3a, completing a fully connected operation using the device shown in FIG. 1a:
if the input data of the fully connected layer are a vector (i.e., the case where the input of the neural network is a single sample), the weight matrix of the fully connected layer serves as the matrix S and the input vector as the vector P, and the matrix-times-vector operation of FIG. 2 is performed according to method one of the device;
if the input data of the fully connected layer are a matrix (i.e., the case where the input of the neural network is a batch of multiple samples), the weight matrix of the layer serves as the matrix S and the input matrix as the matrix P, or the weight matrix serves as P and the input matrix as S, and the matrix-times-matrix operation described above (see FIG. 2b) is performed by the device.
Referring to FIG. 3b, completing a convolution operation using the device shown in FIG. 1a:
For a convolution layer, let the number of convolution kernels be M;
step S301: the control circuit of the main processing circuit distributes the weights of each convolution kernel of the layer to one of the K basic processing circuits, where they are saved in the on-chip cache and/or registers of the basic processing circuit;
In one optional scheme, the branch processing circuits include compression mapping circuits, and the control circuit distributes the weights of each kernel through the branch processing circuits, which compress them before forwarding them to one of the K basic processing circuits, where they are saved in the on-chip cache and/or registers.
Specifically, after receiving the weights of each kernel of the layer from the main processing circuit, a branch processing circuit can compress them with its compression mapping circuit, obtaining the compressed weights of each kernel, and then forwards them to the basic processing circuits for operation. For the data compression, refer to the foregoing embodiments, not repeated here.
In one optional scheme, if the number of kernels M ≤ K, the control circuit distributes the weights of one kernel to each of M basic processing circuits;
in one optional scheme, if M > K, the control circuit distributes the weights of one or more kernels to each basic processing circuit.
Mi kernels in total are distributed to the i-th basic processing circuit; the set of their weights is called Ai.
In one optional scheme, in each basic processing circuit, e.g., the i-th:
the kernel weights Ai received from the main processing circuit are saved in its registers and/or on-chip cache;
step S302: the control circuit of the main processing circuit transmits the parts of the input data P to the basic processing circuits by broadcast;
In one optional scheme, the branch processing circuits include compression mapping circuits, and the control circuit broadcasts the parts of the input data P through the corresponding branch processing circuits, which compress them before forwarding them to the basic processing circuits; this is not repeated here.
In one optional scheme, the control circuit can broadcast the parts of P only once into the registers or on-chip caches of the basic processing circuits; the i-th basic processing circuit fully reuses this one copy of the data of P, completing the inner products corresponding to every kernel in Ai;
in one optional scheme, the control circuit broadcasts the parts of P multiple times, and the i-th basic processing circuit does not reuse the data obtained in each broadcast, completing the inner products corresponding to the kernels in Ai in stages;
in one optional scheme, the control circuit broadcasts the parts of P multiple times, and the i-th basic processing circuit partially reuses the data obtained in each broadcast, completing the inner products corresponding to the kernels in Ai;
step S303: each basic processing circuit computes the inner products of the kernels and the data of the input data P; e.g., the i-th basic processing circuit computes the inner products of every kernel of Ai with the data of P;
In one optional scheme, when a basic processing circuit includes a compression mapping circuit, after receiving the kernels and the input data P it first compresses them with the compression mapping circuit and then computes, with its inner-product arithmetic unit circuit, the inner products of the compressed kernels and the compressed input data P; e.g., the i-th basic processing circuit computes the inner products of every compressed kernel of Ai with the compressed input data P.
step S304: the accumulator circuit of each basic processing circuit accumulates the inner-product results and transmits them back to the main processing circuit:
in one optional scheme, the basic processing circuit can send the partial sum obtained in each inner-product operation back to the main processing circuit for accumulation;
in one optional scheme, the basic processing circuit can also keep the partial sums in its registers and/or on-chip cache and transmit them back after the accumulation finishes;
in one optional scheme, the basic processing circuit can also partly keep and accumulate the partial sums in its registers and/or on-chip cache, and partly transmit them to the main processing circuit for accumulation, transmitting them back after the accumulation finishes.
A method of updating weights using the device shown in FIG. 1a:
The vector arithmetic unit circuit of the main processing circuit realizes the weight update function of neural network training; specifically, weight update means updating the weights using the gradients of the weights.
In one optional scheme, the vector arithmetic unit circuit of the main processing circuit performs addition/subtraction on the two vectors, the weights and the weight gradients, to obtain an operation result, which is the updated weights;
in one optional scheme, the vector arithmetic unit circuit first multiplies or divides the weights and the weight gradients by a number, obtaining intermediate weights and intermediate weight-gradient values, then performs addition/subtraction on these to obtain an operation result, which is the updated weights;
in one optional scheme, a set of momenta can first be computed from the weight gradients, and the updated weights are then obtained by addition/subtraction of the momenta and the weights.
The present application also provides a chip containing a computing device, which includes:
a main processing circuit, whose data may be compressed data; in an optional embodiment the compressed data include at least one input neuron or at least one weight, each of the at least one neurons being greater than the first threshold or each of the at least one weights greater than the second threshold. The first and second thresholds are user-defined and may or may not coincide.
In one optional scheme, the main processing circuit includes a compression mapping circuit; in one optional scheme, it includes an arithmetic unit performing the data compression, e.g., a vector arithmetic unit; specifically, it contains a data input interface receiving input data.
In one optional scheme, the computing device further includes a branch processing circuit, whose data may likewise be compressed data; in an optional embodiment the compressed data include at least one input neuron or at least one weight, each neuron above the first threshold or each weight above the second; the thresholds are user-defined and may or may not coincide.
In one optional scheme, the branch processing circuit includes a compression mapping circuit;
in one optional scheme, the branch processing circuit includes an arithmetic unit performing the data compression, e.g., a vector arithmetic unit; specifically, it contains a data input interface receiving input data.
In one optional scheme, the received data may come from outside the neural network operation circuit device or from some or all of its basic processing circuits;
in one optional scheme, there may be several data input interfaces; specifically, a data output interface outputting data may be included;
in one optional scheme, the output data may go to the outside of the neural network operation device or to some or all of the basic processing circuits of the neural network operation circuit device;
in one optional scheme, there may be several data output interfaces;
in one optional scheme, the branch processing circuit includes an on-chip cache and/or registers;
in one optional scheme, the branch processing circuit contains an arithmetic unit able to perform data operations;
in one optional scheme, it contains an arithmetic operation unit;
in one optional scheme, it contains a vector operation unit able to operate on a group of data simultaneously; specifically, the arithmetic and/or vector operations may be of any type, including but not limited to: addition, subtraction, multiplication and division of two numbers; addition, subtraction, multiplication and division of a number and a constant; exponential, power and logarithm operations and various nonlinear operations on a number; comparison and logical operations on two numbers; addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of each element of a vector with a constant; exponential, power and logarithm operations and various nonlinear operations on each element of a vector; and comparison and logical operations on each pair of corresponding elements of two vectors.
In one optional scheme, the main processing circuit includes a data rearrangement unit for transmitting data to the basic processing circuits in a certain order, or for rearranging data in place in a certain order;
in one optional scheme, the data arrangement order includes transforming the dimension order of a multi-dimensional data block, and may also include partitioning a data block for sending to different basic processing circuits.
The computing device further includes a plurality of basic processing circuits, each of which computes the inner product of two vectors by multiplying the corresponding elements of the two received groups of numbers and accumulating the products; the inner-product result is transmitted out — depending on the position of the basic processing circuit, possibly to other basic processing circuits or directly to the main processing circuit.
The data involved in the basic processing circuits may be compressed data; in an optional embodiment the compressed data include at least one input neuron or at least one weight, each neuron above the first threshold or each weight above the second; the thresholds are user-defined and may or may not coincide.
In one optional scheme, a basic processing circuit includes a compression mapping circuit;
in one optional scheme, it includes a vector operation unit performing the data compression;
specifically, it includes a storage unit composed of an on-chip cache and/or registers;
specifically, it includes one or more data input interfaces receiving data;
in one optional scheme, it includes two data input interfaces, and one or more data can be obtained from each of them at a time;
in one optional scheme, the basic processing circuit can save input data received from the data input interfaces in its registers and/or on-chip cache;
the sources of the data received by the data input interfaces may be: other basic processing circuits and/or the main processing circuit —
that is, the main processing circuit of the neural network operation circuit device,
or other basic processing circuits of the device (the device having a plurality of basic processing circuits);
specifically, it includes one or more data output interfaces transmitting output data;
in one optional scheme, one or more data can be transmitted out through a data output interface;
specifically, the data transmitted through a data output interface may be one or any combination of: data received from a data input interface, data saved in the on-chip cache and/or registers, multiplier operation results, accumulator operation results, and inner-product operation results.
In one optional scheme, it contains three data output interfaces, two of which correspond to the two data input interfaces, each layer outputting the data received from the data input interfaces of the previous layer, the third data output interface being responsible for outputting operation results;
specifically, the destinations of the data output interfaces (together with the data sources above) determine the connection relationships of the basic processing circuit within the device —
the main processing circuit of the neural network operation circuit device,
or other basic processing circuits of the device, the device having a plurality of basic processing circuits;
specifically, it includes an arithmetic operation circuit, which may concretely be one or any combination of: one or more multiplier circuits, one or more accumulator circuits, and one or more circuits performing the inner-product operation of two groups of numbers.
In one optional scheme, it can perform the multiplication of two numbers, the result being saved in the on-chip cache and/or registers, or directly accumulated into the registers and/or on-chip cache;
in one optional scheme, it can perform the inner-product operation of two groups of data, the result being saved in the on-chip cache and/or registers, or directly accumulated into the registers and/or on-chip cache;
in one optional scheme, it can perform data accumulation, accumulating data into the on-chip cache and/or registers;
specifically, the data accumulated by the accumulator circuit may be one or any combination of: data received from a data input interface, data saved in the on-chip cache and/or registers, multiplier operation results, accumulator operation results, and inner-product operation results.
Note that the terms "data input interface" and "data output interface" used in the above description of the basic processing circuits refer to the data input and output interfaces of each basic processing circuit, not to those of the whole device.
An integrated circuit chip device provided by another aspect of the present application includes a main processing circuit and a plurality of basic processing circuits;
the plurality of basic processing circuits are arranged in an array; each basic processing circuit is connected to its adjacent basic processing circuits, and the main processing circuit is connected to k of them, the k basic circuits being: the n basic processing circuits of the first row and the m basic processing circuits of the first column;
some or all of the plurality of basic processing circuits include a compression mapping circuit configured to compress the individual data of the neural network operation;
the main processing circuit is configured to perform the successive operations of the neural network operation and to exchange data with the k basic processing circuits;
the k basic processing circuits are configured to forward data between the main processing circuit and the plurality of basic processing circuits;
the some or all basic processing circuits are configured to determine, according to the operation control of the transmitted data, whether to start the compression mapping circuit to compress the transmitted data, to perform the neural network operations in parallel on the compressed transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it.
In one optional scheme, when all of the plurality of basic processing circuits include compression mapping circuits, they determine whether to start them, perform the operations in parallel on the compressed transmitted data, and transmit the results through the k basic processing circuits to the main processing circuit.
In one optional scheme, the main processing circuit obtains a data block to be computed and an operation instruction, divides the data block into a distribution data block and a broadcast data block according to the instruction, splits the distribution data block into a plurality of basic data blocks, distributes them to the k basic processing circuits and broadcasts the broadcast data block to the k basic processing circuits; the plurality of basic processing circuits start the compression mapping circuits according to the received basic data blocks, broadcast data block and operation instruction, compress the basic data blocks and the broadcast data block, perform inner-product operations on the compressed blocks to obtain operation results, and transmit the results through the k basic processing circuits to the main processing circuit, which processes them into the instruction result of the data block and the operation instruction. The distribution data block and the broadcast data block are at least one input neuron or at least one weight.
In one optional scheme, when the k basic processing circuits among the plurality include compression mapping circuits, the k basic processing circuits determine, according to the operation control of the transmitted data, whether to start the compression mapping circuits to compress the transmitted data, and send the compressed data to the basic processing circuits connected to them; the plurality of basic processing circuits then perform the neural network operations in parallel on the compressed data and transmit the results to the main processing circuit through the basic processing circuits connected to it.
In one optional scheme, the main processing circuit obtains a data block to be computed and an operation instruction, divides and splits it as above, distributes the basic data blocks to the k basic processing circuits and broadcasts the broadcast data block to them; the k basic processing circuits start the compression mapping circuits according to the received basic data blocks, broadcast data block and operation instruction, compress them, and transmit them to the basic processing circuits connected to the k basic processing circuits; the plurality of basic processing circuits perform inner-product operations on the compressed basic data blocks and the broadcast data block to obtain operation results and send them to the main processing circuit, which processes them into the instruction result. The distribution data block and the broadcast data block are at least one input neuron or at least one weight.
Referring to FIG. 4a, an integrated circuit chip device provided by the present disclosure includes a main processing circuit and a plurality of basic processing circuits arranged as an m*n array, m and n being integers ≥ 1 with at least one of them ≥ 2. Each basic processing circuit of the array is connected to its adjacent basic processing circuits, and the main processing circuit connects k of them, which may be: the n basic processing circuits of the first row, the n basic processing circuits of the m-th row, and the m basic processing circuits of the first column. In the device of FIG. 1a, the main processing circuit and/or the plurality of basic processing circuits may include compression mapping circuits; specifically, a subset of the basic processing circuits may include them. For example, in an optional technical scheme, the k basic processing circuits may be configured with compression mapping circuits, so that the n basic processing circuits of the first row can each take charge of the data compression step for the m basic processing circuits of their column. This setting improves operation efficiency and lowers power consumption: since the n basic processing circuits of the first row are the first to receive the data sent by the main processing circuit, compressing the received data there reduces the computation of the subsequent basic processing circuits and the amount of data transmitted to them; likewise, configuring the m basic processing circuits of the first column with compression mapping circuits also has the advantages of a small amount of computation and low power consumption. In addition, with this structure the main processing circuit can adopt a dynamic data transmission policy, e.g., broadcasting data to the m basic processing circuits of the first column and distributing data to the n basic processing circuits of the first row; the advantage is that different data enter the basic processing circuits through different data input ports, so a basic processing circuit need not distinguish what kind of data it has received — knowing the receiving port suffices to identify the kind of data.
The main processing circuit is configured to perform the successive operations of the neural network operation and to exchange data with the basic processing circuits connected to it; the successive operations include but are not limited to accumulation operations, ALU operations, activation operations and the like.
The plurality of basic processing circuits are configured to perform the neural network operations in parallel on the transmitted data and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it; the operations performed in parallel include but are not limited to inner-product operations, matrix or vector multiplication operations and the like. Specifically, the plurality of basic processing circuits may first compress the transmitted data and then perform the neural network operations in parallel on the compressed data.
The main processing circuit may include a data transmitting circuit and a data receiving circuit or interface; a data distribution circuit and a data broadcast circuit may be integrated in the data transmitting circuit, or arranged separately. Broadcast data are the data that must be sent to every basic processing circuit; distribution data are the data that must be selectively sent to some basic processing circuits. Specifically, taking the convolution operation as an example, the convolution input data must be sent to all basic processing circuits, so they are broadcast data, while the convolution kernels must be selectively sent to some basic data blocks, so the kernels are distribution data. The concrete choice of which basic processing circuits receive the distribution data can be determined by the main processing circuit according to the load and other allocation methods. In the broadcast mode, the broadcast data are sent to every basic processing circuit (in practice in one broadcast or in multiple broadcasts; the specific implementations of the present disclosure do not limit the number of broadcasts); in the distribution mode, the distribution data are selectively sent to some basic processing circuits.
Optionally, the accumulator circuits of the n basic processing circuits of the m-th row can perform the accumulation of the inner-product operations, since the basic processing circuits of the m-th row can receive the products of all basic processing circuits of their column; performing the accumulation of the inner products in the n basic processing circuits of the m-th row allocates the computing resources effectively and has the advantage of saving power; this technical scheme is especially suitable when m is large.
The main processing circuit can allocate the circuit that performs the data compression, specifically in an explicit or implicit manner. For the explicit manner, the main processing circuit can configure a special indication or instruction: when a basic processing circuit receives it, it performs the data compression; when it does not, it does not compress. The allocation may also be implicit: e.g., when a basic processing circuit receives data whose type is sparse (i.e., containing zeros, or containing more than a preset number of values below a preset threshold) and determines that an inner-product operation is to be performed, it compresses the sparse data. For the explicit configuration, the special instruction or indication can carry a decreasing sequence whose value is decremented by 1 each time it passes a basic processing circuit; a basic processing circuit reads the value of the decreasing sequence, performs the data compression if the value is greater than zero, and does not if the value is at or below zero. This setting matches the basic processing circuits arranged in the array: e.g., for the m basic processing circuits of the i-th column, if the main processing circuit requires the first 5 to perform the compression, it issues a special instruction carrying a decreasing sequence with initial value 5; the value decreases by 1 at each basic processing circuit, reaching 1 at the 5th and 0 at the 6th, at which point the 6th basic processing circuit no longer performs the compression. In this way the main processing circuit can dynamically configure the executing entities and the number of executions of the data compression step.
An embodiment of the present disclosure provides an integrated circuit chip device including one main processing circuit (also called a main unit) and a plurality of basic processing circuits (also called basic units); the structure of the embodiment is shown in FIG. 4b, where the dashed box is the internal structure of the neural network operation device, the gray-filled arrows denote the data transmission paths between the main processing circuit and the basic processing circuit array, and the hollow arrows denote the data transmission paths between the individual (adjacent) basic processing circuits in the array. The length and width of the basic processing circuit array may differ, i.e., the values of m and n may differ or coincide; the present disclosure does not limit their concrete values.
The circuit structure of a basic processing circuit is shown in FIG. 4c: the dashed box in the figure denotes the boundary of the basic processing circuit; the thick arrows crossing the dashed box denote the data input/output channels (pointing into the box: input channel; out of the box: output channel); the rectangles inside the dashed box denote the storage unit circuits (registers and/or on-chip cache), comprising input data 1, input data 2, the multiplication or inner-product result and the accumulation data; the diamonds denote the arithmetic circuits, comprising the multiplier or inner-product arithmetic unit and the adder.
In this embodiment, the neural network operation device includes one main processing circuit and 16 basic processing circuits (the 16 is merely illustrative; other values may be used in practice);
in this embodiment, each basic processing circuit has two data input interfaces and two data output interfaces; in the subsequent description of this example, the horizontal input interface (the horizontal arrow pointing to the unit in FIG. 4b) is called input 0, and the vertical input interface (the vertical arrow pointing to the unit) input 1; each horizontal data output interface (the horizontal arrow pointing away from the unit) is called output 0, and the vertical data output interface (the vertical arrow pointing away) output 1.
The data input and output interfaces of each basic processing circuit can be connected to different units, including the main processing circuit and other basic processing circuits;
in this example, input 0 of the four basic processing circuits 0, 4, 8, 12 (numbered as in FIG. 4b) is connected to the data output interfaces of the main processing circuit;
in this example, input 1 of the four basic processing circuits 0, 1, 2, 3 is connected to the data output interfaces of the main processing circuit;
in this example, output 1 of the four basic processing circuits 12, 13, 14, 15 is connected to the data input interfaces of the main processing circuit;
in this example, the connections between the output interfaces of the basic processing circuits and the input interfaces of other basic processing circuits are as shown in FIG. 1b and are not enumerated one by one;
specifically, output interface S1 of a unit S being connected to input interface P1 of a unit P means that the unit P can receive through its P1 interface the data that the unit S sends to its S1 interface.
This embodiment contains one main processing circuit connected to an external device (i.e., it has both input and output interfaces); a part of the data output interfaces of the main processing circuit are connected to the data input interfaces of a part of the basic processing circuits, and a part of its data input interfaces are connected to the data output interfaces of a part of the basic processing circuits.
Method of using the integrated circuit chip device
The data involved in the usage methods provided by the present disclosure may be compressed data; for how the data compression is realized, refer to the relevant explanations of the foregoing embodiments, e.g., FIGS. 1e–1k, not repeated here.
The operations to be completed in the basic processing circuits can be performed with the following method:
the control circuit of the main processing circuit distributes the data to the basic processing circuits for operation; correspondingly, the compression mapping circuits of the basic processing circuits first compress the data and then operate, the advantage being that the amount of data computation is reduced and the basic processing circuits perform the data operations more efficiently with lower power consumption.
If the data received by a basic processing circuit are sparse, the basic processing circuit can have its compression mapping circuit compress them before computing: e.g., on receiving sparse data transmitted by the main processing circuit, the compression mapping circuit compresses them, and then the inner-product arithmetic unit circuit, vector arithmetic unit circuit or accumulator circuit of the basic processing circuit operates on the compressed data, improving operation efficiency and lowering power consumption.
Method of using a basic processing circuit (see FIG. 5a):
the main processing circuit receives from outside the device the input data to be computed;
optionally, the main processing circuit processes the data using the various arithmetic circuits of the unit — the vector arithmetic circuit, the inner-product arithmetic unit circuit, the accumulator circuit, etc.;
the main processing circuit sends data through its data output interface to the basic processing circuit array (the set of all basic processing circuits is called the basic processing circuit array), as shown in FIG. 5b;
the manner of sending here may be sending data directly to a part of the basic processing circuits, i.e., the multiple-broadcast manner;
the manner of sending may also be sending different data to different basic processing circuits, i.e., the distribution manner;
the basic processing circuit array computes on the data;
a basic processing circuit performs the operation after receiving the input data; optionally, it can determine from the operation instruction of the data whether to start the compression mapping unit in the basic processing circuit to compress the data, and then operate on the compressed data.
Optionally, after receiving data a basic processing circuit transmits them out through the data output interface of the unit (to other basic processing circuits that have not received data directly from the main processing circuit; optionally, the forwarded data may also be compressed data);
optionally, a basic processing circuit transmits its operation result (an intermediate or final computation result) out through its data output interface;
the main processing circuit receives the output data returned from the basic processing circuit array;
optionally, the main processing circuit continues processing the data received from the array (e.g., accumulation or activation operations);
when the main processing circuit has finished processing, it transmits the processing result to the outside of the device through the data output interface.
Completing a matrix-times-vector operation with the circuit device:
(a matrix-times-vector operation may compute the inner product of each row of the matrix with the vector and arrange these results into a vector in the order of the corresponding rows.)
The following describes the multiplication of a matrix S of size M rows and L columns with a vector P of length L, as shown in FIG. 5c below.
This method uses all or a part of the basic processing circuits of the neural network computing device; suppose K basic processing circuits are used;
the main processing circuit sends data of some or all rows of the matrix S to each of the k basic processing circuits;
in one optional scheme, the control circuit of the main processing circuit sends the data of one row of S one number at a time, or a part of the numbers at a time, to a given basic processing circuit (e.g., sending one number at a time: for a given basic processing circuit, the 1st transmission sends the 1st number of row 3, the 2nd transmission the 2nd number of row 3, the 3rd transmission the 3rd number of row 3, …; or sending a part at a time: the 1st transmission sends the first two numbers (numbers 1 and 2) of row 3, the 2nd transmission numbers 3 and 4 of row 3, the 3rd transmission numbers 5 and 6 of row 3, …);
in one optional scheme, the control circuit sends the data of several rows of S, one number each (or a part each) at a time, to a given basic processing circuit (e.g., the 1st transmission sends the 1st number of each of rows 3, 4 and 5, the 2nd the 2nd number of each, the 3rd the 3rd number of each, …; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5, the 2nd numbers 3 and 4 of each, the 3rd numbers 5 and 6 of each, …).
The control circuit of the main processing circuit sends the data of the vector P successively to the 0-th basic processing circuit;
after receiving the data of the vector P, the 0-th basic processing circuit sends them to the next basic processing circuit connected to it, i.e., basic processing circuit 1;
specifically, some basic processing circuits cannot obtain all the data needed for the computation directly from the main processing circuit: e.g., basic processing circuit 1 in FIG. 2d has only one data input interface connected to the main processing circuit, so it can only obtain the data of the matrix S directly from it, while the data of the vector P must be output to basic processing circuit 1 by basic processing circuit 0; likewise, after receiving the data, basic processing circuit 1 must continue to output the data of P to basic processing circuit 2.
Optionally, after receiving data, each of the K basic processing circuits can first determine from the operation instruction of the data (i.e., the operation control) whether to start the corresponding compression mapping circuit to compress the data, and then operate on the compressed data; optionally, it can also transmit the compressed data to other basic processing units.
For example, after receiving the input matrix S or the matrix P, a basic processing circuit starts its compression mapping circuit to remove the data of S and P equal to a specified value (e.g., 0) and/or below a preset threshold (e.g., 0.1); in implementation this can be done according to the respective mask matrices of S and P, e.g., removing the data of S/P at the positions where the mask matrix is 0; refer to the foregoing data compression embodiments, not repeated here. The matrices S and P here may also be understood as the input neurons (input neuron matrix) and weights (weight matrix) of the foregoing embodiments.
Each basic processing circuit operates on the received data, the operations including but not limited to: inner-product operations, multiplication operations, addition operations, etc.;
in one optional scheme, a basic processing circuit computes one or several groups of products of two data at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, it computes one or several inner products of two vectors at a time and accumulates the results into its registers and/or on-chip cache;
having computed a result, a basic processing circuit transmits the result out through its data output interface (i.e., to the other basic processing circuits connected to it);
in one optional scheme, the computation result may be a final or intermediate result of the inner-product operation;
after receiving a computation result from another basic processing circuit, a basic processing circuit transmits the data to the other basic processing circuits connected to it or to the main processing circuit;
the main processing circuit receives the inner-product results of all the basic processing circuits and processes them into the final result (the processing may be an accumulation operation or an activation operation, etc.).
An embodiment of realizing the matrix-times-vector method with the above computing device:
in one optional scheme, the plurality of basic processing circuits used by the method are arranged as shown in FIG. 5d or FIG. 5e below;
as shown in FIG. 4c, the control circuit of the main processing unit divides the M rows of S into K groups, the i-th basic processing circuit taking charge of the operation of the i-th group (the set of rows of that group being denoted Ai). Specifically, before taking charge of the operation of the i-th group, the i-th basic processing circuit can determine from the operation instruction of the data whether it is necessary to first compress Ai with its compression mapping circuit and then operate on the compressed Ai; alternatively, each basic processing unit of the first column or first row of the device makes this determination before taking charge of the operation of the i-th group; the present application is not limited. For the data compression, refer to the foregoing embodiments, not repeated here.
Any grouping method for the M rows of data that does not assign a row twice is admissible;
in one optional scheme, the following allocation is used: row j is assigned to the (j % K)-th basic processing circuit (% being the remainder operation);
in one optional scheme, for rows that cannot be grouped evenly, a part of the rows can first be allocated evenly and the remaining rows allocated in any manner.
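A tiny sketch of the j % K allocation just described (the function name is illustrative):

```python
def assign_rows(M: int, K: int):
    """Row-to-circuit assignment: row j goes to basic processing
    circuit j % K, so no row is ever assigned twice."""
    groups = [[] for _ in range(K)]
    for j in range(M):
        groups[j % K].append(j)
    return groups

print(assign_rows(M=10, K=4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```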
The control circuit of the main processing circuit sends the data of some or all rows of the matrix S successively to the corresponding basic processing circuits;
in one optional scheme, the control circuit sends, at a time, one or more data of one row of the i-th group of data Mi for which the i-th basic processing circuit is responsible;
in one optional scheme, the control circuit sends, at a time, one or more data of each of some or all rows of the i-th group of data Mi;
the control circuit sends the data of the vector P successively to the 1st basic processing circuit;
in one optional scheme, the control circuit can send one or more data of the vector P at a time;
after receiving the data of the vector P, the i-th basic processing circuit sends them to the (i+1)-th basic processing circuit connected to it; optionally, the forwarded data of P may be compressed data.
After receiving one or more data from one or several rows of the matrix S and one or more data from the vector P, each basic processing circuit performs the operation (including but not limited to multiplication or addition);
in one optional scheme, a basic processing circuit computes one or several groups of products of two data at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, it computes one or several inner products of two vectors at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, the data received by a basic processing circuit may also be intermediate results, kept in its registers and/or on-chip cache;
a basic processing circuit transmits its local computation result to the next basic processing circuit connected to it or to the main processing circuit;
in one optional scheme, corresponding to the structure of FIG. 5d, only the last basic processing circuit of each column has its output interface connected to the main processing circuit; in that case only the last basic processing circuit can transmit its local result directly to the main processing circuit, while the results of the other basic processing circuits must be passed to their next basic processing circuit, and that one to the one after it, until all results reach the last basic processing circuit, which accumulates its local result with the received results of the other circuits of its column into an intermediate result and sends the intermediate result to the main processing circuit; alternatively, the last basic processing circuit may send the results of the other basic circuits of its column together with its local processing result directly to the main processing circuit.
in one optional scheme, corresponding to the structure of FIG. 5e, every basic processing circuit has an output interface connected to the main processing circuit; in that case every basic processing circuit transmits its local computation result directly to the main processing circuit;
after receiving a computation result passed on by another basic processing circuit, a basic processing circuit transmits it to the next basic processing circuit connected to it or to the main processing circuit.
The main processing circuit receives the results of the M inner-product operations, which form the operation result of the matrix-times-vector.
Completing a matrix-times-matrix operation with the circuit device:
The following describes the multiplication of a matrix S of size M rows, L columns with a matrix P of size L rows, N columns (each row of S has the same length as each column of P, as shown in FIG. 5f);
the method is explained using the embodiment of the device shown in FIG. 4b;
the control circuit of the main processing circuit sends the data of some or all rows of the matrix S to the basic processing circuits directly connected to the main processing circuit through the horizontal data input interfaces (e.g., the gray-filled vertical data paths at the top of FIG. 4b);
in one optional scheme, the control circuit sends the data of one row of S one number (or a part of the numbers) at a time to a given basic processing circuit (e.g., for a given basic processing circuit: the 1st transmission sends the 1st number of row 3, the 2nd transmission the 2nd number of row 3, the 3rd transmission the 3rd number of row 3, …; or the 1st transmission sends the first two numbers of row 3, the 2nd numbers 3 and 4 of row 3, the 3rd numbers 5 and 6 of row 3, …);
in one optional scheme, the control circuit sends the data of several rows of S one number each (or a part each) at a time to a given basic processing circuit (e.g., the 1st transmission sends the 1st number of each of rows 3, 4 and 5, the 2nd the 2nd number of each, the 3rd the 3rd number of each, …; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5, the 2nd numbers 3 and 4 of each, the 3rd numbers 5 and 6 of each, …);
the control circuit sends the data of some or all columns of the matrix P to the basic processing circuits directly connected to the main processing circuit through the vertical data input interfaces (e.g., the gray-filled horizontal data paths at the left of the basic processing circuit array in FIG. 4b);
in one optional scheme, the control circuit sends the data of one column of P one number (or a part of the numbers) at a time to a given basic processing circuit (e.g., the 1st transmission sends the 1st number of column 3, the 2nd transmission the 2nd number of column 3, the 3rd transmission the 3rd number of column 3, …; or the 1st transmission sends the first two numbers of column 3, the 2nd numbers 3 and 4 of column 3, the 3rd numbers 5 and 6 of column 3, …);
in one optional scheme, the control circuit sends the data of several columns of P one number each (or a part each) at a time (e.g., the 1st transmission sends the 1st number of each of columns 3, 4 and 5, the 2nd the 2nd of each, the 3rd the 3rd of each, …; or the 1st transmission sends the first two numbers of each of columns 3, 4 and 5, the 2nd numbers 3 and 4 of each, the 3rd numbers 5 and 6 of each, …);
after receiving the data of the matrix S, a basic processing circuit transmits them through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the array in FIG. 4b); after receiving the data of the matrix P, it transmits them through its vertical data output interface to the next basic processing circuit connected to it (e.g., the white-filled vertical data paths in the middle of the array);
optionally, when every basic processing circuit includes a compression mapping circuit, after receiving data (specifically, data of the matrix S or of the matrix P) it can determine from the operation control of the data whether to start the compression mapping circuit to compress them; further, it can transmit the compressed data through its horizontal or vertical data output interface to the next basic processing circuit connected to it;
for example, after receiving the input matrix S or the matrix P, a basic processing circuit starts its compression mapping circuit to remove the data of S and P equal to a specified value (e.g., 0) and/or below a preset threshold (e.g., 0.1), in implementation according to the respective mask matrices of S and P (e.g., removing the data of S/P at the positions where the mask matrix is 0); refer to the foregoing data compression embodiments, not repeated here; the matrices S and P may again be understood as the input neurons (input neuron matrix) and weights (weight matrix);
optionally, when the basic processing circuits of the first column and the first row all include compression mapping circuits, each such circuit, after receiving data (of S or P), can determine from the operation control corresponding to the data whether to start the compression mapping circuit of its own basic processing circuit to compress them; further, it can transmit the compressed data through its horizontal or vertical data output interface to the next connected basic processing circuit; optionally, the basic processing circuits of the first column or first row can directly start their internal compression mapping circuits on receiving data, then proceed with the subsequent steps, e.g., sending the data to other basic processing circuits or operating on them;
each basic processing circuit operates on the received data, which optionally may be compressed data;
in one optional scheme, a basic processing circuit computes one or several groups of products of two data at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, it computes one or several inner products of two vectors at a time and accumulates the results into its registers and/or on-chip cache;
having computed a result, a basic processing circuit can transmit it out through its data output interface;
in one optional scheme, the computation result may be a final or intermediate result of the inner-product operation;
specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of a basic processing circuit able to output directly to the main processing circuit (e.g., in FIG. 4b, the bottom row of basic processing circuits output their results directly to the main processing circuit, while the other basic processing circuits pass operation results downward through their vertical output interfaces).
After receiving a computation result from another basic processing circuit, a basic processing circuit transmits the data to the other basic processing circuits connected to it or to the main processing circuit;
results are output in the direction able to reach the main processing circuit directly (e.g., in FIG. 4b, the bottom row outputs directly to the main processing circuit; the others pass results downward through the vertical output interfaces);
the main processing circuit receives the inner-product results of all the basic processing circuits, thereby obtaining the output result.
An embodiment of the "matrix-times-matrix" method:
the method uses a basic processing circuit array arranged as in FIG. 4b, assumed to have h rows and w columns;
the control circuit of the main processing circuit divides the h rows of data of the matrix S into h groups, the i-th basic processing circuit taking charge of the operation of the i-th group (the set of rows of that group being denoted Hi);
any grouping method for the h rows of data that does not assign a row twice is admissible;
in one optional scheme, the following allocation is used: the control circuit assigns row j to the (j % h)-th basic processing circuit;
in one optional scheme, for rows that cannot be grouped evenly, a part of the rows can first be allocated evenly and the remaining rows allocated in any manner.
The control circuit divides the W columns of data of the matrix P into w groups, the i-th basic processing circuit taking charge of the operation of the i-th group (the set of columns of that group being denoted Wi);
any grouping method for the W columns of data that does not assign a column twice is admissible;
in one optional scheme, the control circuit assigns column j to the (j % w)-th basic processing circuit;
in one optional scheme, for columns that cannot be grouped evenly, a part of the columns can first be allocated evenly and the remaining columns allocated in any manner.
The control circuit sends the data of some or all rows of the matrix S to the first basic processing circuit of each row of the basic processing circuit array;
in one optional scheme, the control circuit sends, at a time, one or more data of one row of the i-th group of data Hi to the first basic processing circuit of the i-th row of the array;
in one optional scheme, the control circuit sends, at a time, one or more data of each of some or all rows of the i-th group of data Hi to the first basic processing circuit of the i-th row of the array;
the control circuit sends the data of some or all columns of the matrix P to the first basic processing circuit of each column of the array;
in one optional scheme, the control circuit sends, at a time, one or more data of one column of the i-th group of data Wi to the first basic processing circuit of the i-th column of the array;
in one optional scheme, the control circuit sends, at a time, one or more data of each of some or all columns of the i-th group of data Wi to the first basic processing circuit of the i-th column of the array;
after receiving the data of the matrix S, a basic processing circuit transmits them through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the array in FIG. 4b); after receiving the data of the matrix P, it transmits them through its vertical data output interface to the next basic processing circuit connected to it (e.g., the white-filled vertical data paths in the middle of the array);
optionally, when every basic processing circuit includes a compression mapping circuit, after receiving data (of S or P) it can determine from the operation control of the data whether to start the compression mapping circuit to compress them; further, it can transmit the compressed data through its horizontal or vertical data output interface to the next connected basic processing circuit; for the data compression, refer to the foregoing embodiments, not repeated here;
optionally, when the basic processing circuits of the first column and the first row all include compression mapping circuits, each basic processing unit of the first column or first row of the device can compress the received data (of S or P); further, it can transmit the compressed data through its horizontal or vertical data output interface to the next connected basic processing circuit; refer to the foregoing embodiments;
each basic processing circuit operates on the received data, which optionally may be compressed data;
in one optional scheme, a basic processing circuit computes one or several groups of products of two data at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, it computes one or several inner products of two vectors at a time and accumulates the results into its registers and/or on-chip cache;
having computed a result, a basic processing circuit can transmit it out through its data output interface;
in one optional scheme, the computation result may be a final or intermediate result of the inner-product operation;
specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of a basic processing circuit able to output directly to the main processing circuit (e.g., the bottom row of basic processing circuits output their results directly to the main processing circuit, while the others pass operation results downward through their vertical output interfaces).
After receiving a computation result from another basic processing circuit, a basic processing circuit transmits the data to the other basic processing circuits connected to it or to the main processing circuit;
results are output in the direction able to reach the main processing circuit directly (e.g., the bottom row outputs directly; the others pass results downward through the vertical output interfaces);
the main processing circuit receives the inner-product results of all the basic processing circuits, thereby obtaining the output result.
The words "horizontal" and "vertical" used in the above description merely describe the example of FIG. 4b; in actual use it is only necessary that the "horizontal" and "vertical" interfaces of each unit denote two different interfaces.
Completing a fully connected operation with the circuit device:
if the input data of the fully connected layer are a vector (i.e., the case where the input of the neural network is a single sample), the weight matrix of the fully connected layer serves as the matrix S and the input vector as the vector P, and the operation is performed according to the device's matrix-times-vector method;
if the input data of the fully connected layer are a matrix (i.e., the case where the input of the neural network is multiple samples), the weight matrix of the layer serves as the matrix S and the input matrix as the matrix P, or the weight matrix serves as P and the input matrix as S, and the operation is performed according to the device's matrix-times-matrix method.
Completing a convolution operation with the circuit device:
The convolution operation is described below; in the following figures one square denotes one datum. The input data are shown in FIG. 6a (N samples, each with C channels, the feature map of each channel being of height H and width W); the weights, i.e., the convolution kernels, are shown in FIG. 6b (M kernels, each with C channels, of height KH and width KW). The rule of the convolution operation is the same for all N samples of the input data; the process of performing a convolution on one sample is explained below. On one sample, each of the M kernels performs the same operation, each kernel's operation producing one planar feature map, so the M kernels finally compute M planar feature maps (for one sample, the output of the convolution is M feature maps); for one kernel, an inner product is to be performed at every planar position of a sample, sliding along the H and W directions. For example, FIG. 6c shows the position where a kernel performs the inner product at the bottom-right corner of one sample of the input data; FIG. 6d shows the convolution position slid one cell to the left, and FIG. 6e the convolution position slid one cell up.
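A minimal sketch of this sliding-window inner product for one kernel on one sample (stride 1, no padding; names are illustrative):

```python
import numpy as np

def conv_single_sample(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """One kernel over one sample, as described above: an inner product
    at every planar position, sliding along H and W.
    x: (C, H, W); kernel: (C, KH, KW); returns one (H-KH+1, W-KW+1) map."""
    C, H, W = x.shape
    _, KH, KW = kernel.shape
    out = np.empty((H - KH + 1, W - KW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[:, i:i + KH, j:j + KW] * kernel)
    return out

# M kernels would produce M such planar feature maps per sample.
```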
The method is explained using the embodiment of the device shown in FIG. 4b;
the control circuit of the main processing circuit sends the data of some or all kernels of the weights to the basic processing circuits directly connected to the main processing circuit through the horizontal data input interfaces (e.g., the gray-filled vertical data paths at the top of FIG. 4b);
in one optional scheme, the control circuit sends the data of one kernel of the weights one number (or a part of the numbers) at a time to a given basic processing circuit (e.g., for a given basic processing circuit: the 1st transmission sends the 1st number of row 3, the 2nd the 2nd number of row 3, the 3rd the 3rd number of row 3, …; or the 1st transmission sends the first two numbers of row 3, the 2nd numbers 3 and 4, the 3rd numbers 5 and 6, …);
in another optional scheme, the control circuit sends the data of several kernels of the weights one number each (or a part each) at a time to a given basic processing circuit (e.g., the 1st transmission sends the 1st number of each of rows 3, 4 and 5, the 2nd the 2nd of each, the 3rd the 3rd of each, …; or the 1st transmission sends the first two numbers of each of rows 3, 4 and 5, the 2nd numbers 3 and 4 of each, the 3rd numbers 5 and 6 of each, …);
the control circuit of the main processing circuit divides the input data according to the convolution positions and sends the data of some or all convolution positions of the input data to the basic processing circuits directly connected to the main processing circuit through the vertical data input interfaces (e.g., the gray-filled horizontal data paths at the left of the array in FIG. 4b);
in one optional scheme, the control circuit sends the data of one convolution position of the input data one number (or a part of the numbers) at a time to a given basic processing circuit (e.g., the 1st transmission sends the 1st number of column 3, the 2nd the 2nd number of column 3, the 3rd the 3rd number of column 3, …; or the 1st transmission sends the first two numbers of column 3, the 2nd numbers 3 and 4 of column 3, the 3rd numbers 5 and 6 of column 3, …);
in another optional scheme, the control circuit sends the data of several convolution positions one number each (or a part each) at a time (e.g., the 1st transmission sends the 1st number of each of columns 3, 4 and 5, the 2nd the 2nd of each, the 3rd the 3rd of each, …; or the 1st transmission sends the first two numbers of each of columns 3, 4 and 5, the 2nd numbers 3 and 4 of each, the 3rd numbers 5 and 6 of each, …);
after receiving the weight data, a basic processing circuit transmits them through its horizontal data output interface to the next basic processing circuit connected to it (e.g., the white-filled horizontal data paths in the middle of the array in FIG. 4b); after receiving the input data, it transmits them through its vertical data output interface to the next connected basic processing circuit (e.g., the white-filled vertical data paths); optionally, after receiving data (specifically, data of some or all kernels of the weights) it can determine from the operation control of the data whether to start the compression mapping circuit to compress them; further, it can transmit the compressed data through its horizontal or vertical data output interface to the next connected basic processing circuit; refer to the foregoing embodiments for details.
Alternatively, after receiving data (specifically, data of some or all kernels of the weights), each basic processing unit of the first column or first row of the device can compress the data; further, it can transmit the compressed data through its horizontal or vertical data output interface to the next connected basic processing circuit; refer to the foregoing embodiments for details.
Each basic processing circuit operates on the received data, which may be compressed data;
in one optional scheme, a basic processing circuit computes one or several groups of products of two data at a time and accumulates the results into its registers and/or on-chip cache;
in one optional scheme, it computes one or several inner products of two vectors at a time and accumulates the results into its registers and/or on-chip cache;
having computed a result, a basic processing circuit can transmit it out through its data output interface;
in one optional scheme, the computation result may be a final or intermediate result of the inner-product operation; specifically, if the basic processing circuit has an output interface directly connected to the main processing circuit, it transmits the result through that interface; if not, it outputs the result in the direction of a basic processing circuit able to output directly to the main processing circuit (e.g., in FIG. 4b, the bottom row of basic processing circuits output their results directly to the main processing circuit, while the others pass operation results downward through their vertical output interfaces).
After receiving a computation result from another basic processing circuit, a basic processing circuit transmits the data to the other basic processing circuits connected to it or to the main processing circuit;
results are output in the direction able to reach the main processing circuit directly (e.g., the bottom row outputs directly; the others pass results downward through the vertical output interfaces);
the main processing circuit receives the inner-product results of all the basic processing circuits, thereby obtaining the output result.
In an embodiment, the present application discloses a neural network operation device, which includes functional units for performing all or part of the implementations provided in the above method embodiments.
In an embodiment, the present application discloses a chip (see FIG. 7) for performing all or part of the implementations provided in the above method embodiments.
In an embodiment, the present application discloses an electronic device, which includes functional units for performing all or part of the implementations of the above method embodiments.
Electronic devices include data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, servers, camcorders, projectors, watches, headphones, mobile storage, wearables, vehicles, household appliances and/or medical equipment.
The vehicles include airplanes, ships and/or road vehicles; the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lamps, gas stoves and range hoods; the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners and/or electrocardiographs.
Claims (23)
- An integrated circuit chip device, characterized in that it comprises a main processing circuit and k groups of basic processing circuits, the main processing circuit being connected to the k groups of basic processing circuits, each group comprising at least one basic processing circuit; the main processing circuit is configured to perform the successive operations of a neural network operation and to transmit data to the plurality of basic processing circuits; the k groups of basic processing circuits are configured to perform the operations of the neural network in parallel on the transmitted data and to transmit the operation results to the main processing circuit.
- The integrated circuit chip device according to claim 1, characterized in that it further comprises k branch circuits, the main processing circuit being connected to each of the k branch circuits, each branch circuit corresponding to one group of the k groups of basic processing circuits and being configured to forward the transmitted data between the main processing circuit and the k groups of basic processing circuits.
- The integrated circuit chip device according to claim 1, characterized in that the basic processing circuits comprise compression mapping circuits configured to compress the individual data of the neural network operation; the k groups of basic processing circuits are specifically configured to control, according to the operation of the transmitted data, whether to start the compression mapping circuits to compress the transmitted data, to perform the operations of the neural network in parallel on the transmitted or compressed data, and to transmit the operation results to the main processing circuit.
- The integrated circuit chip device according to claim 3, characterized in that the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block into a distribution data block and a broadcast data block according to the instruction, to split the distribution data block into a plurality of basic data blocks, to distribute them to the circuits connected to it and to broadcast the broadcast data block to those circuits; the basic processing circuits are configured to start the compression mapping circuits according to the operation control, compress the basic data block and the broadcast data block, then perform inner-product operations to obtain operation results and send them to the main processing circuit; the main processing circuit is configured to process the operation results into the instruction result of the data block and the operation instruction; the data block to be computed is at least one input neuron and/or at least one weight to be computed.
- The integrated circuit chip device according to claim 2, characterized in that the branch circuits comprise compression mapping circuits configured to compress the individual data of the neural network operation; the main processing circuit is configured to perform the successive operations and to transmit data to the connected k branch circuits; the k branch circuits are configured to forward the transmitted data between the main processing circuit and the k groups of basic circuits, and to control, according to the operation of the transmitted data, whether to start the compression mapping circuits; the k basic processing circuits are configured to perform the neural network operations in parallel on the transmitted or compressed data and to transmit the results to the main processing circuit.
- The integrated circuit chip device according to claim 5, characterized in that the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block into a distribution data block and a broadcast data block, to split the distribution data block into a plurality of basic data blocks, to distribute them to the connected k branch circuits and to broadcast the broadcast data block to them; the k branch circuits are configured to receive the basic data blocks and the broadcast data block, start the compression mapping circuits to compress them, and forward the compressed basic data blocks and the compressed broadcast data block to the k groups of basic processing circuits; the basic processing circuits are configured to perform inner-product operations on the compressed basic data blocks and the compressed broadcast data block to obtain operation results and send them to the main processing circuit, which processes them into the instruction result; the distribution data block and the broadcast data block are at least one input neuron or at least one weight.
- The integrated circuit chip device according to claim 4 or 6, characterized in that the compression mapping circuit comprises a second sparse processing unit, a third sparse processing unit and a connection relationship processing unit; the second sparse processing unit, upon receiving third input data, derives first connection relationship data from them and transmits them to the connection relationship processing unit; the third sparse processing unit, upon receiving fourth input data, derives second connection relationship data and transmits them to the connection relationship processing unit; the connection relationship processing unit derives third connection relationship data from the first and second connection relationship data and transmits them to a second data processing unit; the second data processing unit, upon receiving the third input data, the fourth input data and the third connection relationship data, compresses the third and fourth input data according to the third connection relationship data to obtain fourth and fifth output data; when the third input data include at least one input neuron and the fourth input data include at least one weight, the first connection relationship data are those of the input neurons, the second those of the weights, the fourth output data the processed input neurons and the fifth output data the processed weights; when the third input data include at least one weight and the fourth input data include at least one input neuron, the first connection relationship data are those of the weights, the second those of the input neurons, the fourth output data the processed weights and the fifth output data the processed input neurons.
- The integrated circuit chip device according to claim 7, characterized in that the connection relationship data of the neurons and of the weights consist of strings or matrices of 0s and 1s, independent of the output neurons; or the connection relationship data of the input neurons and of the weights are each expressed in the form of a direct index or a stride index; wherein, when the connection relationship data of the input neurons are expressed as a direct index, they are a string of 0s and 1s, 0 meaning that the absolute value of the input neuron is at or below a first threshold and 1 that it exceeds the first threshold; when expressed as a stride index, they are a string of the distances between each input neuron whose absolute value exceeds the first threshold and the previous such input neuron; when the connection relationship data of the weights are expressed as a direct index, they are a string of 0s and 1s, 0 meaning that the absolute value of the weight is at or below a second threshold, i.e., the input and output neurons corresponding to that weight are not connected, and 1 that it exceeds the second threshold, i.e., they are connected; the direct-index form of the weights has two orderings: the connection states of each output neuron with all input neurons form a string of 0s and 1s representing the connection relationship data of the weights, or the connection states of each input neuron with all output neurons do; when the connection relationship data of the weights are expressed as a stride index, they are a string of the distances between each input neuron connected to an output neuron and the previous input neuron connected to that output neuron.
- The integrated circuit chip device according to claim 8, characterized in that, when the first and second connection relationship data are both expressed as stride indices and their strings are stored in order of physical address from low to high, the connection relationship processing unit is specifically configured to: accumulate each element of the string of the first connection relationship data with the elements stored at physical addresses lower than that element's, the resulting new elements forming fourth connection relationship data; process the string of the second connection relationship data likewise, obtaining fifth connection relationship data; select the common elements of the fourth and fifth strings and sort them in ascending order of element value into a new string; and subtract from each element of the new string its adjacent smaller-valued element, the resulting elements forming the third connection relationship data.
- The integrated circuit chip device according to claim 8, characterized in that, when the first and second connection relationship data are both expressed as direct indices, the connection relationship processing unit is specifically configured to perform an AND operation on them to obtain the third connection relationship data.
- The integrated circuit chip device according to claim 8, characterized in that, when either of the first and second connection relationship data is expressed as a stride index and the other as a direct index, the connection relationship processing unit is specifically configured to: if the first connection relationship data are in stride-index form, convert them into direct-index form; if the second connection relationship data are in stride-index form, convert them into direct-index form; and perform an AND operation on the first and second connection relationship data to obtain the third connection relationship data.
- The integrated circuit chip device according to claim 8, characterized in that, when either of the first and second connection relationship data is expressed as a stride index and the other as a direct index, and their strings are stored in order of physical address from low to high, the connection relationship processing unit is further specifically configured to: if the first connection relationship data are in stride-index form, convert the second connection relationship data into stride-index form; if the second connection relationship data are in stride-index form, convert the first connection relationship data into stride-index form; accumulate each element of the string of the first connection relationship data with the elements stored at lower physical addresses, the resulting new elements forming fourth connection relationship data, and process the string of the second connection relationship data likewise, obtaining fifth connection relationship data; select the common elements of the fourth and fifth strings and sort them in ascending order into a new string; and subtract from each element of the new string its adjacent smaller-valued element, the resulting elements forming the third connection relationship data.
- The integrated circuit chip device according to claim 6, characterized in that, before starting the compression mapping circuits to compress the basic data block and the broadcast data block, the k branch circuits are further configured to, by means of the compression mapping circuits: group the at least one input neuron to obtain M groups of input neurons, M being an integer greater than or equal to 1; judge whether each of the M groups satisfies a first preset condition, namely that the number of input neurons in the group whose absolute values are at or below a third threshold is at or below a fourth threshold; delete any group of input neurons that does not satisfy the first preset condition; group the at least one weight to obtain N groups of weights, N being an integer greater than or equal to 1; judge whether each of the N groups satisfies a second preset condition, namely that the number of weights in the group whose absolute values are at or below a fifth threshold is at or below a sixth threshold; and delete any group of weights that does not satisfy the second preset condition.
- The integrated circuit chip device according to claim 6, characterized in that the main processing circuit is specifically configured to broadcast the broadcast data block to the k branch circuits in a single broadcast; or the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks and to broadcast them to the k branch circuits in multiple broadcasts.
- The integrated circuit chip device according to claim 14, characterized in that the basic processing circuit is specifically configured to perform one inner-product operation on the compressed partial broadcast data block and the compressed basic data block to obtain an inner-product result, to accumulate the inner-product results into a partial operation result, and to send the partial operation result to the main processing circuit.
- The integrated circuit chip device according to claim 15, characterized in that the basic processing circuit is specifically configured to reuse the partial broadcast data block n times, performing the inner-product operations of that block with n basic data blocks to obtain n partial processing results, accumulating them respectively into n partial operation results, and sending the n partial operation results to the main processing circuit, n being an integer greater than or equal to 2.
- An integrated circuit chip device, characterized in that it comprises a main processing circuit and a plurality of basic processing circuits; the plurality of basic processing circuits are arranged in an array, each basic processing circuit being connected to its adjacent basic processing circuits, and the main processing circuit is connected to k of the basic processing circuits, the k basic circuits being: the n basic processing circuits of the first row and the m basic processing circuits of the first column; some or all of the plurality of basic processing circuits comprise compression mapping circuits configured to compress the individual data of the neural network operation; the main processing circuit is configured to perform the successive operations of the neural network operation and to exchange data with the k basic processing circuits; the k basic processing circuits are configured to forward data between the main processing circuit and the plurality of basic processing circuits; the some or all basic processing circuits are configured to determine, according to the operation control of the transmitted data, whether to start the compression mapping circuits to compress the transmitted data, to perform the neural network operations in parallel on the compressed transmitted data, and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it.
- The integrated circuit chip device according to claim 17, characterized in that, when all of the plurality of basic processing circuits comprise compression mapping circuits, the plurality of basic processing circuits are configured to determine, according to the operation control of the transmitted data, whether to start the compression mapping circuits to compress the transmitted data, to perform the neural network operations in parallel on the compressed data, and to transmit the operation results through the k basic processing circuits to the main processing circuit.
- The integrated circuit chip device according to claim 18, characterized in that the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block into a distribution data block and a broadcast data block according to the instruction, to split the distribution data block into a plurality of basic data blocks, to distribute them to the k basic processing circuits and to broadcast the broadcast data block to the k basic processing circuits; the plurality of basic processing circuits are configured to start the compression mapping circuits according to the received basic data blocks, broadcast data block and operation instruction, compress the basic data blocks and the broadcast data block, perform inner-product operations on the compressed blocks to obtain operation results, and transmit the results through the k basic processing circuits to the main processing circuit; the main processing circuit is configured to process the results into the instruction result of the data block and the operation instruction; the distribution data block and the broadcast data block are at least one input neuron or at least one weight.
- The integrated circuit chip device according to claim 17, characterized in that, when the k basic processing circuits among the plurality comprise compression mapping circuits, the k basic processing circuits are configured to determine, according to the operation control of the transmitted data, whether to start the compression mapping circuits to compress the transmitted data, and to send the compressed data to the basic processing circuits connected to the k basic processing circuits; the plurality of basic processing circuits are configured to perform the neural network operations in parallel on the compressed data and to transmit the operation results to the main processing circuit through the basic processing circuits connected to it.
- The integrated circuit chip device according to claim 20, characterized in that the main processing circuit is configured to obtain a data block to be computed and an operation instruction, to divide the data block into a distribution data block and a broadcast data block according to the instruction, to split the distribution data block into a plurality of basic data blocks, to distribute them to the k basic processing circuits and to broadcast the broadcast data block to them; the k basic processing circuits are configured to start the compression mapping circuits according to the received basic data blocks, broadcast data block and operation instruction, compress them, and transmit them to the basic processing circuits connected to the k basic processing circuits; the plurality of basic processing circuits are configured to perform inner-product operations on the compressed basic data blocks and the broadcast data block to obtain operation results and to send them to the main processing circuit; the main processing circuit is configured to process the results into the instruction result of the data block and the operation instruction; the distribution data block and the broadcast data block are at least one input neuron or at least one weight.
- A chip, characterized in that it integrates the device according to any one of claims 1–16, or the device according to any one of claims 17–21.
- A smart device, characterized in that it comprises the chip according to claim 22.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP18894430.0A EP3624019A4 (en) | 2017-12-30 | 2018-12-29 | CHIP DEVICE WITH INTEGRATED CIRCUIT AND ASSOCIATED PRODUCT |
US16/698,108 US11734548B2 (en) | 2017-12-30 | 2019-11-27 | Integrated circuit chip device and related product |
US16/698,000 US11704544B2 (en) | 2017-12-30 | 2019-11-27 | Integrated circuit chip device and related product |
US16/698,056 US11651202B2 (en) | 2017-12-30 | 2019-11-27 | Integrated circuit chip device and related product |
US16/698,164 US11710031B2 (en) | 2017-12-30 | 2019-11-27 | Parallel processing circuits for neural networks |
Applications Claiming Priority (8)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711499267.X | 2017-12-30 | ||
CN201711499266.5 | 2017-12-30 | ||
CN201711499268.4A CN109993292B (zh) | 2017-12-30 | 2017-12-30 | 集成电路芯片装置及相关产品 |
CN201711499267.XA CN109993291B (zh) | 2017-12-30 | 2017-12-30 | 集成电路芯片装置及相关产品 |
CN201711499265.0A CN109993289B (zh) | 2017-12-30 | 2017-12-30 | 集成电路芯片装置及相关产品 |
CN201711499265.0 | 2017-12-30 | ||
CN201711499266.5A CN109993290B (zh) | 2017-12-30 | 2017-12-30 | 集成电路芯片装置及相关产品 |
CN201711499268.4 | 2017-12-30 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US16/698,000 Continuation-In-Part US11704544B2 (en) | 2017-12-30 | 2019-11-27 | Integrated circuit chip device and related product |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019129302A1 (zh) | 2019-07-04 |
Family
ID=67063343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/125801 WO2019129302A1 (zh) | 2017-12-30 | 2018-12-29 | 集成电路芯片装置及相关产品 |
Country Status (3)
Country | Link |
---|---|
US (1) | US11704544B2 (en) |
EP (1) | EP3624019A4 (en) |
WO (1) | WO2019129302A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210406654A1 (en) * | 2020-06-29 | 2021-12-30 | Alibaba Group Holding Limited | Artificial neural network with sparse weights |
Family Cites Families (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5041916A (en) | 1989-02-07 | 1991-08-20 | Matsushita Electric Industrial Co., Ltd. | Color image data compression and recovery apparatus based on neural networks |
US5627533A (en) * | 1994-08-05 | 1997-05-06 | Hayes Microcomputer Products, Inc. | Adjusting encoding table size and memory allocation for data compression in response to input data |
US20080222064A1 (en) * | 2007-03-08 | 2008-09-11 | Larimer Daniel J | Processes and Systems for Automated Collective Intelligence |
US8990133B1 (en) * | 2012-12-20 | 2015-03-24 | Brain Corporation | Apparatus and methods for state-dependent learning in spiking neuron networks |
US9256215B2 (en) * | 2012-07-27 | 2016-02-09 | Brain Corporation | Apparatus and methods for generalized state-dependent learning in spiking neuron networks |
US20150269482A1 (en) | 2014-03-24 | 2015-09-24 | Qualcomm Incorporated | Artificial neural network and perceptron learning using spiking neurons |
JP6324155B2 (ja) | 2014-03-27 | 2018-05-16 | Canon Inc. | Image processing apparatus, image processing method, and program |
US10223635B2 (en) | 2015-01-22 | 2019-03-05 | Qualcomm Incorporated | Model compression and fine-tuning |
US9904849B2 (en) | 2015-08-26 | 2018-02-27 | Digitalglobe, Inc. | System for simplified generation of systems for broad area geospatial object detection |
CN107563497B (zh) * | 2016-01-20 | 2021-03-19 | Cambricon Technologies Corporation Limited | Computing device and operation method for sparse artificial neural networks |
CN106991477B (zh) | 2016-01-20 | 2020-08-14 | Cambricon Technologies Corporation Limited | Artificial neural network compression encoding device and method |
JP6706326B2 (ja) | 2016-02-03 | 2020-06-03 | Google LLC | Compression of recurrent neural network models |
CN107315574B (zh) | 2016-04-26 | 2021-01-01 | Anhui Cambricon Information Technology Co., Ltd. | Device and method for performing matrix multiplication operations |
CN110188870B (zh) | 2016-04-27 | 2021-10-12 | Cambricon Technologies Corporation Limited | Device and method for performing artificial neural network self-learning operations |
CN111860813B (zh) | 2016-04-29 | 2024-01-16 | Cambricon Technologies Corporation Limited | Device and method for performing forward operations of a convolutional neural network |
CN107239823A (zh) | 2016-08-12 | 2017-10-10 | Beijing Deephi Technology Co., Ltd. | Device and method for implementing sparse neural networks |
US10621486B2 (en) * | 2016-08-12 | 2020-04-14 | Beijing Deephi Intelligent Technology Co., Ltd. | Method for optimizing an artificial neural network (ANN) |
CN107239825B (zh) | 2016-08-22 | 2021-04-09 | Xilinx Electronic Technology (Beijing) Co., Ltd. | Deep neural network compression method considering load balancing |
CN107220706A (zh) | 2016-12-29 | 2017-09-29 | Enbotai (Tianjin) Technology Co., Ltd. | Vehicle-mounted deep neural network optimization method based on parameter compression and structure compression |
CN107368885A (zh) | 2017-07-13 | 2017-11-21 | Beijing Zhixin Yuandong Technology Co., Ltd. | Network model compression method and device based on multi-granularity pruning |
CN107506722A (zh) | 2017-08-18 | 2017-12-22 | China University of Geosciences (Wuhan) | Facial emotion recognition method based on a deep sparse convolutional neural network |
- 2018-12-29: WO application PCT/CN2018/125801 (WO2019129302A1), status unknown
- 2018-12-29: EP application EP18894430.0A (EP3624019A4), active, pending
- 2019-11-27: US application US16/698,000 (US11704544B2), active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0631254A2 (en) * | 1993-06-14 | 1994-12-28 | Motorola, Inc. | Neural network and method of using same |
CN106126481A (zh) * | 2016-06-29 | 2016-11-16 | Huawei Technologies Co., Ltd. | Computing engine and electronic device |
CN107229967A (zh) * | 2016-08-22 | 2017-10-03 | Beijing Deephi Intelligent Technology Co., Ltd. | FPGA-based hardware accelerator and method for a sparse GRU neural network |
CN106447034A (zh) * | 2016-10-27 | 2017-02-22 | Institute of Computing Technology, Chinese Academy of Sciences | Neural network processor based on data compression, design method, and chip |
CN107220702A (zh) * | 2017-06-21 | 2017-09-29 | Beijing Tusimple Future Technology Co., Ltd. | Neural network optimization method and device |
Non-Patent Citations (1)
Title |
---|
See also references of EP3624019A4 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114341768A (zh) * | 2019-08-29 | 2022-04-12 | Micron Technology, Inc. | Operation mode register |
CN114341768B (zh) * | 2019-08-29 | 2023-03-31 | Micron Technology, Inc. | Operation mode register |
Also Published As
Publication number | Publication date |
---|---|
US20200175357A1 (en) | 2020-06-04 |
EP3624019A4 (en) | 2021-03-24 |
EP3624019A1 (en) | 2020-03-18 |
US11704544B2 (en) | 2023-07-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110197270B (zh) | | Integrated circuit chip device and related product |
TWI768167B (zh) | | Integrated circuit chip device and related product |
WO2019129302A1 (zh) | 2019-07-04 | Integrated circuit chip device and related product |
US11651202B2 (en) | | Integrated circuit chip device and related product |
US11710031B2 (en) | | Parallel processing circuits for neural networks |
TWI768168B (zh) | | Integrated circuit chip device and related product |
CN111767998B (zh) | | Integrated circuit chip device and related product |
CN110197274B (zh) | | Integrated circuit chip device and related product |
TWI787430B (zh) | | Integrated circuit chip device, chip, electronic apparatus, and neural network operation method |
CN111767996B (zh) | | Integrated circuit chip device and related product |
CN110197275B (zh) | | Integrated circuit chip device and related product |
CN110197265B (zh) | | Integrated circuit chip device and related product |
CN111767997B (zh) | | Integrated circuit chip device and related product |
CN110197266B (zh) | | Integrated circuit chip device and related product |
CN110197273B (zh) | | Integrated circuit chip device and related product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18894430; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2018894430; Country of ref document: EP; Effective date: 20191209 |
 | NENP | Non-entry into the national phase | Ref country code: DE |