CN111767998B - Integrated circuit chip device and related products - Google Patents

Integrated circuit chip device and related products

Info

Publication number
CN111767998B
CN111767998B (application CN202010617209.8A)
Authority
CN
China
Prior art keywords
data block
processing circuit
basic
processed
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010617209.8A
Other languages
Chinese (zh)
Other versions
CN111767998A (en)
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd filed Critical Shanghai Cambricon Information Technology Co Ltd
Priority to CN202010617209.8A priority Critical patent/CN111767998B/en
Publication of CN111767998A publication Critical patent/CN111767998A/en
Application granted granted Critical
Publication of CN111767998B publication Critical patent/CN111767998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Optimization (AREA)
  • Neurology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Logic Circuits (AREA)

Abstract

The present disclosure provides an integrated circuit chip device and related products. The integrated circuit chip device comprises a main processing circuit and a plurality of basic processing circuits. The main processing circuit comprises a first mapping circuit, and at least one of the plurality of basic processing circuits comprises a second mapping circuit; the first mapping circuit and the second mapping circuit are used to perform compression processing on data in neural network operations. The technical scheme provided by the disclosure has the advantages of a small amount of computation and low power consumption.

Description

Integrated circuit chip device and related products
Technical Field
The present disclosure relates to the field of neural networks, and more particularly, to an integrated circuit chip device and related products.
Background
Artificial neural networks (Artificial Neural Network, ANN) have been a growing research hotspot in the field of artificial intelligence since the 1980s. They abstract the network of nerve cells in the human brain from an information-processing perspective, build simple models, and form different networks according to different connection modes. In engineering and academia they are also commonly referred to as neural networks or neural-like networks. A neural network is an operational model formed by a large number of interconnected nodes (or neurons). Existing neural network operations are implemented on a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit), and such operations involve a large amount of computation and high power consumption.
Disclosure of Invention
The embodiments of the disclosure provide an integrated circuit chip device and related products, which can improve the processing speed and efficiency of a computing device.
In a first aspect, there is provided an integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit comprises a first mapping circuit, at least one circuit (namely, part or all of the basic processing circuits) of the plurality of basic processing circuits comprises a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for executing compression processing of each data in the neural network operation;
The main processing circuit is configured to acquire an input data block, a weight data block and a multiplication instruction, and to divide, according to the multiplication instruction, the input data block into a distribution data block and the weight data block into a broadcast data block; to determine, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block comprises the distribution data block and/or the broadcast data block; and to transmit the processed first data block to at least one of the basic processing circuits connected with the main processing circuit according to the multiplication instruction;
The basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, to execute the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and to transmit the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit; the second data block is the data block that the basic processing circuit determines to have received from the main processing circuit, and is associated with the processed first data block;
and the main processing circuit is used for processing the operation result to obtain the instruction result of the multiplication instruction.
In a second aspect, a neural network computing device is provided, the neural network computing device comprising one or more of the integrated circuit chip devices provided in the first aspect.
In a third aspect, there is provided a combination processing apparatus including: the neural network operation device, the universal interconnection interface and the universal processing device provided in the second aspect;
The neural network operation device is connected with the general processing device through the general interconnection interface.
In a fourth aspect, there is provided a chip integrating the apparatus of the first aspect, the apparatus of the second aspect or the apparatus of the third aspect.
In a fifth aspect, an electronic device is provided, the electronic device comprising the chip of the fourth aspect.
In a sixth aspect, there is provided a method of operating a neural network, the method being applied within an integrated circuit chip device, the integrated circuit chip device comprising: the integrated circuit chip device of the first aspect for performing operations of a neural network.
It can be seen that, in the embodiments of the disclosure, the mapping circuits compress the data blocks before the operation is performed, which saves transmission resources and calculation resources; the scheme therefore has the advantages of low power consumption and a small amount of computation.
Drawings
FIG. 1a is a schematic diagram of an integrated circuit chip device.
FIG. 1b is a schematic diagram of another integrated circuit chip device.
FIG. 1c is a schematic diagram of a basic processing circuit.
Fig. 2 is a schematic diagram of a matrix-by-vector flow.
Fig. 2a is a schematic diagram of a matrix multiplied by a vector.
Fig. 2b is a schematic diagram of a matrix-by-matrix flow.
Fig. 2c is a schematic diagram of the matrix Ai multiplied by the vector B.
Fig. 2d is a schematic diagram of matrix a multiplied by matrix B.
Fig. 2e is a schematic diagram of the matrix Ai multiplied by the matrix B.
FIG. 3 is a schematic diagram of a neural network chip according to an embodiment of the present disclosure.
Fig. 4a-4b are schematic diagrams of two mapping circuits according to an embodiment of the present disclosure.
Detailed Description
In order that those skilled in the art will better understand the present disclosure, the disclosure is described more completely below with reference to the appended drawings. The described embodiments are merely some, not all, of the embodiments of the disclosure. Based on the embodiments in this disclosure, all other embodiments obtained by a person of ordinary skill in the art without inventive effort fall within the scope of protection of this disclosure.
In the apparatus provided in the first aspect, the main processing circuit includes a first mapping circuit, at least one circuit of the plurality of basic processing circuits includes a second mapping circuit, and the first mapping circuit and the second mapping circuit are both used for performing compression processing of respective data in the neural network operation;
The main processing circuit is configured to acquire an input data block, a weight data block and a multiplication instruction, and to divide, according to the multiplication instruction, the input data block into a distribution data block and the weight data block into a broadcast data block; to determine, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block comprises the distribution data block and/or the broadcast data block; and to transmit the processed first data block to at least one of the basic processing circuits connected with the main processing circuit according to the multiplication instruction;
the basic processing circuits are configured to determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, to execute the operation in the neural network in a parallel mode according to the processed second data block to obtain an operation result, and to transmit the operation result to the main processing circuit through the basic processing circuit connected with the main processing circuit, where the second data block is the data block that the basic processing circuit determines to have received from the main processing circuit and is associated with the processed first data block; and the main processing circuit is configured to process the operation result to obtain the instruction result of the multiplication instruction.
In the apparatus provided in the first aspect, when the first data block includes a distribution data block and a broadcast data block, the main processing circuit is specifically configured to start the first mapping circuit to process the distribution data block and the broadcast data block to obtain a processed distribution data block and an identification data block associated with the distribution data block, and a processed broadcast data block and an identification data block associated with the broadcast data block; splitting the processed distribution data block and the identification data block associated with the distribution data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distributing the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks and the identification data blocks associated with the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;
The basic processing circuit is used for starting the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the basic data block and the identification data block associated with the broadcast data block; and processing the basic data block and the broadcast data block according to the connection identification data block, performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and transmitting the operation result to the main processing circuit.
The identification data block may be specifically represented by a direct index or a step index; optionally, it may also be represented as a List of Lists (LIL), a Coordinate list (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), ELLPACK (ELL), Hybrid (HYB), and the like.
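As an illustration of one of these representations, the following is a minimal Python sketch (not part of the patent) of converting a dense data block into coordinate-list (COO) form, keeping only the entries whose absolute value exceeds a threshold, in the spirit of the compression performed by the mapping circuits:

```python
def to_coo(block, first_threshold=0.0):
    """Convert a dense 2-D data block to coordinate-list (COO) form.

    Entries whose absolute value is <= first_threshold are dropped,
    mirroring the compression performed by the mapping circuits.
    Returns parallel lists of row indices, column indices, and values.
    """
    rows, cols, vals = [], [], []
    for i, row in enumerate(block):
        for j, x in enumerate(row):
            if abs(x) > first_threshold:
                rows.append(i)
                cols.append(j)
                vals.append(x)
    return rows, cols, vals
```

For instance, `to_coo([[1.0, 0.0], [0.0, 0.5]])` keeps only the two non-zero entries together with their coordinates.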
In the case where the identification data block is represented by a direct index, the identification data block may specifically be a data block formed of 0s and 1s, where 0 indicates that the absolute value of the corresponding data (such as a weight or an input neuron) in the data block is less than or equal to a first threshold, and 1 indicates that the absolute value is greater than the first threshold. The first threshold is set by user-side or device-side customization, for example 0.05 or 0.
In order to reduce the amount of data transmitted and improve transmission efficiency, when the main processing circuit sends data to the basic processing circuits, it may distribute only the target data in the basic data blocks together with the identification data blocks respectively associated with those basic data blocks; optionally, it may likewise broadcast only the target data in the processed broadcast data block together with the identification data block associated with the broadcast data block. The target data is the data in a data block (here, specifically, a processed distribution data block or a processed broadcast data block) whose absolute value is greater than the first threshold, or equivalently the non-zero data in the data block.
For example, suppose the distribution data block is a matrix of M1 rows and N1 columns and the basic data block is a matrix of M2 rows and N2 columns, where M1 > M2 and N1 > N2. Correspondingly, the identification data block associated with the distribution data block is also a matrix of M1 rows and N1 columns, and the identification data block associated with the basic data block is also a matrix of M2 rows and N2 columns. Taking a 2 x 2 basic data block as an example (the example matrices appear as figures in the original and are not reproduced here), with a first threshold of 0.05, the associated identification data block contains a 1 wherever the corresponding element's absolute value exceeds 0.05 and a 0 elsewhere. The processing of the data blocks by the first mapping circuit and the second mapping circuit will be described in detail later.
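The direct-index identification data block and the target-data extraction described above can be sketched as follows (illustrative Python with hypothetical matrix values; not part of the patent):

```python
def build_identification_block(block, first_threshold=0.05):
    """Direct-index identification data block: 1 where |x| > threshold, else 0."""
    return [[1 if abs(x) > first_threshold else 0 for x in row] for row in block]

def extract_target_data(block, mask):
    """Keep only the 'target data' (entries flagged 1 in the mask), in row-major order."""
    return [x for row, mrow in zip(block, mask)
            for x, m in zip(row, mrow) if m]
```

With a hypothetical 2 x 2 basic data block `[[1.0, 0.02], [0.0, 0.5]]` and a first threshold of 0.05, the identification data block is `[[1, 0], [0, 1]]` and the target data transmitted would be `[1.0, 0.5]`.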
In the apparatus provided in the first aspect, when the first data block includes a distribution data block, the main processing circuit is specifically configured to start the first mapping circuit to process the distribution data block to obtain a processed distribution data block and an identification data block associated with the distribution data block, or to start the first mapping circuit to process the distribution data block according to a prestored identification data block associated with the distribution data block to obtain a processed distribution data block; to split the processed distribution data block and the identification data block associated with it to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; to distribute the plurality of basic data blocks and the identification data blocks respectively associated with them to the basic processing circuits connected thereto; and to broadcast the broadcast data block to the basic processing circuits connected thereto;
The basic processing circuit is used for starting the second mapping circuit to process the broadcast data block according to the identification data block related to the basic data block, performing product operation on the processed broadcast data block and the basic data block to obtain an operation result, and sending the operation result to the main processing circuit.
In an optional embodiment, the main processing circuit is further specifically configured to split the broadcast data block, or the processed broadcast data block and the identification data block associated with the broadcast data block, to obtain a plurality of partial broadcast data blocks and the identification data blocks respectively associated with them; and to broadcast the partial broadcast data blocks and their associated identification data blocks to the basic processing circuits in one or more broadcasts, where the plurality of partial broadcast data blocks combine to form the broadcast data block or the processed broadcast data block.
Correspondingly, the basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identifier data block according to the identifier data block associated with the partial broadcast data block and the identifier data block associated with the basic data block; processing the partial broadcast data block and the basic data block according to the connection identification data to obtain a processed partial broadcast data block and a processed basic data block; and performing inner product operation on the processed partial broadcast data block and the processed basic data block.
The connection identification data block is obtained by performing an element-by-element AND operation on the identification data block associated with the basic data block and the identification data block associated with the partial broadcast data block. Optionally, the connection identification data block indicates the positions at which the data in both data blocks (specifically, the basic data block and the partial broadcast data block) have absolute values greater than the threshold. This will be described in more detail hereinafter.
For example, if the identification data block associated with the distribution data block is a 2 x 3 matrix and the identification data block associated with a partial broadcast data block is a 2 x 2 matrix (the example matrices appear as figures in the original and are not reproduced here), the connection identification data block is obtained by AND-ing the corresponding elements.
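The element-by-element AND that produces the connection identification data block can be sketched as follows (illustrative Python, assuming the two masks have already been brought to the same shape; not part of the patent):

```python
def connection_identification(mask_a, mask_b):
    """Element-wise AND of two same-shape direct-index identification data blocks.

    The result flags only the positions where BOTH data blocks hold
    above-threshold (retained) data.
    """
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]
```

For instance, AND-ing `[[1, 0], [1, 1]]` with `[[1, 1], [0, 1]]` yields `[[1, 0], [0, 1]]`, so only two positions participate in the subsequent inner product.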
In an optional embodiment, the main processing circuit is further specifically configured to split the broadcast data block to obtain a plurality of partial broadcast data blocks, and to broadcast the plurality of partial broadcast data blocks to the basic processing circuits in one or more broadcasts; the plurality of partial broadcast data blocks combine to form the broadcast data block or a processed broadcast data block.
Correspondingly, the basic processing circuit is specifically configured to process the partial broadcast data block according to the identification data block associated with the basic data block to obtain a processed partial broadcast data block; and performing inner product operation on the basic data block and the processed partial broadcast data block.
In the apparatus provided in the first aspect, when the first data block includes a broadcast data block, the main processing circuit is specifically configured to start the first mapping circuit to process the broadcast data block to obtain a processed broadcast data block and an identification data block associated with the broadcast data block, or to start the first mapping circuit to process the broadcast data block according to a prestored identification data block associated with the broadcast data block to obtain a processed broadcast data block; to split the distribution data block to obtain a plurality of basic data blocks; to distribute the plurality of basic data blocks to the basic processing circuits connected thereto; and to broadcast the processed broadcast data block and the identification data block associated with the broadcast data block to the basic processing circuits connected thereto;
The basic processing circuit is used for starting the second mapping circuit to process the basic data block according to the identification data block associated with the broadcast data block to obtain a processed basic data block; and performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and sending the operation result to the main processing circuit.
In an optional embodiment, the main processing circuit is further specifically configured to split the processed broadcast data block and the identification data block associated with the broadcast data block to obtain a plurality of partial broadcast data blocks and the identification data blocks respectively associated with them, and to broadcast these to the basic processing circuits in one or more broadcasts; the plurality of partial broadcast data blocks combine to form the broadcast data block or a processed broadcast data block.
Correspondingly, the basic processing circuit is specifically configured to process the basic data block according to the identification data block associated with the partial broadcast data block to obtain a processed basic data block; and performing inner product operation on the processed basic data block and the partial broadcast data block.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform a product operation on the basic data block and the broadcast data block to obtain a product result, accumulate the product result to obtain an operation result, and send the operation result to the main processing circuit; and the main processing circuit is used for obtaining an accumulation result after accumulating the operation result and arranging the accumulation result to obtain the instruction result.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to send the broadcast data block (specifically, the broadcast data block or the processed broadcast data block) to the base processing circuit connected thereto through one broadcast.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the basic data block (which may be the basic data block or the processed basic data block) and the broadcast data block to obtain an inner product processing result, accumulate the inner product processing result to obtain an operation result, and send the operation result to the main processing circuit.
In the apparatus provided in the first aspect, the main processing circuit is configured to, when the operation result is a result of the inner product processing, accumulate the operation result to obtain an accumulated result, and arrange the accumulated result to obtain the instruction result.
In the apparatus provided in the first aspect, the main processing circuit is specifically configured to divide the broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the base processing circuit through a plurality of times; the plurality of partial broadcast data blocks are combined to form the broadcast data block.
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to perform inner product processing on the partial broadcast data block (specifically, the partial broadcast data block or the processed partial broadcast data block) and the basic data block to obtain an inner product processing result, accumulate the inner product processing result to obtain a partial operation result, and send the partial operation result to the main processing circuit. The inner product processing may specifically be: if the elements of the partial broadcast data block are the first two elements of matrix B, namely b10 and b11, and the basic data block contains the first two elements of the first row of the input data matrix A, namely a10 and a11, then the inner product = a10 * b10 + a11 * b11.
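The inner product in this example can be sketched as follows (illustrative Python; the a10/b10 element names follow the notation above, and the numeric values in the usage note are hypothetical):

```python
def inner_product(basic, broadcast):
    """Inner product of a basic data block row with a partial broadcast data block.

    With basic = [a10, a11] and broadcast = [b10, b11], this computes
    a10 * b10 + a11 * b11, as in the example in the text.
    """
    return sum(a * b for a, b in zip(basic, broadcast))
```

For example, `inner_product([2.0, 3.0], [4.0, 5.0])` evaluates 2.0 * 4.0 + 3.0 * 5.0 = 23.0.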
In the apparatus provided in the first aspect, the basic processing circuit is specifically configured to multiplex n times of the partial broadcast data blocks, perform inner product operations of the partial broadcast data blocks and the n basic data blocks to obtain n partial processing results, respectively accumulate the n partial processing results to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
In the apparatus provided in the first aspect, the main processing circuit includes: a master register or master on-chip cache circuit;
The base processing circuit includes: basic registers or basic on-chip cache circuits.
In the apparatus provided in the first aspect, the main processing circuit includes: a vector operator circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, a first mapping circuit, or a data rearrangement circuit.
In the apparatus provided in the first aspect, the branch processing circuit includes a plurality of branch processing circuits, the main processing circuit is connected to the plurality of branch processing circuits, respectively, and each branch processing circuit is connected to at least one base processing circuit.
In the apparatus provided in the first aspect, the basic processing circuit is further specifically configured to forward the broadcast data block and the basic data block to other basic processing circuits to perform first data processing and then perform inner product operation to obtain an operation result, and send the operation result to the main processing circuit;
and the main processing circuit is used for processing the operation result to obtain the data block to be calculated and an instruction result of the operation instruction.
In the apparatus provided in the first aspect, the data block may be represented by a tensor, which may specifically be: vector, matrix, three-dimensional data block, four-dimensional data block, and n-dimensional data block.
In the apparatus provided in the first aspect, if the operation instruction is a multiplication instruction, the main processing circuit determines that a multiplier data block is a broadcast data block and a multiplicand data block is a distribution data block;
If the operation instruction is a convolution instruction, the main processing circuit determines that the input data block is a broadcast data block, and the convolution kernel is a distribution data block.
In the method provided in the sixth aspect, the operation of the neural network includes: one or any combination of a convolution operation, a matrix-multiply-matrix operation, a matrix-multiply-vector operation, a bias operation, a fully connected operation, a GEMM operation, a GEMV operation and an activation operation.
Referring to fig. 1a, fig. 1a is a schematic structural diagram of an integrated circuit chip device, as shown in fig. 1a, the chip device includes: main processing circuitry, basic processing circuitry, and branch processing circuitry (optional). Wherein,
The main processing circuit may include a register and/or an on-chip buffer circuit, and may further include a control circuit, a vector arithmetic unit circuit, an ALU (Arithmetic and Logic Unit) circuit, an accumulator circuit, a DMA (Direct Memory Access) circuit, etc.; in practical applications, the main processing circuit may also include other circuits such as a conversion circuit (for example, a matrix transpose circuit), a data rearrangement circuit, an activation circuit, etc.;
Optionally, the main processing circuit may include a first mapping circuit, which can process received or transmitted data to obtain processed data and identification (mask) data associated with that data. The mask data indicates whether the absolute value of the data is greater than a preset threshold: 0 indicates that the absolute value of the data is less than or equal to the preset threshold, and 1 indicates that it is greater than the preset threshold. The preset threshold is set by the user side or the terminal device side, for example 0.1 or 0.05. In practical applications, the first mapping circuit may discard data whose value is 0 or not greater than the preset threshold (e.g., 0.1), or set such data to 0. This reduces the amount of data transmitted from the main processing circuit to the basic processing circuits and the amount of computation performed on the data in the basic processing circuits, improving data processing efficiency. The present disclosure is not limited to the specific form of the first mapping circuit described above. A specific implementation of the first mapping circuit is explained below.
For example, suppose the input data of the main processing circuit is a matrix data block (the example matrices appear as figures in the original and are not reproduced here). After processing by the first mapping circuit, the processed matrix data block and the identification data block associated with the matrix data block are obtained. The specific processing of the first mapping circuit will be described later.
Accordingly, when the main processing circuit distributes data to the basic processing circuit, it may transmit only the two retained values, 1 and 0.5, instead of all eight elements of the processed matrix data block; at the same time, the identification data block associated with the matrix data block is sent to the basic processing circuit, so that the basic processing circuit can determine, from the received identification data block and the two received values (1 and 0.5), the positions those values occupy in the original matrix data block. That is, the basic processing circuit can restore the matrix data block processed in the main processing circuit from the received identification data block and the received data.
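The transmission scheme described above can be sketched as follows. This is a hypothetical illustration, not the patent's own implementation: the function names and the example values (an 8-element block whose two retained values are 1 and 0.5, matching the example) are assumptions.

```python
# Sketch of the mask-based compression described above: the "first mapping
# circuit" keeps only values whose absolute value exceeds a preset
# threshold, and emits a 0/1 identification (mask) sequence recording
# their positions; the receiver restores the processed block from both.

def compress(block, threshold):
    """Return (target_values, mask) for a flattened data block."""
    mask = [1 if abs(x) > threshold else 0 for x in block]
    targets = [x for x, m in zip(block, mask) if m]
    return targets, mask

def restore(targets, mask):
    """Rebuild the processed block from the transmitted values and mask."""
    it = iter(targets)
    return [next(it) if m else 0 for m in mask]

block = [0, 1, 0, 0, 0, 0, 0.5, 0]      # 8 elements, two retained values
targets, mask = compress(block, 0.05)
assert targets == [1, 0.5]              # only two values are transmitted
assert restore(targets, mask) == [0, 1, 0, 0, 0, 0, 0.5, 0]
```

Only `targets` and `mask` cross the link between the circuits, which is the data-volume reduction the passage claims.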
The main processing circuit also comprises a data transmitting circuit and a data receiving circuit or interface. The data transmitting circuit may integrate a data distributing circuit and a data broadcasting circuit; in practical applications, the two may also be provided separately. Likewise, the data transmitting circuit and the data receiving circuit may be integrated together to form a data transceiver circuit. Broadcast data is data that needs to be sent to every basic processing circuit; distribution data is data that needs to be selectively sent to a part of the basic processing circuits, the specific selection being determined by the main processing circuit according to its load and the calculation mode. In the broadcast transmission scheme, the broadcast data is transmitted to each basic processing circuit in broadcast form (in practical applications, the broadcast data may be transmitted to each basic processing circuit in a single broadcast or in multiple broadcasts; the number of broadcasts is not limited in the embodiments of the present application).
When distributing data, the control circuit of the main processing circuit transmits data to some or all of the basic processing circuits; the data may be identical or different. Specifically, if the data is sent by distribution, the data received by each receiving basic processing circuit may differ, while some of the basic processing circuits may receive the same data;
specifically, when broadcasting data, the control circuit of the main processing circuit transmits the data to some or all of the basic processing circuits, and each basic processing circuit receiving the data receives the same data.
Alternatively, the vector operator circuit of the main processing circuit may perform vector operations, including but not limited to: addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of a vector and a constant; or any operation performed on each element of a vector. A continuous operation may be, for example, addition, subtraction, multiplication or division of a vector and a constant, activation, accumulation, and so on.
Each basic processing circuit may include a base register and/or a base on-chip cache circuit; each basic processing circuit may further include an inner product operator circuit, a vector operator circuit, an accumulator circuit, and the like. The inner product operator circuit, the vector operator circuit and the accumulator circuit may be integrated together, or may be provided individually.
The chip arrangement may optionally further comprise one or more branch processing circuits. Where a branch processing circuit is provided, the main processing circuit is connected to the branch processing circuit and the branch processing circuit is connected to the basic processing circuits; the inner product operator circuit of the basic processing circuit performs inner product operations between data blocks; the control circuit of the main processing circuit controls the data receiving circuit or the data transmitting circuit to receive and transmit external data, and controls the data transmitting circuit to distribute the external data to the branch processing circuit; and the branch processing circuit transmits and receives data for the main processing circuit or the basic processing circuit. The architecture shown in fig. 1a is suitable for the computation of complex data: because the number of units that can be connected to the main processing circuit is limited, branch processing circuits are added between the main processing circuit and the basic processing circuits to give access to more basic processing circuits and thereby enable computation on complex data blocks. The connection structure of the branch processing circuits and the basic processing circuits may be arbitrary and is not limited to the H-type structure of fig. 1a. Alternatively, the path from the main processing circuit to the basic processing circuits is a broadcast or distribution structure, and the path from the basic processing circuits to the main processing circuit is a gather structure. Broadcast, distribution and gather are defined as follows: in a distribution or broadcast structure, the number of basic processing circuits is greater than the number of main processing circuits, i.e., one main processing circuit corresponds to a plurality of basic processing circuits. That is, the structure from the main processing circuit to the plurality of basic processing circuits is broadcast or distribution, whereas the structure from the plurality of basic processing circuits to the main processing circuit may be a gather structure.
The basic processing circuit receives data distributed or broadcast by the main processing circuit, stores it in its on-chip cache, may perform operations on it to generate a result, and may send data to the main processing circuit. Optionally, the basic processing circuit may also first process the received data, store the processed data in the on-chip cache, perform operations on the processed data to generate a result, and optionally send the processed data to other basic processing circuits or to the main processing circuit.
Optionally, each basic processing circuit may include a second mapping circuit, or the second mapping circuit may be configured in only a part of the basic processing circuits; the second mapping circuit may be used to process (i.e., compress) received or transmitted data. The present invention is not limited to the specific form of the second mapping circuit described above; its implementation is described in detail below.
Alternatively, the vector operator circuit of the basic processing circuit may perform a vector operation on two vectors (either or both of which may be processed vectors); in practical applications, the inner product operator circuit of the basic processing circuit may perform an inner product operation on the two vectors, and the accumulator circuit may accumulate the results of the inner product operation.
In one alternative, the two vectors may be stored in on-chip caches and/or registers, and the basic processing circuit may extract the two vectors to perform the operation as needed for the actual computation. The operation includes, but is not limited to: inner product operations, multiplication operations, addition operations, or other operations.
In one alternative, the results of the inner product operation may be accumulated onto an on-chip cache and/or register; the alternative scheme has the advantages of reducing the data transmission quantity between the basic processing circuit and the main processing circuit, improving the operation efficiency and reducing the data transmission power consumption.
In an alternative, the result of the inner product operation is not accumulated and is directly transmitted as a result; the technical scheme has the advantages that the operation amount in the basic processing circuit is reduced, and the operation efficiency of the basic processing circuit is improved.
In an alternative scheme, each basic processing circuit can execute inner product operation of multiple groups of two vectors, and can also respectively accumulate the results of the multiple groups of inner product operation;
in one alternative, multiple sets of two vector data may be stored in on-chip caches and/or registers;
in one alternative, the results of the multiple sets of inner-product operations may be accumulated in on-chip caches and/or registers, respectively;
in an alternative scheme, the results of the inner product operations of each group can be directly transmitted as the results without accumulation;
In one alternative, each basic processing circuit may perform inner product operations between the same vector and a plurality of vectors ("one-to-many" inner products, i.e., one of the two vectors in each of the plural sets of inner products is shared), and accumulate the inner product result corresponding to each vector separately. With this scheme, the same set of weights can be used repeatedly to calculate different input data, which increases data reuse, reduces the amount of data transferred within the basic processing circuit, improves calculation efficiency, and reduces power consumption.
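The "one-to-many" scheme can be sketched as follows. This is an illustrative assumption, not the patent's circuit: the shared vector stands in for a reused weight row, and the per-vector accumulation mirrors the accumulator circuit's role.

```python
# Sketch of the "one-to-many" inner product: one shared vector (e.g. a
# weight row) is kept locally and multiplied against many input vectors,
# with the result for each vector accumulated separately.

def one_to_many_inner(shared, vectors):
    """Inner product of one shared vector against each of several vectors."""
    results = []
    for v in vectors:
        acc = 0
        for a, b in zip(shared, v):   # inner product operator circuit
            acc += a * b              # accumulator circuit
        results.append(acc)
    return results

shared = [1, 2, 3]                    # reused across all groups
inputs = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]
assert one_to_many_inner(shared, inputs) == [1, 2, 6]
```

The shared vector is read once and reused for every group, which is the data-multiplexing benefit the passage describes.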
Specifically, among the data used to calculate the inner products, the shared vector of each set and the other, non-shared vector of each set (i.e., the vector that differs between the sets) may come from different sources:
In one alternative, the sets of shared vectors are broadcast or distributed from the main processing circuit or branch processing circuit in calculating the inner product;
In one alternative, each set of shared vectors comes from an on-chip cache when computing the inner product;
In one alternative, the shared sets of vectors come from registers when the inner product is calculated;
in one alternative, in calculating the inner product, another unshared vector of each group is broadcast or distributed from the main processing circuit or the branch processing circuit;
in one alternative, in calculating the inner product, another unshared vector of each group is from the on-chip cache;
in one alternative, in calculating the inner product, another unshared vector for each group comes from a register;
In one alternative, while performing multiple sets of inner product operations, each set's shared vector may be kept in any of the on-chip caches and/or registers of the basic processing circuit;
in one alternative, one copy of the shared vector may be kept for each set of inner products;
in one alternative, only a single copy of the shared vector may be kept;
Specifically, the results of the multiple sets of inner product operations may be accumulated in on-chip caches and/or registers, respectively;
specifically, the results of each group of inner product operations may be directly transmitted as the results without accumulation;
In an alternative, the vector or matrix involved in the basic processing circuit may be a vector or matrix processed by the second mapping circuit, as will be explained in detail later.
Referring to fig. 1a, a structure is shown that includes a main processing circuit (capable of performing vector operations) and multiple basic processing circuits (capable of performing inner product operations). The advantage of this combination is that the device can use the basic processing circuits to perform matrix-vector multiplication while using the main processing circuit to perform any other vector operation, so that more operations can be completed more quickly with a limited hardware configuration; this reduces the number of data transfers with the outside of the device, improves calculation efficiency, and reduces power consumption. In addition, the chip may place a first mapping circuit in the main processing circuit to process the data of the neural network, for example rejecting first input data whose absolute value is smaller than or equal to a preset threshold, and may obtain the mask data associated with the first input data, the mask data indicating whether the absolute value of the first input data is greater than the preset threshold. Reference may be made to the foregoing embodiments, which are not repeated here. This design has the advantages of reducing the amount of data transmitted to the basic processing circuit, reducing the amount of data calculation in the basic processing circuit, increasing the data processing rate, and reducing power consumption.
The second mapping circuit may be configured in the basic processing circuit to perform processing of data in the neural network, for example, processing the second input data according to mask data associated with the first input data or selecting first input data and second input data with absolute values greater than a preset threshold according to mask data associated with the first input data and mask data associated with the second input data, and so on. For specific processing of the data by the first mapping circuit and the second mapping circuit, see details below.
Optionally, the first mapping circuit and the second mapping circuit are each configured to process data and may be placed in any one or more of the following circuits: the main processing circuit, the branch processing circuit, the basic processing circuit, and so on. In this way the amount of data to be calculated in neural network computation can be reduced, and the chip can dynamically assign data compression processing to a circuit according to the calculation amount (i.e., load) of each circuit (mainly the main processing circuit and the basic processing circuits), which reduces the complexity of data calculation and the power consumption, while the dynamic assignment of data processing leaves the calculation efficiency of the chip unaffected. The manner of assignment includes, but is not limited to: load balancing, minimum-load allocation, and the like.
Referring to the apparatus shown in fig. 1b, which is a computing apparatus without branch processing circuits, the apparatus comprises: a main processing circuit and N basic processing circuits, where the main processing circuit (whose specific structure is shown in fig. 1c) may be connected to the N basic processing circuits directly or indirectly. In an indirect connection, one alternative scheme may include N/4 branch processing circuits as shown in fig. 1a, each branch processing circuit being connected to 4 basic processing circuits; for the main processing circuit and the N basic processing circuits, reference may be made to the description of fig. 1a above, and basic processing circuits may also be arranged under the branch processing circuits. In addition, the number of basic processing circuits connected to each branch processing circuit is not limited to 4, and the manufacturer may configure it according to actual needs. The main processing circuit and the N basic processing circuits may each be provided with a first mapping circuit and a second mapping circuit; specifically, the main processing circuit may include the first mapping circuit while the N basic processing circuits, or a part of them, include the second mapping circuit; alternatively, the main processing circuit may include both the first mapping circuit and the second mapping circuit, or the N basic processing circuits, or a part of them, may include both the first mapping circuit and the second mapping circuit.
The main processing circuit may dynamically assign the execution body of the data compression processing step according to the neural network calculation instruction. Specifically, the main processing circuit may decide whether to execute the compression step on received data according to its own load; in particular, a plurality of load intervals may be set, each interval corresponding to an execution body of the data compression step. Taking 3 intervals as an example: in interval 1 the load value is low, and the data compression step may be executed by the N basic processing circuits or by the main processing circuit alone; in interval 2 the load value lies between those of intervals 1 and 3, and the main processing circuit may execute the step alone; in interval 3 the load value is high, and the main processing circuit and the N basic processing circuits may jointly execute the step. The assignment may be made explicitly: for example, the main processing circuit may be configured with a special instruction, and when a basic processing circuit receives this special instruction it executes the data compression step, while when it does not receive it, it does not. The assignment may also be made implicitly: for example, when a basic processing circuit receives sparse data (i.e., data containing zeros, or containing more than a preset number of values smaller than a preset threshold) and determines that an inner product operation is to be performed, it first compresses the sparse data.
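The interval-based assignment might be dispatched as below. This is a speculative sketch: the interval boundaries (0.3 and 0.7), the return labels, and the mapping of intervals to execution bodies are all assumptions layered on the passage's three-interval example.

```python
# Illustrative dispatch of the compression step's execution body from a
# normalized load value. Boundaries and labels are assumed, not specified.

def choose_compression_executor(load, low=0.3, high=0.7):
    """Map a load value onto one of three intervals (boundaries assumed)."""
    if load < low:
        return "basic"        # interval 1: offload to the N basic circuits
    elif load < high:
        return "main"         # interval 2: main processing circuit alone
    else:
        return "main+basic"   # interval 3: executed jointly
```

For instance, `choose_compression_executor(0.9)` would select joint execution under these assumed boundaries.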
The data compression processing to which the present application relates is performed in the first mapping circuit and the second mapping circuit described above. It should be appreciated that, since a neural network is a computation- and memory-intensive algorithm, the more weights there are, the greater the amounts of calculation and memory access become. In particular, when weights are small (e.g., equal to 0 or smaller than a set value), the data carrying those small weights should be compressed in order to increase the calculation rate and reduce overhead. In practical applications, data compression is most effective when applied to sparse neural networks, for example by reducing the workload of data calculation, reducing data overhead, and increasing the data calculation rate.
Taking input data as an example, specific embodiments related to the data compression process are described. The input data includes, but is not limited to, at least one input neuron and/or at least one weight.
In the first embodiment:
After the first mapping circuit receives the first input data (specifically, a data block to be calculated sent by the main processing circuit, such as a distribution data block or a broadcast data block), it may process the first input data to obtain the processed first input data and the identification (mask) data associated with the first input data, the mask data indicating whether the absolute value of the first input data is greater than a first threshold, such as 0.5 or 0.
Specifically, when the absolute value of the first input data is greater than the first threshold, that input data is retained; otherwise the first input data is deleted or set to 0. For example, the input is a matrix data block and the first threshold is 0.05; after processing by the first mapping circuit, the processed matrix data block and the identification data block (also referred to as a mask matrix) associated with the matrix data block are obtained (matrix figures omitted in the source).
Further, in order to reduce the amount of data transmitted, when the main processing circuit distributes data to the basic processing circuits connected to it, it may send only the target data in the processed matrix data block (1, 0.06 and 0.5 in this example) together with the identification data block associated with the matrix data block. In a specific implementation, the main processing circuit may distribute the target data of the processed matrix data block to the basic processing circuit according to a set rule, for example sending the target data sequentially in row order or in column order, which the present application does not limit. Accordingly, after receiving the target data and the identification data block corresponding to the target data, the basic processing circuit restores them to the processed matrix data block according to the same set rule (such as row order). For example, in this case the basic processing circuit can recover, from the received data (1, 0.06 and 0.5) and the identification data block, the matrix data block corresponding to that data, i.e., the matrix data block processed by the first mapping circuit in the main processing circuit (matrix figures omitted in the source).
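Since the example matrices were figures that did not survive extraction, the row-order scheme can be illustrated with a made-up 3×3 block that shares the passage's retained values (1, 0.06, 0.5) and threshold (0.05); the matrix contents here are assumptions.

```python
# Row-order distribution and restoration of a processed matrix data block:
# targets are emitted in row-major order, and the receiver rebuilds the
# processed block by walking the mask matrix in the same order.

def compress_matrix(mat, threshold):
    mask = [[1 if abs(x) > threshold else 0 for x in row] for row in mat]
    targets = [x for row, mrow in zip(mat, mask)
                 for x, m in zip(row, mrow) if m]   # row-major order
    return targets, mask

def restore_matrix(targets, mask):
    it = iter(targets)
    return [[next(it) if m else 0 for m in row] for row in mask]

mat = [[1, 0.02, 0], [0, 0.06, 0.01], [0.03, 0, 0.5]]
targets, mask = compress_matrix(mat, 0.05)
assert targets == [1, 0.06, 0.5]
assert restore_matrix(targets, mask) == [[1, 0, 0], [0, 0.06, 0], [0, 0, 0.5]]
```

Because both sides agree on row order, the mask matrix alone fixes each target value's position.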
In an embodiment of the present invention, the first input data may be a distribution data block and/or a broadcast data block.
Correspondingly, the second mapping circuit can process the second input data by utilizing the identification data associated with the first input data, so as to obtain processed second input data; wherein the first input data is different from the second input data. For example, when the first input data is at least one weight, then the second input data may be at least one input neuron; or when the first input data is at least one input neuron, the second input data may be at least one weight.
In an embodiment of the present invention, the second input data is different from the first input data, and the second input data may be any one of the following: distribution data blocks, basic data blocks, broadcast data blocks, and partial broadcast data blocks.
For example, when the first input data is a distribution data block, the second input data is a partial broadcast data block. Assume that the second input data is a matrix data block; after processing it with the mask matrix of the example above, the processed partial broadcast data block is obtained (matrix figures omitted in the source). Since in practical applications the dimensions of the matrix data blocks involved in the input data are large, the present application is only illustrative and not restrictive in this respect.
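Processing one operand with the other operand's mask can be sketched as follows. The data values and names here are illustrative assumptions; the point is that a zeroed weight makes the paired product zero regardless of the neuron value, so the neuron can be dropped too.

```python
# Sketch of the second mapping circuit's use of the first input's mask:
# elements of the second input data (e.g. input neurons) are kept only
# where the mask derived from the first input data (e.g. weights) is 1.

def apply_mask(block, mask):
    """Keep an element of `block` only where the paired mask bit is 1."""
    return [x if m else 0 for x, m in zip(block, mask)]

weights_mask = [1, 0, 0, 1]           # from the first mapping circuit
neurons = [0.2, 0.9, 0.4, 0.7]        # second input data
assert apply_mask(neurons, weights_mask) == [0.2, 0, 0, 0.7]
```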
In a second embodiment:
The first mapping circuit may be configured to process the first input data and the second input data to obtain the processed first input data, first identification (mask) data associated with the first input data, the processed second input data, and second identification (mask) data associated with the second input data. The first or second mask data indicates whether the absolute value of the first or second input data, respectively, is greater than a second threshold; the second threshold is set by the user or the device, for example 0.05 or 0.
The processed first input data or the processed second input data may be compressed input data or unprocessed input data. For example, when the first input data is a distribution data block such as the matrix data block of the example above, the first mapping circuit yields a processed distribution data block, which may be either the original matrix data block or the compressed matrix data block (matrix figures omitted in the source). It will be appreciated that, since the present application aims to reduce the amount of data transmitted and to improve the efficiency of data processing in the basic processing circuit, the processed input data (e.g., processed basic data blocks or partial broadcast data blocks) should preferably be compressed data. Preferably, the data sent by the main processing circuit to the basic processing circuit is specifically the target data in the processed input data, where the target data may be data whose absolute value is greater than a preset threshold, or non-zero data, etc.
Correspondingly, in the basic processing circuit, the second mapping circuit may obtain connection identification data from the first identification data associated with the first input data and the second identification data associated with the second input data; the connection identification data indicates the data whose absolute value is greater than a third threshold in both the first input data and the second input data, the third threshold being set by the user or the device, for example 0.05 or 0. Further, the second mapping circuit may process the received first input data and second input data according to the connection identification data, thereby obtaining the processed first input data and the processed second input data.
For example, the first input data is a matrix data block and the second input data is also a matrix data block. After processing by the first mapping circuit, the first identification data block associated with the first input data and the processed first input data block are obtained; correspondingly, the second identification data block associated with the second input data and the processed second input data block are obtained (matrix figures omitted in the source). Accordingly, to increase the data transmission rate, the main processing circuit may send to the basic processing circuit only the target data 1, 0.06 and 0.5 of the processed first input data block together with the first identification data block associated with it, and simultaneously the target data 1, 1.1, 0.6, 0.3 and 0.5 of the processed second input data block together with the second identification data block associated with it.
Correspondingly, after the basic processing circuit receives the data, the second mapping circuit performs an element-wise AND operation on the first identification data block and the second identification data block to obtain a connection identification data block; using the connection identification data block, it then processes the processed first input data block and the processed second input data block, respectively, to obtain the final first and second input data blocks (matrix figures omitted in the source). In a specific implementation, the basic processing circuit first determines, from the first identification data block and the received target data of the first data block, the first data block corresponding to that target data (i.e., the first data block processed by the first mapping circuit); correspondingly, it determines the second data block from the second identification data block and the received target data of the second data block. Then, having obtained the connection identification data block, the second mapping circuit performs element-wise AND operations between the connection identification data block and each of the two determined data blocks, yielding the first data block and the second data block processed by the second mapping circuit.
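The element-wise AND of the two masks can be sketched as follows. The mask and data values are illustrative assumptions; the point is that the connection mask keeps only positions where both operands survived their thresholds, so the subsequent inner product skips every term that is zero on either side.

```python
# Sketch of the connection identification data block: AND the two masks,
# then select only the jointly surviving positions of each operand.

def connection_mask(mask_a, mask_b):
    """Element-wise AND of two 0/1 identification sequences."""
    return [a & b for a, b in zip(mask_a, mask_b)]

def select(block, mask):
    """Gather the elements of `block` whose mask bit is 1."""
    return [x for x, m in zip(block, mask) if m]

mask_a = [1, 0, 1, 1]                 # first identification data
mask_b = [1, 1, 0, 1]                 # second identification data
conn = connection_mask(mask_a, mask_b)
assert conn == [1, 0, 0, 1]

a = [1.0, 0, 0.5, 0.3]                # restored first input data
b = [0.2, 0.9, 0, 0.7]                # restored second input data
# the inner product runs over the surviving positions only
dot = sum(x * y for x, y in zip(select(a, conn), select(b, conn)))
assert abs(dot - (1.0 * 0.2 + 0.3 * 0.7)) < 1e-12
```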
In a third embodiment:
The first mapping circuit is not arranged in the main processing circuit; instead, the main processing circuit sends third input data, together with pre-stored third identification data associated with the third input data, to the basic processing circuits connected to it. The basic processing circuits are provided with a second mapping circuit. A specific embodiment of the data compression process involved in the second mapping circuit is set forth below.
It should be appreciated that the third input data includes, but is not limited to, a basic data block, a partial broadcast data block, a broadcast data block, and the like. Likewise, in the neural network processor, the third input data may also be at least one weight and/or at least one input neuron, which the present application does not limit.
In the second mapping circuit, the second mapping circuit may process the third input data according to third identification data associated with the received third input data, so as to obtain processed third input data, so as to perform a related operation, such as an inner product operation, on the processed third input data.
For example, the third input data received by the second mapping circuit is a matrix data block, and a corresponding pre-stored third identification data block (also referred to as a mask matrix data block) is associated with it. The second mapping circuit then processes the third input data block according to the third identification data block to obtain the processed third input data block (matrix figures omitted in the source).
In addition, the input neurons and output neurons mentioned in the embodiments of the present invention do not refer to the neurons of the input layer and the output layer of the entire neural network. Rather, for any two adjacent layers of the network, the neurons in the lower layer of the feed-forward operation are the input neurons, and the neurons in the upper layer of the feed-forward operation are the output neurons. Taking a convolutional neural network as an example, let the network have L layers, with K = 1, 2, 3, …, L-1; for the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and its neurons are the input neurons, while the (K+1)-th layer is called the output layer and its neurons are the output neurons. That is, except for the top layer, each layer can serve as an input layer, and the next layer is the corresponding output layer.
In a fourth implementation:
The main processing circuit is not provided with a mapping circuit, and the basic processing circuit is provided with a first mapping circuit and a second mapping circuit. The data processing of the first mapping circuit and the second mapping circuit may be specifically described with reference to the foregoing first to third embodiments, and will not be described herein.
Optionally, there is also a fifth embodiment. In the fifth embodiment, the mapping circuit is not disposed in the basic processing circuit, and the first mapping circuit and the second mapping circuit are disposed in the main processing circuit, and the data processing of the first mapping circuit and the second mapping circuit can be specifically described with reference to the foregoing first embodiment to the third embodiment, which is not repeated herein. That is, the main processing circuit completes the compression processing of the data, and sends the processed input data to the base processing circuit, so that the base processing circuit performs corresponding arithmetic operations using the processed input data (specifically, the processed neurons and the processed weights).
The following illustrates specific structural diagrams of the mapping circuit. Two possible mapping circuits are shown in figs. 4a and 4b. The mapping circuit shown in fig. 4a comprises a comparator and a selector; the present application does not limit the number of comparators and selectors. Fig. 4a shows one comparator and two selectors, the comparator being used to determine whether the input data satisfies a preset condition. The preset condition may be set by the user or the device, for example that the absolute value of the input data is greater than or equal to a preset threshold. If the preset condition is satisfied, the comparator determines that the input data is allowed to be output, and the identification data associated with that input data is 1; otherwise the input data is not output, or is set to 0, and the associated identification data is 0. That is, after the comparator, the identification data associated with the input data is known.
Further, after the comparator has evaluated the preset condition on the input data, the obtained identification data may be input into the selector, so that the selector uses the identification data to determine whether to output the corresponding input data, i.e. to obtain the processed input data.
As shown in fig. 4a, taking the input data as a matrix data block as an example, the comparator may evaluate the preset condition for each data in the matrix data block, so as to obtain an identification data block (mask matrix) associated with the matrix data block. Further, the identification data block may be used in the first selector to screen the matrix data block: data in the matrix data block whose absolute value is greater than or equal to the preset threshold (i.e. that satisfies the preset condition) is retained, and the remaining data is deleted, so as to output the processed matrix data block. Optionally, the identification data block may also be used in the second selector to process other input data (for example, a second matrix data block), for example by an element-wise AND operation, so as to retain the data in the second matrix data block whose absolute value is greater than or equal to the preset threshold, and output the processed second matrix data block.
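The comparator-plus-selector behavior described above can be sketched in software as follows. This is an illustrative model only, not the hardware of this application; the function names and data are assumptions made for the example.

```python
# Illustrative model of the fig. 4a mapping circuit: a comparator builds a
# mask (identification data block) from the input matrix, a first selector
# compresses that matrix, and a second selector reuses the same mask on
# another input data block.

def comparator(block, threshold):
    """Mark each element whose absolute value satisfies the preset condition."""
    return [[1 if abs(x) >= threshold else 0 for x in row] for row in block]

def selector(block, mask):
    """Keep only the elements whose associated identification bit is 1."""
    return [[x for x, m in zip(row, mrow) if m == 1]
            for row, mrow in zip(block, mask)]

matrix = [[0.9, 0.0, -1.2],
          [0.1, 2.0,  0.0]]
mask = comparator(matrix, threshold=0.5)     # identification data block
processed = selector(matrix, mask)           # first selector output
second = [[3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
processed_second = selector(second, mask)    # second selector reuses the mask
```

Note that the second data block is screened purely by the first block's mask, which is the reuse pattern the second selector implements.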
It should be appreciated that, corresponding to the first and second embodiments described above, the specific structure of the first mapping circuit may include at least one comparator and at least one selector, such as the comparator and the first selector of fig. 4a; the specific structure of the second mapping circuit may include one or more selectors, such as the second selector of fig. 4a in the example above.
Fig. 4b shows a schematic diagram of another mapping circuit. As shown in fig. 4b, the mapping circuit includes one or more selectors; the number is not limited here. Specifically, the selector is configured to screen the input data according to the identification data associated with the input data, so as to output the data whose absolute value is greater than or equal to a preset threshold and delete/not output the other data, thereby obtaining the processed input data.
Taking the input data as a matrix data block as an example, the matrix data block and the identification data block associated with it are input into the mapping circuit, and the selector screens the matrix data block according to the identification data block, outputting the data whose absolute value is greater than or equal to the preset threshold and not outputting the remaining data, thereby outputting the processed matrix data block.
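The distinguishing feature of fig. 4b is that no comparator is present: the mask is supplied externally (e.g. pre-stored). A minimal sketch, with illustrative names and data:

```python
# Illustrative model of the fig. 4b mapping circuit: selector only, screening
# input data against an externally supplied identification data block.

def mapping_circuit(data, mask):
    """Output only the elements whose associated identification bit is 1."""
    return [x for x, m in zip(data, mask) if m == 1]

row = [0.0, 1.5, -0.7, 0.0, 2.2]
mask = [0, 1, 1, 0, 1]      # pre-stored mask associated with `row`
compressed = mapping_circuit(row, mask)   # [1.5, -0.7, 2.2]
```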
It should be appreciated that the structure shown in fig. 4b may be applied to the second mapping circuit in the third embodiment described above, i.e. the specific structure of the second mapping circuit in the third embodiment may include at least one selector. Similarly, the first mapping circuit and the second mapping circuit designed in the main processing circuit and the basic processing circuit may be cross-combined or split according to the functional components shown in fig. 4a and fig. 4b, and the present application is not limited thereto.
The following provides a method for implementing calculation by using the neural network device shown in fig. 1a. The calculation method may specifically be an operation mode of the neural network, for example the forward operation in neural network training. In practical application, depending on the input data, the forward operation may perform operations such as matrix multiplication, convolution, activation and transformation, all of which may be implemented by the device shown in fig. 1a.
Under the control of the control circuit, the first mapping circuit of the main processing circuit first performs compression processing on the data, and then transmits the processed data to the basic processing circuit for operation.
The main processing circuit transmits the data to be calculated to all or some of the basic processing circuits. Taking matrix-multiply-vector calculation as an example, the control circuit of the main processing circuit can split the matrix into columns as basic data; for example, an m×n matrix can be split into n vectors of m rows each, and the control circuit of the main processing circuit distributes the split n m-row vectors to a plurality of basic processing circuits. For a vector, the control circuit of the main processing circuit may broadcast the vector as a whole to each basic processing circuit. If the value of m is relatively large, the control circuit may further split the m×n matrix into x*n vectors; for example, with x=2, each m-row vector is split into 2 vectors of m/2 rows each. Taking m=1000 as an example, each of the n 1000-row vectors is equally divided into 2 vectors: the first 500 rows form a first vector and the last 500 rows form a second vector, and the control circuit broadcasts the 2 vectors to the plurality of basic processing circuits through 2 broadcasts.
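The splitting scheme above can be sketched as follows; this is an illustrative sketch under the reading that an m×n matrix is split into n column vectors of m rows, each of which may be further split into x sub-vectors. The function names are assumptions.

```python
# Illustrative sketch of the splitting performed by the control circuit of
# the main processing circuit: m x n matrix -> n column vectors of m rows;
# each column vector optionally split into x sub-vectors for broadcasting.

def split_columns(matrix):
    """Split an m x n matrix (list of m rows) into n column vectors."""
    return [list(col) for col in zip(*matrix)]

def split_vector(vec, x):
    """Split one m-row vector into x sub-vectors of m // x rows each."""
    step = len(vec) // x
    return [vec[i * step:(i + 1) * step] for i in range(x)]

m, n = 4, 3
matrix = [[r * n + c for c in range(n)] for r in range(m)]
columns = split_columns(matrix)          # n vectors, each of m rows
halves = split_vector(columns[0], x=2)   # two vectors of m/2 rows each
```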
The data transmission mode can be broadcasting or distributing, or any other possible transmission mode;
after receiving the data, the basic processing circuit processes the data through the second mapping circuit and then executes operation to obtain an operation result;
the basic processing circuit transmits the operation result back to the main processing circuit;
the operation result may be an intermediate operation result or a final operation result.
Using the apparatus shown in fig. 1a, a tensor multiplication operation can be performed, where a tensor is the same as the data block described above and may be any one or more of a matrix, a vector, a three-dimensional data block, a four-dimensional data block and a higher-dimensional data block; specific implementations of the matrix-multiply-vector and matrix-multiply-matrix operations are shown below with reference to fig. 2 and fig. 2b, respectively.
Performing a matrix-multiply-vector operation using the apparatus shown in fig. 1a (matrix-multiply-vector computes, for each row of the matrix, the inner product of that row with the vector, and places the results into a vector in the order of the corresponding rows):
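The parenthetical definition above reads directly as code; a minimal sketch with illustrative data:

```python
# Matrix-multiply-vector as defined above: the inner product of each row of
# S with the vector P, placed into an output vector in row order.

def mat_vec(S, P):
    return [sum(s * p for s, p in zip(row, P)) for row in S]

S = [[1, 2], [3, 4], [5, 6]]   # M = 3 rows, L = 2 columns
P = [10, 1]                    # length L
result = mat_vec(S, P)         # [12, 34, 56]
```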
The operation of multiplying a matrix S of size M rows and L columns by a vector P of length L is described below, as shown in fig. 2a (each row in matrix S is the same length as vector P, and the data in them correspond in position one to one), the neural network computing device having K basic processing circuits:
Referring to fig. 2, fig. 2 provides a method for implementing matrix multiplication vectors, which specifically may include:
Step S201, a control circuit of a main processing circuit distributes each row of data in a matrix S to one of K basic processing circuits, and the basic processing circuits store the received distributed data in on-chip caches and/or registers of the basic processing circuits;
In an alternative, the data of the matrix S is processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix S, so as to obtain the processed matrix S and a first identification (mask) matrix associated with the matrix S; or the first mapping circuit of the main processing circuit processes the matrix S according to a pre-stored first mask matrix associated with the matrix S to obtain the processed matrix S. Further, each row of data in the processed matrix S is sent, through the control circuit, to one or more of the K basic processing circuits together with the identification data associated with that row in the first mask matrix. When the main processing circuit sends data to the basic processing circuit, it may specifically send the data in the processed matrix S whose absolute value is greater than a preset threshold, or the non-zero data, so as to reduce the amount of data transmitted. For example, the set of rows of the processed matrix S distributed to the i-th basic processing circuit is Ai, with Mi rows in total; correspondingly, the identification matrix Bi corresponding to Ai is distributed at the same time, where Bi is a part of the first mask matrix and likewise contains Mi rows.
In an alternative, if the number of rows M of the matrix S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M of the K basic processing circuits; optionally, the identification data of the corresponding row in the first identification matrix is also transmitted at the same time;
In an alternative, if the number of rows M of the matrix S satisfies M > K, the control circuit of the main processing circuit distributes the data of one or more rows of the matrix S to each basic processing circuit. Optionally, the identification data of the corresponding rows in the first identification matrix is also transmitted at the same time;
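The two distribution cases above can be sketched together; round-robin assignment is one possible policy (the application does not fix a specific one), and the names are illustrative.

```python
# Sketch of distributing the rows of S over K basic processing circuits:
# when M <= K each used circuit receives one row; when M > K the i-th
# circuit receives the row set Ai (Mi rows). Round-robin is illustrative.

def distribute_rows(S, K):
    """Return A, where A[i] is the list of rows of S sent to circuit i."""
    A = [[] for _ in range(K)]
    for r, row in enumerate(S):
        A[r % K].append(row)
    return A

S = [[1], [2], [3], [4], [5]]   # M = 5 rows
A = distribute_rows(S, K=2)     # M > K: the circuits receive 3 and 2 rows
```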
the set of rows in S distributed to the ith basic processing circuit is Ai, with a total of Mi rows, as shown in fig. 2c for the calculations to be performed on the ith basic processing circuit.
In an alternative, in each basic processing circuit, for example in the i-th basic processing circuit, the received distribution data, for example the matrix Ai, may be saved in a register and/or on-chip cache of the i-th basic processing circuit; the method has the advantages of reducing the data transmission quantity of the distributed data, improving the calculation efficiency and reducing the power consumption.
Step S202, a control circuit of a main processing circuit transmits each part of the vector P to K basic processing circuits in a broadcast mode;
In one alternative, the data (portions) of the vector P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process the vector P, so as to obtain a processed vector P and a second identification (mask) matrix associated with the vector P. Or the first mapping circuit of the main processing circuit processes the vector P according to a second mask matrix associated with the pre-stored vector P to obtain the processed vector P. Further, the data (i.e. each portion) in the processed vector P and the identification data corresponding to the data and associated in the second mask matrix are sent to one or more of the K basic processing circuits together by the control circuit. When the main processing circuit sends data to the basic processing circuit, the data with absolute value larger than a preset threshold value in the processed vector P or non-0 data can be specifically sent to the basic processing circuit so as to reduce the data transmission quantity.
In an alternative, the control circuit of the main processing circuit may broadcast each portion of the vector P only once to the register or on-chip buffer of each basic processing circuit, and the ith basic processing circuit sufficiently multiplexes the data of the vector P obtained this time to complete the inner product operation corresponding to each row in the matrix Ai. The method has the advantages of reducing the data transmission quantity of repeated transmission of the vector P from the main processing circuit to the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
In an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit for multiple times, where the ith basic processing circuit does not multiplex the data of the vector P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai in multiple times; the vector P data transmission method has the advantages that the data transmission quantity of the vector P of single transmission in the basic processing circuit is reduced, the capacity of a buffer memory and/or a register of the basic processing circuit can be reduced, the execution efficiency is improved, the transmission power consumption is reduced, and the cost is reduced.
In an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the vector P to the register or on-chip buffer of each basic processing circuit for multiple times, and the ith basic processing circuit performs part multiplexing on the data of the vector P obtained each time to complete inner product operation corresponding to each row in the matrix Ai; the method has the advantages of reducing the data transmission quantity from the main processing circuit to the basic processing circuit, reducing the data transmission quantity in the basic processing circuit, improving the execution efficiency and reducing the transmission power consumption.
Step S203, the inner product operator circuits of the K basic processing circuits calculate the inner product of the data of the matrix S and the vector P; for example, the i-th basic processing circuit calculates the inner product of the data of the matrix Ai and the data of the vector P;
in a specific embodiment, the basic processing circuit receives the data in the processed matrix S and the identification data associated with the data in the first mask matrix; and also receives the data in the processed vector P. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the vector P according to the identification data in the received first mask matrix, and the processed data of the vector P is obtained. Further, the basic processing circuit enables the inner product operator circuit to perform inner product operation on the received data in the processed matrix S and the data of the processed vector P, so as to obtain a result of the inner product operation. For example, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with the Ai, and the vector P; at this time, the second mapping circuit can be started to process the vector P by using Bi to obtain a processed vector P; and then enabling the inner product arithmetic circuit to carry out inner product operation on the matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives the data in the processed vector P and the identification data associated with the data in the second mask matrix; and also receives the data in the processed matrix S. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix S according to the identification data in the received second mask matrix, and the processed data of the matrix S is obtained. Further, the basic processing circuit enables the inner product operator circuit to execute inner product operation on the received data of the processed vector P and the data in the processed matrix S, and a result of the inner product operation is obtained. For example, the ith basic processing circuit receives the matrix Ai, the processed vector P and the second identification matrix associated with the vector P; at this time, a second mapping circuit can be started to process the Ai by using a second identification matrix to obtain a processed matrix Ai; and then enabling the inner product arithmetic circuit to carry out inner product operation on the processed matrix Ai and the processed vector P.
In a specific embodiment, the basic processing circuit receives the data in the processed matrix S and the identification data associated with the data in the first mask matrix; and meanwhile, the data in the processed vector P and the identification data associated with the data in the second mask matrix are also received. Correspondingly, the basic processing circuit enables the second mapping circuit to obtain a relation identification matrix according to the received identification data in the first mask matrix and the received identification data in the second mask matrix; and then respectively processing the received data in the matrix S and the received data in the vector P by utilizing the identification data in the relation identification matrix to obtain the processed data of the matrix S and the processed data of the vector P. Further, the inner product operator circuit is enabled to execute inner product operation on the data in the processed matrix S and the data of the processed vector P, and a result of the inner product operation is obtained. For example, the i-th basic processing circuit receives the matrix Ai, the identification matrix Bi associated with the Ai, the vector P, and the second identification matrix associated with the vector P; at this time, the second mapping circuit may be enabled to obtain a relationship identification matrix by using Bi and the second identification matrix, and then the relationship identification matrix is used to process the matrix Ai and the vector P simultaneously or respectively, so as to obtain a processed matrix Ai and a processed vector P. Then, the inner product operator circuit is enabled to perform inner product operation on the processed matrix Ai and the processed vector P.
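The third case above can be sketched under the assumption that the "relation identification matrix" is the element-wise AND of the two masks, so that only positions kept by both operands contribute to the inner product. All names and data are illustrative.

```python
# Sketch of the second mapping circuit combining two masks: relation mask =
# element-wise AND; both operands are then compressed with it before the
# inner product operator circuit runs.

def relation_mask(mask_a, mask_b):
    return [a & b for a, b in zip(mask_a, mask_b)]

def compress(data, mask):
    return [x for x, m in zip(data, mask) if m == 1]

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

row_ai  = [2.0, 0.0, 3.0, 1.0]   # one row of Ai (kept dense here for clarity)
mask_ai = [1, 0, 1, 1]           # from the first mask matrix Bi
vec_p   = [5.0, 4.0, 0.0, 2.0]
mask_p  = [1, 1, 0, 1]           # from the second mask matrix
rel     = relation_mask(mask_ai, mask_p)   # positions kept by both operands
result  = inner_product(compress(row_ai, rel), compress(vec_p, rel))
```

Only the first and last positions survive both masks, so the result equals 2.0*5.0 + 1.0*2.0.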
Step S204, the accumulator circuits of the K basic processing circuits accumulate the results of the inner product operations to obtain accumulated results, and transmit the accumulated results back to the main processing circuit in fixed-point form.
In an alternative, the partial sum (a part of the accumulated result, for example f1*g1+f2*g2+f3*g3+f4*g4+f5*g5) obtained each time the basic processing circuit performs the inner product operation may be transmitted back to the main processing circuit for accumulation; this has the advantages of reducing the amount of operation inside the basic processing circuit and improving the operation efficiency of the basic processing circuit.
In an alternative, the partial sum obtained each time the basic processing circuit performs the inner product operation may be stored in a register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is finished; this has the advantages of reducing the amount of data transmitted between the basic processing circuit and the main processing circuit, improving the operation efficiency and reducing the data transmission power consumption.
In an alternative, the partial sum obtained each time the basic processing circuit performs the inner product operation may, in some cases, be accumulated into the partial sum stored in the register and/or on-chip cache of the basic processing circuit, and, in other cases, be transmitted to the main processing circuit for accumulation, and then be transmitted back to the main processing circuit after the accumulation is finished; this has the advantages of reducing the amount of data transmitted between the basic processing circuit and the main processing circuit, improving the operation efficiency, reducing the data transmission power consumption, reducing the amount of operation inside the basic processing circuit and improving the operation efficiency of the basic processing circuit.
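The first two accumulation options above differ only in where the partial sums are summed; a short sketch with illustrative chunking makes the equivalence of the final result explicit:

```python
# Sketch contrasting two accumulation options: either each partial sum is
# transmitted back and accumulated on the main processing circuit, or the
# basic processing circuit accumulates locally and transmits one value.
# The chunk size is illustrative.

def partial_sums(u, v, chunk):
    """Inner product computed chunk by chunk, one partial sum per chunk."""
    return [sum(a * b for a, b in zip(u[i:i + chunk], v[i:i + chunk]))
            for i in range(0, len(u), chunk)]

u = [1, 2, 3, 4]
v = [5, 6, 7, 8]
parts = partial_sums(u, v, chunk=2)
# Option 1: the main processing circuit accumulates the transmitted parts.
main_side = sum(parts)
# Option 2: the basic processing circuit accumulates locally, sends one value.
basic_side = sum(a * b for a, b in zip(u, v))
```

Both options yield the same accumulated result; they trade transmission volume against on-circuit storage and operation amount, as the alternatives above explain.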
Referring to FIG. 2b, the operation of matrix multiplication is performed using the apparatus shown in FIG. 1 a;
the operation of calculating the multiplication of a matrix S of size M rows and L columns and a matrix P of size L rows and N columns (each row in matrix S being the same length as each column of matrix P, as shown in fig. 2 d) is described below, the neural network calculation device having K basic processing circuits:
Step S201b, a control circuit of the main processing circuit distributes data of each row in the matrix S to one of K basic processing circuits, and the basic processing circuits store the received data in on-chip caches and/or registers;
In an alternative, the data of the matrix S is processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix S, so as to obtain a processed matrix S and a first identification (mask) matrix associated with the matrix S. Or the first mapping circuit of the main processing circuit processes the matrix S according to a first mask matrix associated with the pre-stored matrix S to obtain a processed matrix S. Further, each row of data in the processed matrix S is sent to one or more of the K basic processing circuits together with the identification data corresponding to the row of data and associated with the row of data in the first mask matrix through the control circuit. When the main processing circuit sends data to the basic processing circuit, the data with absolute value larger than a preset threshold value in the processed matrix S or non-0 data can be sent to the basic processing circuit, so that the data transmission quantity is reduced.
In an alternative, if the number of rows M of S satisfies M <= K, the control circuit of the main processing circuit distributes one row of the matrix S to each of M basic processing circuits; optionally, the identification data of the corresponding row in the first identification matrix is also transmitted at the same time;
In one alternative, if the number of rows M > K of S, the control circuit of the main processing circuit distributes the data of one or more rows of the S matrix to each of the base processing circuits, respectively. Optionally, identification data corresponding to a row in the first identification matrix by the one or more rows is also transmitted at the same time;
The set of rows of S distributed to the i-th basic processing circuit is called Ai, with Mi rows in total; fig. 2e shows the calculation to be performed on the i-th basic processing circuit.
In one alternative, in each basic processing circuit, for example, in the i-th basic processing circuit:
the received matrix Ai distributed by the main processing circuit is stored in an ith basic processing circuit register and/or an on-chip cache; the method has the advantages of reducing the subsequent data transmission quantity, improving the calculation efficiency and reducing the power consumption.
Step S202b, the control circuit of the main processing circuit transmits each part of the matrix P to each basic processing circuit in a broadcast mode;
in an alternative, the data (parts) of the matrix P may be processed data. Specifically, the main processing circuit enables the first mapping circuit to process the matrix P, so as to obtain a processed matrix P and a second identification (mask) matrix associated with the matrix P. Or the first mapping circuit of the main processing circuit processes the matrix P according to a second mask matrix associated with the pre-stored matrix P to obtain a processed matrix P. Further, the data (i.e. each part) in the processed matrix P and the identification data corresponding to the data and associated with the data in the second mask matrix are sent to one or more of the K basic processing circuits through the control circuit. When the main processing circuit sends data to the basic processing circuit, the data with absolute value larger than a preset threshold value in the processed matrix P or non-0 data can be sent to the basic processing circuit, so that the data transmission quantity is reduced.
In an alternative, each part of the matrix P may be broadcast only once to the register or on-chip buffer of each basic processing circuit, and the i-th basic processing circuit fully multiplexes the data of the matrix P obtained this time to complete the inner product operation corresponding to each row in the matrix Ai. Multiplexing in this embodiment specifically means that the basic processing circuit reuses data in calculation; for example, multiplexing the data of the matrix P means using the data of the matrix P multiple times.
In an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip buffer of each basic processing circuit for multiple times, where the ith basic processing circuit does not multiplex the data of the matrix P obtained each time, and completes the inner product operation corresponding to each row in the matrix Ai in multiple times;
in an alternative scheme, the control circuit of the main processing circuit may broadcast each part of the matrix P to the register or on-chip buffer of each basic processing circuit for multiple times, and the ith basic processing circuit performs part multiplexing on the data of the matrix P obtained each time to complete inner product operation corresponding to each row in the matrix Ai;
In one alternative, each basic processing circuit, for example the i-th basic processing circuit, calculates the inner product of the data of matrix Ai and the data of matrix P;
in step S203b, the accumulator circuit of each basic processing circuit accumulates the result of the inner product operation and transmits the accumulated result back to the main processing circuit.
Optionally, before step S203b, the inner product operator circuit of the basic processing circuit calculates the inner product of the data of the matrix S and the data of the matrix P; specific embodiments include the following.
In a specific embodiment, the basic processing circuit receives the data in the processed matrix S and the identification data associated with the data in the first mask matrix; and also receives the data in the processed matrix P. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix P according to the identification data in the received first mask matrix, and the processed data of the matrix P is obtained. Further, the basic processing circuit enables the inner product operator circuit to execute inner product operation on the received data in the processed matrix S and the data of the processed matrix P, and a result of the inner product operation is obtained.
In a specific embodiment, the basic processing circuit receives the data in the processed matrix P and the identification data associated with the data in the second mask matrix; and also receives the data in the processed matrix S. Correspondingly, the basic processing circuit enables the second mapping circuit to process the received data of the matrix S according to the identification data in the received second mask matrix, and the processed data of the matrix S is obtained. Further, the basic processing circuit enables the inner product arithmetic circuit to execute inner product operation on the received data of the processed matrix P and the data in the processed matrix S, and a result of the inner product operation is obtained.
In a specific embodiment, the basic processing circuit receives the data in the processed matrix S and the identification data associated with the data in the first mask matrix; and meanwhile, the data in the processed matrix P and the identification data associated with the data in the second mask matrix are also received. Correspondingly, the basic processing circuit enables the second mapping circuit to obtain a relation identification matrix according to the received identification data in the first mask matrix and the received identification data in the second mask matrix; and then respectively processing the received data in the matrix S and the received data in the matrix P by utilizing the identification data in the relation identification matrix to obtain the processed data of the matrix S and the processed data of the matrix P. Further, the inner product operator circuit is enabled to execute inner product operation on the data in the processed matrix S and the data of the processed matrix P, and a result of the inner product operation is obtained. For example, the ith basic processing circuit receives the matrix Ai, the identification matrix Bi associated with the Ai, the matrix P and the second identification matrix associated with the matrix P; at this time, the second mapping circuit can be started to obtain a relation identification matrix by using Bi and the second identification matrix, and then the relation identification matrix is used for processing the matrix Ai and the matrix P simultaneously or respectively to obtain the processed matrix Ai and the processed matrix P. Then, the inner product operator circuit is enabled to perform inner product operation on the processed matrix Ai and the processed matrix P.
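For the matrix-matrix case, the same relation-mask idea can be sketched as follows, under the assumption that the relation mask for output element (i, j) is the element-wise AND of row i's mask in the first mask matrix with column j's mask in the second mask matrix. Names and data are illustrative.

```python
# Sketch of masked matrix-matrix multiplication: each output element is the
# inner product of a row of S with a column of P, restricted to the
# positions kept by both masks (the relation identification data).

def masked_matmul(S, mask_s, P, mask_p):
    N = len(P[0])
    out = []
    for srow, smask in zip(S, mask_s):
        orow = []
        for j in range(N):
            pcol = [prow[j] for prow in P]          # column j of P
            pmask = [mrow[j] for mrow in mask_p]    # its mask column
            rel = [a & b for a, b in zip(smask, pmask)]
            orow.append(sum(s * p for s, p, m in zip(srow, pcol, rel) if m))
        out.append(orow)
    return out

S = [[1.0, 0.0], [2.0, 3.0]]
mask_s = [[1, 0], [1, 1]]
P = [[4.0, 0.0], [0.0, 5.0]]
mask_p = [[1, 0], [0, 1]]
result = masked_matmul(S, mask_s, P, mask_p)   # [[4.0, 0], [8.0, 15.0]]
```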
In an alternative, the basic processing circuit may transmit the partial sum of each execution of the inner product operation back to the main processing circuit for accumulation;
in an alternative, the partial sum obtained each time the basic processing circuit performs the inner product operation may be stored in a register and/or on-chip cache of the basic processing circuit and transmitted back to the main processing circuit after the accumulation is finished;
In an alternative, the partial sum obtained each time the basic processing circuit performs the inner product operation may, in some cases, be accumulated into the partial sum stored in the register and/or on-chip cache of the basic processing circuit, and, in other cases, be transmitted to the main processing circuit for accumulation, and then be transmitted back to the main processing circuit after the accumulation is finished.
The present invention also provides a chip comprising a computing device, the computing device comprising:
a main processing circuit, where the data involved in the main processing circuit may be compressed data; in an alternative embodiment, the compressed data comprises at least one input neuron or at least one weight, where each neuron of the at least one input neuron is greater than a first threshold, or each weight of the at least one weight is greater than a second threshold. The first threshold and the second threshold are user-defined and may be the same or different.
In one alternative, the main processing circuit includes a first mapping circuit;
In an alternative, the main processing circuit includes an arithmetic unit that performs data compression processing, such as a vector arithmetic unit or the like;
specifically, a data input interface is included that receives input data;
In one alternative, the received data may come from outside the neural network operation circuit device, or from part or all of the basic processing circuits of the neural network operation circuit device;
In one alternative, there may be a plurality of said data input interfaces; in particular, a data output interface may be included that outputs data;
in one alternative, the output data may be destined for outside the neural network operation circuit device, or for part or all of the basic processing circuits of the neural network operation circuit device;
in an alternative, there may be a plurality of said data output interfaces;
In one alternative, the main processing circuitry includes on-chip caches and/or registers;
in an alternative, the main processing circuit includes an operation unit, and may perform data operation;
In one alternative, the main processing circuit includes an arithmetic operation unit therein;
In an alternative, the main processing circuit includes a vector operation unit and may perform an operation on a set of data at the same time. In particular, the arithmetic and/or vector operations may be any type of operation, including but not limited to: addition, subtraction, multiplication and division of two numbers; addition, subtraction, multiplication and division of a number with a constant; exponential, power, logarithmic and various nonlinear operations on a number; comparison and logic operations on two numbers; addition, subtraction, multiplication and division of two vectors; addition, subtraction, multiplication and division of each element in a vector with a constant; exponential, power, logarithmic and various nonlinear operations on each element in a vector; and comparison and logic operations on each pair of corresponding elements in two vectors.
In one alternative, the main processing circuit includes a data rearrangement unit for transmitting data to the basic processing circuits in a certain order, or for rearranging data in place in a certain order;
In one alternative, the data arrangement order includes: transforming the dimension order of a multidimensional data block; the data arrangement order may further include: partitioning a data block for transmission to different basic processing circuits.
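A minimal Python sketch of the two rearrangement tasks just described, transforming the dimension order of a block and partitioning a block across basic processing circuits (the function names and the round-robin assignment policy are assumptions for illustration, not specified by the patent):

```python
def transpose2d(block):
    """Dimension-order transform for a 2-D data block (rows become columns)."""
    return [list(col) for col in zip(*block)]

def partition_rows(block, n_circuits):
    """Partition a block's rows round-robin for transmission to n basic circuits."""
    parts = [[] for _ in range(n_circuits)]
    for i, row in enumerate(block):
        parts[i % n_circuits].append(row)
    return parts
```

A real rearrangement unit could use any dimension permutation and any partitioning scheme; the 2-D transpose and row-wise split are only the simplest instances.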
The computing device further includes a plurality of basic processing circuits. Each basic processing circuit is used for computing the inner product of two vectors: the basic processing circuit receives two groups of numbers, multiplies corresponding elements of the two groups, and accumulates the products. The inner product result is transmitted out; depending on the position of the basic processing circuit, it may be transmitted to other basic processing circuits or directly to the main processing circuit.
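The multiply-and-accumulate behavior described above can be sketched as follows (an illustrative software model of the circuit's core operation, not the implementation):

```python
def inner_product(a, b):
    """Receive two groups of numbers, multiply corresponding elements,
    and accumulate the products -- the basic processing circuit's core operation."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc
```

For example, `inner_product([1, 2, 3], [4, 5, 6])` returns `32`.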
The data involved in the basic processing circuit may be compressed data. In an alternative embodiment, the compressed data comprises at least one input neuron or at least one weight, where each of the at least one input neuron is larger than a first threshold, or each of the at least one weight is larger than a second threshold. The first threshold and the second threshold are user-defined and may be the same or different.
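A hedged sketch of this threshold-based compression, under the assumption that "larger than the threshold" is tested on the magnitude and that surviving positions are recorded in a mask playing the role of the identification data block described later; the names are hypothetical:

```python
def compress(values, threshold):
    """Keep only values whose magnitude exceeds the threshold; the mask
    records which positions survive (one entry per original position)."""
    mask = [1 if abs(v) > threshold else 0 for v in values]
    kept = [v for v, m in zip(values, mask) if m]
    return kept, mask

def decompress(kept, mask):
    """Reconstruct the dense form, restoring zeros at pruned positions."""
    it = iter(kept)
    return [next(it) if m else 0 for m in mask]
```

Transmitting `kept` plus `mask` instead of the dense block is what lets sparse neurons and weights move between circuits cheaply.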
In one alternative, the basic processing circuit includes a second mapping circuit;
In one alternative, the basic processing circuit includes a vector operation unit that performs data compression processing;
Specifically, the basic processing circuit includes a storage unit comprising an on-chip cache and/or registers;
Specifically, the basic processing circuit includes one or more data input interfaces for receiving data;
In one alternative, two data input interfaces are included, and one or more data items may be obtained from each of them at a time;
In one alternative, the basic processing circuit may store input data received from the data input interface in registers and/or on-chip caches;
The source of the data received by the data input interface may be: the main processing circuit of the neural network operation circuit device, and/or other basic processing circuits of the neural network operation circuit device (the neural network operation circuit device has a plurality of basic processing circuits).
Specifically, the basic processing circuit includes one or more data output interfaces for transmitting output data;
In one alternative, one or more data items may be transmitted from a data output interface;
Specifically, the data transmitted through the data output interface may be: data received from the data input interface, data stored in on-chip caches and/or registers, multiplier operation results, accumulator operation results, inner product operation results, or any combination thereof.
In one alternative, three data output interfaces are included; two of them correspond respectively to the two data inputterfaces, each outputting the data received from its corresponding data input interface, and the third data output interface is responsible for outputting operation results;
Specifically, the data output interface may transmit data to: the main processing circuit of the neural network operation circuit device, and/or other basic processing circuits of the neural network operation circuit device (the neural network operation circuit device has a plurality of basic processing circuits). The data sources above and the data destinations here determine the connections of the basic processing circuits in the device.
Specifically, the basic processing circuit includes an arithmetic operation circuit, which may specifically be: one or more multiplier circuits, one or more accumulator circuits, one or more circuits that perform the inner product operation of two groups of numbers, or any combination thereof.
In one alternative, a multiplication of two numbers may be performed, and the result may be stored in on-chip caches and/or registers, or may be accumulated directly into the registers and/or on-chip caches;
In one alternative, an inner product operation of two groups of data may be performed, and the result may be stored in on-chip caches and/or registers, or may be accumulated directly into the registers and/or on-chip caches;
In one alternative, an accumulation operation of data may be performed, accumulating the data into on-chip caches and/or registers;
Specifically, the data accumulated by the accumulator circuit may be: data received from the data input interface, data stored in on-chip caches and/or registers, multiplier operation results, accumulator operation results, inner product operation results, or any combination thereof.
It should be noted that "data input interface" and "data output interface", as used in the above description of the basic processing circuits, refer to the data input and output interfaces of each basic processing circuit, not those of the entire apparatus.
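Putting the pieces together, the distribute-and-broadcast scheme this description builds toward (the main processing circuit splits the input data block into basic data blocks, broadcasts the weight data block, and each basic processing circuit returns accumulated inner products) can be sketched as follows; the row-wise split and round-robin assignment are assumptions for illustration only:

```python
def inner_product(a, b):
    """A basic processing circuit's multiply-accumulate over two groups of numbers."""
    return sum(x * y for x, y in zip(a, b))

def distribute_broadcast_matvec(input_block, weight_block, n_circuits):
    """Model the main circuit: distribute rows of the input (distribution) data
    block across basic circuits, broadcast the weight block to all of them,
    then collect and arrange the per-row results."""
    # Split the distribution data block into basic data blocks (here: rows).
    assignments = [[] for _ in range(n_circuits)]
    for i, row in enumerate(input_block):
        assignments[i % n_circuits].append((i, row))
    # Each basic processing circuit computes inner products with the broadcast block.
    result = [0] * len(input_block)
    for circuit in assignments:
        for i, row in circuit:
            result[i] = inner_product(row, weight_block)
    return result
```

In hardware the per-circuit loops run in parallel and the results travel back through the output interfaces; the sequential loop here only models the data flow.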
In one embodiment, the invention discloses a neural network computing device comprising functional units for performing all or part of the implementations provided in the method embodiments described above.
In one embodiment, the invention discloses a chip for performing all or part of the implementations provided in the method embodiments described above.
In one embodiment, the invention discloses an electronic device comprising functional units for performing all or part of the implementations provided in the method embodiments described above.
The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet, an intelligent terminal, a cell phone, a vehicle recorder, a navigator, a sensor, a camera, a server, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance, and/or a medical device.
The vehicle comprises an aircraft, a ship and/or a vehicle; the household appliances comprise televisions, air conditioners, microwave ovens, refrigerators, electric cookers, humidifiers, washing machines, electric lamps, gas cookers and range hoods; the medical device includes a nuclear magnetic resonance apparatus, a B-mode ultrasonic apparatus, and/or an electrocardiograph apparatus.
While the foregoing is directed to embodiments of the present disclosure, it is to be understood that the foregoing description is merely illustrative of the present disclosure; any changes, substitutions, alterations, and the like made without departing from the spirit and principles of the present disclosure are intended to be included within its scope of protection.

Claims (17)

1. An integrated circuit chip device, the integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits; the main processing circuit comprises a first mapping circuit, at least one circuit of the plurality of basic processing circuits comprises a second mapping circuit, and the first mapping circuit and the second mapping circuit are used for executing compression processing of each data in the neural network operation;
The main processing circuit is used for acquiring an input data block, a weight data block, and a multiplication instruction, dividing the input data block into a distribution data block according to the multiplication instruction, and dividing the weight data block into a broadcast data block; determining, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block comprises the distribution data block and/or the broadcast data block; and transmitting the processed first data block to at least one of the basic processing circuits connected with the main processing circuit according to the multiplication instruction;
The plurality of basic processing circuits are used for determining, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, executing the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmitting the operation result to the main processing circuit through the basic processing circuits connected with the main processing circuit; the second data block is a data block, determined by the basic processing circuit, that is received from the main processing circuit, and the second data block is associated with the processed first data block;
the main processing circuit is used for processing the operation result to obtain an instruction result of the multiplication instruction;
The basic processing circuit is specifically configured to perform a product operation on a basic data block and the broadcast data block to obtain a product result, accumulate the product result to obtain an operation result, and send the operation result to the main processing circuit;
The main processing circuit is used for obtaining an accumulation result after accumulating the operation results, and arranging the accumulation result to obtain the instruction result;
The input data block is: vectors or matrices;
the weight data block is: vector or matrix.
2. The integrated circuit chip device of claim 1, wherein,
The main processing circuit is specifically configured to broadcast the broadcast data block or the processed broadcast data block to the plurality of basic processing circuits at a time; or alternatively
The main processing circuit is specifically configured to divide the broadcast data block or the processed broadcast data block into a plurality of partial broadcast data blocks, and broadcast the plurality of partial broadcast data blocks to the plurality of basic processing circuits through a plurality of times.
3. The integrated circuit chip device of claim 1, wherein,
The main processing circuit is specifically configured to split the processed broadcast data block and the identification data block associated with the broadcast data block to obtain a plurality of partial broadcast data blocks and identification data blocks associated with the partial broadcast data blocks; broadcasting the plurality of partial broadcast data blocks and the identification data blocks respectively associated with the plurality of partial broadcast data blocks to the basic processing circuit through one or more times; the plurality of partial broadcast data blocks are combined to form the processed broadcast data block;
The basic processing circuit is specifically configured to start the second mapping circuit to obtain a connection identifier data block according to the identifier data block associated with the partial broadcast data block and the identifier data block associated with the basic data block; processing the partial broadcast data block and the basic data block according to the connection identification data block to obtain a processed broadcast data block and a processed basic data block; performing product operation on the processed broadcast data block and the processed basic data block;
Or the basic processing circuit is specifically configured to start the second mapping circuit to process the basic data block according to the identification data block associated with the partial broadcast data block to obtain a processed basic data block, and perform a product operation on the processed basic data block and the partial broadcast data block.
4. The integrated circuit chip device of claim 3, wherein,
The basic processing circuit is specifically configured to perform one product operation on the partial broadcast data block and the basic data block to obtain a product result, accumulate the product result to obtain a partial operation result, and send the partial operation result to the main processing circuit; or alternatively
The basic processing circuit is specifically configured to reuse the partial broadcast data block n times to perform inner product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results, respectively accumulate the n partial processing results to obtain n partial operation results, and send the n partial operation results to the main processing circuit, where n is an integer greater than or equal to 2.
5. The integrated circuit chip device of claim 1, wherein the integrated circuit chip device further comprises: a branch processing circuit disposed between the main processing circuit and the at least one base processing circuit;
The branch processing circuit is configured to forward data between the main processing circuit and the at least one base processing circuit.
6. The integrated circuit chip device of any one of claims 1-5, wherein when the first data block comprises a distribution data block and a broadcast data block,
The main processing circuit is specifically configured to start the first mapping circuit to process the distribution data block and the broadcast data block to obtain a processed distribution data block, an identification data block associated with the distribution data block, a processed broadcast data block, and an identification data block associated with the broadcast data block; splitting the processed distribution data block and the identification data block associated with the distribution data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distributing the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks and the identification data blocks associated with the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;
The basic processing circuit is used for starting the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the basic data block and the identification data block associated with the broadcast data block; and processing the basic data block and the broadcast data block according to the connection identification data block, performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and transmitting the operation result to the main processing circuit.
7. The integrated circuit chip device of any one of claims 1-5, wherein when the first data block comprises a distribution data block,
The main processing circuit is specifically configured to start the first mapping circuit to process the distribution data block to obtain a processed distribution data block and an identification data block associated with the distribution data block, or start the first mapping circuit to process the distribution data block according to a pre-stored identification data block associated with the distribution data block to obtain a processed distribution data block; splitting the processed distributed data blocks and the identification data blocks associated with the distributed data blocks to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distributing the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the identification data blocks; broadcasting the broadcast data block to a base processing circuit connected thereto;
The basic processing circuit is used for starting the second mapping circuit to process the broadcast data block according to the identification data block related to the basic data block, performing product operation on the processed broadcast data block and the basic data block to obtain an operation result, and sending the operation result to the main processing circuit.
8. The integrated circuit chip device of any one of claims 1-5, wherein when the first data block comprises a broadcast data block,
The main processing circuit is specifically configured to start the first mapping circuit to process the broadcast data block to obtain a processed broadcast data block and an identification data block associated with the broadcast data block, or start the first mapping circuit to process the broadcast data block according to a pre-stored identification data block associated with the broadcast data block to obtain a processed broadcast data block; split the distribution data block to obtain a plurality of basic data blocks; distribute the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcast the processed broadcast data block and the identification data block associated with the broadcast data block to the basic processing circuits connected thereto;
The basic processing circuit is used for starting the second mapping circuit to process the basic data block according to the identification data block associated with the broadcast data block to obtain a processed basic data block; and performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and sending the operation result to the main processing circuit.
9. A method of operation of a neural network, the method being applied to an integrated circuit chip device, the integrated circuit chip device comprising: a main processing circuit and a plurality of basic processing circuits, the main processing circuit comprising a first mapping circuit, at least one of the plurality of basic processing circuits comprising a second mapping circuit, the method comprising:
the first mapping circuit and the second mapping circuit execute compression processing of each data in the neural network operation;
The main processing circuit acquires an input data block, a weight data block, and a multiplication instruction, divides the input data block into a distribution data block according to the multiplication instruction, and divides the weight data block into a broadcast data block; determines, according to the operation control of the multiplication instruction, to start the first mapping circuit to process a first data block, obtaining a processed first data block, where the first data block comprises the distribution data block and/or the broadcast data block; and transmits the processed first data block to at least one of the basic processing circuits connected with the main processing circuit according to the multiplication instruction;
The plurality of basic processing circuits determine, according to the operation control of the multiplication instruction, whether to start the second mapping circuit to process a second data block, execute the operation in the neural network in a parallel manner according to the processed second data block to obtain an operation result, and transmit the operation result to the main processing circuit through the basic processing circuits connected with the main processing circuit; the second data block is a data block, determined by the basic processing circuit, that is received from the main processing circuit, and the second data block is associated with the processed first data block;
The main processing circuit processes the operation result to obtain an instruction result of the multiplication instruction;
The basic processing circuit performs product operation on the basic data block and the broadcast data block to obtain a product result, accumulates the product result to obtain an operation result, and sends the operation result to the main processing circuit;
after the operation results are accumulated, an accumulated result is obtained, and the main processing circuit arranges the accumulated result to obtain the instruction result;
The input data block is: vectors or matrices;
the weight data block is: vector or matrix.
10. The method according to claim 9, wherein,
The main processing circuit broadcasts the broadcast data block or the processed broadcast data block to the plurality of basic processing circuits at one time; or alternatively
The main processing circuit divides the broadcast data block or the processed broadcast data block into a plurality of partial broadcast data blocks, and broadcasts the plurality of partial broadcast data blocks to the plurality of basic processing circuits through a plurality of times.
11. The method according to claim 9, wherein,
The main processing circuit splits the processed broadcast data blocks and the identification data blocks associated with the broadcast data blocks to obtain a plurality of partial broadcast data blocks and the identification data blocks respectively associated with the partial broadcast data blocks; broadcasting the plurality of partial broadcast data blocks and the identification data blocks respectively associated with the plurality of partial broadcast data blocks to the basic processing circuit through one or more times; the plurality of partial broadcast data blocks are combined to form the processed broadcast data block;
The basic processing circuit starts the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the partial broadcast data block and the identification data block associated with the basic data block; processing the partial broadcast data block and the basic data block according to the connection identification data block to obtain a processed broadcast data block and a processed basic data block; performing product operation on the processed broadcast data block and the processed basic data block;
Or the basic processing circuit starts the second mapping circuit to process the basic data block according to the identification data block associated with the partial broadcast data block to obtain a processed basic data block, and performs a product operation on the processed basic data block and the partial broadcast data block.
12. The method according to claim 11, wherein,
The basic processing circuit performs one product operation on the partial broadcast data block and the basic data block to obtain a product result, accumulates the product result to obtain a partial operation result, and sends the partial operation result to the main processing circuit; or alternatively
The basic processing circuit reuses the partial broadcast data block n times to perform inner product operations of the partial broadcast data block with n basic data blocks to obtain n partial processing results, respectively accumulates the n partial processing results to obtain n partial operation results, and transmits the n partial operation results to the main processing circuit, wherein n is an integer greater than or equal to 2.
13. The method of claim 9, wherein the integrated circuit chip device further comprises: a branch processing circuit disposed between the main processing circuit and the at least one base processing circuit;
The branch processing circuit forwards data between the main processing circuit and at least one base processing circuit.
14. The method according to any of claims 9-13, wherein when the first data block comprises a distribution data block and a broadcast data block,
The main processing circuit starts the first mapping circuit to process the distribution data block and the broadcast data block to obtain a processed distribution data block and an identification data block related to the distribution data block, a processed broadcast data block and an identification data block related to the broadcast data block; splitting the processed distribution data block and the identification data block associated with the distribution data block to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distributing the plurality of basic data blocks and the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the basic data blocks, and broadcasting the broadcast data blocks and the identification data blocks associated with the broadcast data blocks to the basic processing circuit connected with the basic processing circuit;
The basic processing circuit starts the second mapping circuit to obtain a connection identification data block according to the identification data block associated with the basic data block and the identification data block associated with the broadcast data block; and processing the basic data block and the broadcast data block according to the connection identification data block, performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and transmitting the operation result to the main processing circuit.
15. The method according to any of claims 9-13, wherein, when the first data block comprises a distribution data block,
The main processing circuit starts the first mapping circuit to process the distribution data block to obtain a processed distribution data block and an identification data block related to the distribution data block, or starts the first mapping circuit to process the distribution data block according to the pre-stored identification data block related to the distribution data block to obtain a processed distribution data block; splitting the processed distributed data blocks and the identification data blocks associated with the distributed data blocks to obtain a plurality of basic data blocks and the identification data blocks respectively associated with the basic data blocks; distributing the identification data blocks respectively associated with the plurality of basic data blocks to a basic processing circuit connected with the identification data blocks; broadcasting the broadcast data block to a base processing circuit connected thereto;
the basic processing circuit starts the second mapping circuit to process the broadcast data block according to the identification data block related to the basic data block, performs product operation on the processed broadcast data block and the basic data block to obtain an operation result, and sends the operation result to the main processing circuit.
16. The method according to any of claims 9-13, wherein, when the first data block comprises a broadcast data block,
The main processing circuit starts the first mapping circuit to process the broadcast data block to obtain a processed broadcast data block and an identification data block associated with the broadcast data block, or starts the first mapping circuit to process the broadcast data block according to a pre-stored identification data block associated with the broadcast data block to obtain a processed broadcast data block; splits the distribution data block to obtain a plurality of basic data blocks; distributes the plurality of basic data blocks to the basic processing circuits connected thereto; and broadcasts the processed broadcast data block and the identification data block associated with the broadcast data block to the basic processing circuits connected thereto;
The basic processing circuit starts the second mapping circuit to process the basic data block according to the identification data block associated with the broadcast data block to obtain a processed basic data block; and performing product operation on the processed basic data block and the processed broadcast data block to obtain an operation result, and sending the operation result to the main processing circuit.
17. A chip, characterized in that the chip is integrated with the device according to any of claims 1-8.
CN202010617209.8A 2018-02-27 2018-02-27 Integrated circuit chip device and related products Active CN111767998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010617209.8A CN111767998B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related products

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010617209.8A CN111767998B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related products
CN201810164317.7A CN110197268B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201810164317.7A Division CN110197268B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product

Publications (2)

Publication Number Publication Date
CN111767998A CN111767998A (en) 2020-10-13
CN111767998B true CN111767998B (en) 2024-05-14

Family

ID=67751070

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010617209.8A Active CN111767998B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related products
CN201810164317.7A Active CN110197268B (en) 2018-02-27 2018-02-27 Integrated circuit chip device and related product


Country Status (1)

Country Link
CN (2) CN111767998B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767998B (en) * 2018-02-27 2024-05-14 Shanghai Cambricon Information Technology Co Ltd Integrated circuit chip device and related products
CN117974417A (en) * 2024-03-28 2024-05-03 Tencent Technology (Shenzhen) Co., Ltd. AI chip, electronic device, and image processing method

Citations (6)

Publication number Priority date Publication date Assignee Title
CN105844330A (en) * 2016-03-22 2016-08-10 Huawei Technologies Co., Ltd. Data processing method of neural network processor and neural network processor
CN106447034A (en) * 2016-10-27 2017-02-22 Institute of Computing Technology, Chinese Academy of Sciences Neural network processor based on data compression, design method and chip
WO2017185418A1 (en) * 2016-04-29 2017-11-02 Beijing Zhongke Cambricon Technology Co., Ltd. Device and method for performing neural network computation and matrix/vector computation
CN107316078A (en) * 2016-04-27 2017-11-03 Beijing Zhongke Cambricon Technology Co., Ltd. Apparatus and method for performing artificial neural network self-learning computation
CN107609641A (en) * 2017-08-30 2018-01-19 Tsinghua University Sparse neural network framework and its implementation
CN110197268B (en) * 2018-02-27 2020-08-04 Shanghai Cambricon Information Technology Co Ltd Integrated circuit chip device and related product

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN104463324A (en) * 2014-11-21 2015-03-25 Changsha Masha Electronic Technology Co., Ltd. Convolution neural network parallel processing method based on large-scale high-performance cluster
CN106598545B (en) * 2015-10-08 2020-04-14 Shanghai Zhaoxin Integrated Circuit Co., Ltd. Processor and method for communicating shared resources and non-transitory computer usable medium
CN106991478B (en) * 2016-01-20 2020-05-08 Cambricon Technologies Corporation Limited Apparatus and method for performing artificial neural network reverse training
CN106126481B (en) * 2016-06-29 2019-04-12 Huawei Technologies Co., Ltd. Computing system and electronic equipment
CN107301456B (en) * 2017-05-26 2020-05-12 National University of Defense Technology Deep neural network multi-core acceleration implementation method based on vector processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A general data compression scheme using BP neural networks; Xue Hui; Wu Yue; Liu Xiaoshuang; Zhang Yi; Microcomputer Information (Issue 25); full text *

Also Published As

Publication number Publication date
CN110197268B (en) 2020-08-04
CN110197268A (en) 2019-09-03
CN111767998A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN110197270B (en) Integrated circuit chip device and related product
CN110909872B (en) Integrated circuit chip device and related products
CN109993291B (en) Integrated circuit chip device and related product
CN111767998B (en) Integrated circuit chip device and related products
CN111160541B (en) Integrated circuit chip device and related products
CN109993292B (en) Integrated circuit chip device and related product
CN110197274B (en) Integrated circuit chip device and related product
CN111767996B (en) Integrated circuit chip device and related products
CN110197271B (en) Integrated circuit chip device and related product
CN109993290B (en) Integrated circuit chip device and related product
CN111091189B (en) Integrated circuit chip device and related products
US11704544B2 (en) Integrated circuit chip device and related product
CN110197266B (en) Integrated circuit chip device and related product
CN110197263B (en) Integrated circuit chip device and related product
CN110197275B (en) Integrated circuit chip device and related product
CN110197265B (en) Integrated circuit chip device and related product
CN110197267B (en) Neural network processor board card and related product
CN111767997B (en) Integrated circuit chip device and related products
CN110197273B (en) Integrated circuit chip device and related product
US11734548B2 (en) Integrated circuit chip device and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant