WO2021082722A1 - Computing device and method, and related product

Info

Publication number
WO2021082722A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
transformation
sub
result
unit
Prior art date
Application number
PCT/CN2020/113160
Other languages
English (en)
Chinese (zh)
Inventor
张英男
曾洪博
张尧
刘少礼
黄迪
周诗怡
张曦珊
刘畅
郭家明
高钰峰
Original Assignee
中科寒武纪科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2021082722A1

Classifications

    • G06F17/15 Correlation function computation including computation of convolution operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a computing device, a computing method, and related products.
  • Popular deep learning networks contain a large number of matrix multiplication operations. A typical example multiplies the matrix w by the vector x, adds the bias vector b, and then applies the activation function to the resulting vector (that is, applies the activation function to each element of the vector).
  • The complexity of the matrix-vector multiplication is far higher than that of the subsequent bias addition and activation operations, so an efficient implementation of the former has the greatest impact on the overall operation.
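  • As a minimal sketch of the layer just described (NumPy; ReLU is chosen here purely as an illustrative activation, the disclosure does not fix one):

```python
import numpy as np

# y = act(w @ x + b): matrix-vector multiplication, bias addition,
# element-wise activation.
w = np.random.rand(4, 8)          # weight matrix
x = np.random.rand(8)             # input vector
b = np.random.rand(4)             # bias vector
y = np.maximum(w @ x + b, 0.0)    # ReLU applied to each element
```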
  • The multiplication cost of the original convolution is high, which gives the deep learning network a low energy efficiency ratio and a long computation time.
  • The present disclosure provides a computing device, a computing method, and related products to improve the energy efficiency ratio and computation speed of a deep learning network on a hardware architecture and thereby improve the performance of the deep learning network.
  • In a first aspect, the present disclosure provides a computing device, including: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit;
  • the master control unit is configured to send a first control instruction, where the first control instruction instructs the master operation unit to perform the forward transformation operation in a Winograd convolution operation and instructs the slave control unit to send a second control instruction, where the second control instruction instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation;
  • the storage unit is configured to store data used for the Winograd convolution operation;
  • the master operation unit is configured to, in response to the first control instruction, extract data from the storage unit and perform the forward transformation operation in the Winograd convolution operation to obtain a forward transformation result, where the forward transformation operation is decomposed into a summation operation;
  • the slave operation unit is configured to, in response to the second control instruction, obtain the forward transformation result from the master operation unit, extract data from the storage unit, and perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation to obtain the Winograd convolution result, where the inverse transformation operation is decomposed into a summation operation.
  • In a second aspect, the present disclosure provides a neural network computing device. The neural network computing device includes one or more computing devices as described in the first aspect, and is configured to obtain data to be computed and control information from other processing devices, execute the specified neural network operation, and transfer the execution result to other processing devices through an I/O interface.
  • When the neural network computing device includes multiple computing devices, the multiple computing devices may be connected through a specific structure to transmit data.
  • Specifically, the multiple computing devices are interconnected and transmit data through a PCIe (peripheral component interconnect express) bus to support larger-scale neural network operations; the multiple computing devices may share the same control system or have their own control systems, may share memory or have their own memories, and may be interconnected in an arbitrary interconnection topology.
  • In a third aspect, the present disclosure provides an artificial intelligence chip including the computing device according to the first aspect.
  • In a fourth aspect, the present disclosure provides an electronic device including the artificial intelligence chip as described in the third aspect.
  • In a fifth aspect, the present disclosure provides a board card, including: a storage device, an interface device, a control device, and the artificial intelligence chip as described in the third aspect;
  • the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
  • the storage device is used to store data;
  • the interface device is used to implement data transmission between the artificial intelligence chip and external equipment;
  • the control device is used to monitor the state of the artificial intelligence chip.
  • In a sixth aspect, the present disclosure provides a computing method applied to a computing device, the computing device comprising: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit; the method includes:
  • the master control unit sends a first control instruction, where the first control instruction instructs the master operation unit to perform the forward transformation operation in a Winograd convolution operation and instructs the slave control unit to send a second control instruction, where the second control instruction instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation;
  • the storage unit stores data used for the Winograd convolution operation;
  • in response to the first control instruction, the master operation unit extracts data from the storage unit and performs the forward transformation operation in the Winograd convolution operation to obtain a forward transformation result, where the forward transformation operation is decomposed into a summation operation;
  • in response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the storage unit, and performs the multiply-add operation and the inverse transformation operation in the Winograd convolution operation to obtain the Winograd convolution result, where the inverse transformation operation is decomposed into a summation operation.
  • In the computing device, method, and related products provided by the present disclosure, the computing device is provided with a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit. The master control unit sends a first control instruction instructing the master operation unit to perform the forward transformation operation in the Winograd convolution operation and instructing the slave control unit to send a second control instruction, which in turn instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation. The storage unit stores the data used for the Winograd convolution operation. In response to the first control instruction, the master operation unit extracts data from the storage unit and performs the forward transformation operation to obtain the forward transformation result, the forward transformation operation being decomposed into a summation operation. In response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the storage unit, and performs the multiply-add operation and the inverse transformation operation to obtain the Winograd convolution result, the inverse transformation operation being decomposed into a summation operation. This can effectively improve the energy efficiency ratio and computation speed of the deep learning network on the hardware architecture, and improve the performance of the deep learning network.
  • The present disclosure can be applied to the following scenarios (including but not limited to): data processing; electronic products such as robots, computers, printers, scanners, phones, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, cameras, cloud servers, video cameras, camcorders, projectors, watches, earphones, mobile storage, and wearable devices; transportation tools such as aircraft, ships, and vehicles; household appliances such as TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and various medical equipment including magnetic resonance imaging, B-mode ultrasound, electrocardiographs, and the like.
  • FIG. 1 is a schematic structural diagram of a computing device provided by an embodiment of this application;
  • FIG. 2 is a schematic structural diagram of a computing device provided by another embodiment of this application;
  • FIG. 3 is a flow sequence diagram of a computing device provided by another embodiment of this application;
  • FIG. 4 is a schematic flowchart of a computing method provided by another embodiment of this application;
  • FIG. 5 is a schematic diagram of a processing system for a computing method according to an embodiment of the present disclosure;
  • FIG. 6 is a structural block diagram of a board card according to an embodiment of the present disclosure.
  • The term "if" can be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context.
  • Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
  • A convolution operation opens, starting from the upper left corner of the image, an active window with the same size as the template (that is, the convolution kernel). The active window corresponds to a window image; the window image and the corresponding pixels of the template are multiplied element by element and then summed, and the result is used as the first pixel value of the new image produced by the convolution operation.
  • The active window then moves one column to the right; the window image corresponding to the new position is again multiplied with the template element by element and summed, and the result is used as the second pixel value of the new image, and so on.
  • Winograd convolution is a convolution acceleration method based on a polynomial interpolation algorithm. The two inputs of the convolution operation, a first target matrix and a second target matrix, are each subjected to the Winograd forward transformation; the transformed first and second target matrices are then multiplied element-wise; and finally the Winograd inverse transformation is applied to the element-wise product, yielding a convolution result equivalent to that of the original convolution operation.
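  • As a minimal illustration of this three-step scheme, the following NumPy sketch implements the classic 1-D F(2,3) Winograd convolution; the B, G, and A transformation matrices are the standard ones from the literature, not taken from the patent text:

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap convolution with 4 multiplications
# instead of 6.
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile (first target matrix)
g = np.array([0.5, 1.0, 1.5])        # kernel (second target matrix)

V = B_T @ d        # forward transformation of the input
U = G @ g          # forward transformation of the kernel
M = U * V          # element-wise multiplication
y = A_T @ M        # inverse transformation -> 2 convolution outputs

ref = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(y, ref)           # matches the direct (original) convolution
```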
  • A convolutional neural network model is a type of feedforward neural network that involves convolution computation and has a deep structure; it is one of the representative models of deep learning. In the convolutional layers, fully connected layers, and other network layers of a convolutional neural network model, convolution operations need to be performed on neurons and convolution kernels to obtain feature data.
  • FIG. 1 is a schematic structural diagram of a computing device provided by an embodiment of this application. The device in this embodiment may include: a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit.
  • The master control unit is used to send a first control instruction, where the first control instruction instructs the master operation unit to perform the forward transformation operation in the Winograd convolution operation and instructs the slave control unit to send a second control instruction, where the second control instruction instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation.
  • The storage unit is used to store data used for the Winograd convolution operation.
  • The master operation unit is used to, in response to the first control instruction, extract data from the storage unit and perform the forward transformation operation in the Winograd convolution operation to obtain the forward transformation result; the forward transformation operation is decomposed into a summation operation.
  • In a working flow, the master control unit and the slave control unit receive an operation control signal and decode it to obtain the corresponding first control instruction and second control instruction; the master control unit then sends the first control instruction to the master operation unit, and the slave control unit sends the second control instruction to the slave operation unit. In response to the first control instruction, the master operation unit extracts data from the storage unit and performs the forward transformation operation in the Winograd convolution operation to obtain the forward transformation result. In response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the storage unit, and performs the multiply-add operation and the inverse transformation operation in the Winograd convolution operation to obtain the Winograd convolution result.
  • In summary, the computing device is provided with a master control unit, a slave control unit, a storage unit, a master operation unit, and a slave operation unit. The master control unit sends a first control instruction instructing the master operation unit to perform the forward transformation operation in the Winograd convolution operation and instructing the slave control unit to send a second control instruction, which instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation. The storage unit stores the data used for the Winograd convolution operation. In response to the first control instruction, the master operation unit extracts data from the storage unit and performs the forward transformation operation to obtain the forward transformation result; in response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the storage unit, and performs the multiply-add operation and the inverse transformation operation. The forward transformation operation of the master operation unit and the inverse transformation operation of the slave operation unit are both decomposed into summation operations, which can effectively improve the energy efficiency ratio and computation speed of the deep learning network on the hardware architecture and improve the performance of the deep learning network.
  • FIG. 2 is a schematic structural diagram of a computing device provided by another embodiment of this application. The device in this embodiment may include: a master control unit, a slave control unit, a master storage unit, a slave storage unit, a master operation unit, and a slave operation unit.
  • The master storage unit is used to receive and store the feature data; the slave storage unit is used to receive and store the forward transformation data of the weight data.
  • The master operation unit obtains the feature data from the master storage unit and decomposes the feature data into multiple sub-tensors; it performs transformation operations on the multiple sub-tensors and sums them, and obtains the forward transformation data of the feature data according to the result of the summation operation.
  • Specifically, the master operation unit decomposes the feature data into multiple sub-tensors, where the feature data is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and the non-zero element of each sub-tensor equals the element at the corresponding position in the feature data.
  • The master operation unit obtains the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1. The non-zero element value of the sub-tensor is used as a coefficient and multiplied by the Winograd transformation result of the corresponding meta-sub-tensor to obtain the Winograd transformation result of the sub-tensor; the Winograd transformation results of the multiple sub-tensors are then added to obtain the forward transformation data of the feature data.
  • For each sub-tensor, the master operation unit left-multiplies the meta-sub-tensor corresponding to the sub-tensor by a left-multiplication matrix and right-multiplies it by a right-multiplication matrix to obtain the Winograd transformation result; both the left-multiplication matrix and the right-multiplication matrix are determined by the scale of the sub-tensor and the Winograd transformation type, where the transformation type includes the forward transformation type and the inverse transformation type.
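  • A minimal NumPy sketch of this decomposition for a 4x4 tile (B is assumed to be the F(2,3) forward-transformation matrix from the earlier sketch; by linearity, the coefficient-scaled meta-sub-tensor transforms sum to the full transform, so only scaled additions remain):

```python
import numpy as np

B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
B = B_T.T

def forward_by_subtensors(d):
    """Forward transformation of a 4x4 tile via sub-tensor decomposition."""
    result = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if d[i, j] == 0:
                continue                    # one sub-tensor per non-zero element
            e = np.zeros((4, 4))
            e[i, j] = 1.0                   # meta-sub-tensor
            meta = B_T @ e @ B              # precomputable; entries in {0, +1, -1}
            result += d[i, j] * meta        # coefficient x meta transform
    return result

d = np.arange(16, dtype=float).reshape(4, 4)
assert np.allclose(forward_by_subtensors(d), B_T @ d @ B)   # full transform
```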
  • The slave operation unit may also store the Winograd convolution result in a preset address space of the storage unit. In this way, directional storage of the calculation results is realized and the storage space is fully utilized.
  • In some embodiments, the master operation unit is communicatively connected with multiple slave operation units, and different slave operation units are responsible for processing the results of different forward transformation operations. This design can improve computing power and computing efficiency and enables asynchronous computation.
  • The slave operation unit obtains the forward transformation data of the weight data from the slave storage unit and performs element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain the multiplication result; it then decomposes the multiplication result into multiple sub-tensors, performs transformation operations on the multiple sub-tensors, and sums them to obtain the Winograd convolution result.
  • Specifically, the slave operation unit decomposes the multiplication result into multiple sub-tensors, where the multiplication result is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and the non-zero element of each sub-tensor equals the element at the corresponding position in the multiplication result.
  • The slave operation unit obtains the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; the non-zero element value is used as a coefficient and multiplied by the Winograd transformation result of the corresponding meta-sub-tensor to obtain the Winograd transformation result of the sub-tensor, and the Winograd transformation results of the multiple sub-tensors are added to obtain the Winograd convolution result. A sketch of this slave-unit flow follows below.
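  • Continuing the previous sketches under the same F(2,3) assumption (A_T is the standard inverse-transformation matrix, not taken from the patent), the element-wise multiplication followed by the decomposed inverse transformation might look like this:

```python
import numpy as np

A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)
A = A_T.T

def inverse_by_subtensors(m):
    """Inverse transformation of a 4x4 element-wise product by decomposition.

    m is the sum of sub-tensors with one non-zero element each; each term
    contributes its element value times the precomputed transform of the
    corresponding meta-sub-tensor, so only scaled additions remain."""
    out = np.zeros((2, 2))
    for i in range(4):
        for j in range(4):
            if m[i, j] == 0:
                continue
            e = np.zeros((4, 4))
            e[i, j] = 1.0                    # meta-sub-tensor
            out += m[i, j] * (A_T @ e @ A)   # coefficient x meta transform
    return out

U = np.random.rand(4, 4)   # forward-transformed weight tile
V = np.random.rand(4, 4)   # forward-transformed feature tile
y = inverse_by_subtensors(U * V)            # element-wise multiply, then inverse
assert np.allclose(y, A_T @ (U * V) @ A)    # matches the full inverse transform
```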
  • In another embodiment, the master storage unit is used to receive and store the feature data, and the slave storage unit is used to receive and store the weight data itself.
  • As before, the master operation unit obtains the feature data from the master storage unit, decomposes it into multiple sub-tensors, performs transformation operations on the sub-tensors and sums them, and obtains the forward transformation data of the feature data according to the result of the summation operation. The decomposition rule is the same: the feature data is the sum of the sub-tensors, the number of sub-tensors equals the number of non-zero elements, each sub-tensor contains a single non-zero element equal to the element at the corresponding position in the feature data, and each sub-tensor's Winograd transformation result is its non-zero element value multiplied, as a coefficient, by the Winograd transformation result of its meta-sub-tensor, which is obtained by left-multiplying the meta-sub-tensor by the left-multiplication matrix and right-multiplying it by the right-multiplication matrix, both determined by the scale of the sub-tensor and the Winograd transformation type (forward or inverse).
  • In this configuration, the slave operation unit additionally obtains the weight data from the slave storage unit, decomposes the weight data into multiple sub-tensors, performs transformation operations on them and sums them, and obtains the forward transformation data of the weight data according to the result of the summation operation. It then performs element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain the multiplication result, decomposes the multiplication result into multiple sub-tensors, and transforms and sums them, following the same decomposition rule as above, to obtain the Winograd convolution result.
  • In yet another embodiment, the master storage unit is used to receive and store both the feature data and the weight data, and the slave storage unit is used to receive and store the forward transformation data of the weight data.
  • The master operation unit obtains the feature data from the master storage unit and computes its forward transformation data by the same sub-tensor decomposition as above. In addition, the master operation unit decomposes the weight data into multiple sub-tensors, performs transformation operations on them and sums them, obtains the forward transformation data of the weight data according to the result of the summation operation, and sends the forward transformation data of the weight data to the slave storage unit.
  • The slave operation unit obtains the forward transformation data of the weight data from the slave storage unit and performs element-wise multiplication with the forward transformation data of the feature data to obtain the multiplication result; it then decomposes the multiplication result into multiple sub-tensors and transforms and sums them, again by the same decomposition rule, to obtain the Winograd convolution result.
  • In some embodiments, the master operation unit includes a main processing module and a cache. The main processing module is used to, in response to the first control instruction, extract data from the storage unit and perform the forward transformation operation in the Winograd convolution operation to obtain the forward transformation result; the cache is used to store the forward transformation result.
  • The forward transformation results may be sent to the slave operation unit only once the results stored in the cache have accumulated to a preset number. This design batches the data processed by the slave operation unit and avoids keeping the slave operation unit permanently in the computing state, as sketched below.
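  • A minimal sketch of this buffering behaviour (class, method, and parameter names are illustrative, not from the patent):

```python
from collections import deque

class ForwardResultCache:
    """Accumulate forward transformation results and flush them to the
    slave operation unit only once a preset number has been collected."""

    def __init__(self, preset_count, send_to_slave):
        self.preset_count = preset_count
        self.send_to_slave = send_to_slave   # callback into the slave unit
        self.buffer = deque()

    def push(self, forward_result):
        self.buffer.append(forward_result)
        if len(self.buffer) >= self.preset_count:
            while self.buffer:               # flush the whole batch in turn
                self.send_to_slave(self.buffer.popleft())
```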
  • The operation processes of the master operation unit and the slave operation unit run in parallel.
  • The slave operation unit performs element-wise multiplication between each element position of the forward transformation data of the feature data and the corresponding element position of the forward transformation data of the weight data, until the product for every element position has been calculated, thereby obtaining the multiplication result.
  • The feature data stored in the master storage unit is divided into multiple pieces of first data used for the Winograd convolution operation, the size of the first data being determined according to the size of the convolution kernel; the forward transformation data of the weight data stored in the slave storage unit is divided into multiple pieces of second data used for the Winograd convolution operation, the size of the second data being determined according to the size of the first data.
  • In response to the first control instruction sent by the master control unit, the main processing module of the master operation unit sequentially obtains the first data from the master storage unit, performs the forward transformation operation on the first data to obtain its forward transformation result, and stores that result in the cache. When the forward transformation results of the first data in the cache reach the preset number, the main processing module sends them in turn to the slave operation unit, which responds to the second control instruction as described below.
  • FIG. 3 is a flow sequence diagram of a computing device provided by another embodiment of this application. As shown in FIG. 3, the slave functional unit and the main processing module are the operation modules, and the cache, the main processing memory, and the slave processing memory are the storage modules.
  • The input data pre-stored in the main processing memory and the slave processing memory are data after segmentation. Here n represents the number of weight data blocks divided along the Cout direction, and res(0,0)...(1,0) represents the output calculation results.
  • bd(i,j) represents the i-th data block along the height and width directions and the j-th data block along the Cin direction after segmentation of bottom_data; the size of a data block is 16*16*512 bit, that is, the size of syn_reuse_iter kernels, where syn_reuse_iter represents the number of times a weight is reused and a kernel is the most basic unit of the Winograd transformation.
  • Wino_bd(i,j) represents the bottom_data block in the Winograd domain after the Winograd transformation. Wino_w(i,j) represents the i-th data block along the Cout direction and the j-th data block along the Cin direction of the transformed Winograd-domain weights; this data block size is 64*16*512 bit, corresponding to the feature data block.
  • For a block bd(i,j), the main processing module computes the bottom_data transformation result Wino_bd(i1,j1) and sends it to the slave functional unit, where it is multiplied element-wise with the corresponding Wino_w(i2,j2) and inverse-transformed to obtain the operation result.
  • In FIG. 3, "cache To slave functional unit" means that the cache sends data blocks to the slave functional unit, and "slave processing memory To slave functional unit" means that data blocks are sent from the slave processing memory to the slave functional unit; the change in the numbering of the sent data blocks in these two flows reflects the data reuse relationship. In addition, the slave functional unit sends data blocks to the main processing memory at intervals; that is, data that can still be accumulated along the Cin direction is temporarily kept in the slave functional unit.
  • The innermost dimension of the loop is Cin; that is, the transformed blocks are processed along the Cin direction, and when an output result is buffered in the slave functional unit it is not output immediately, but only after the accumulation along the Cin direction is complete.
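  • A minimal sketch of this accumulate-then-output order (tile shapes follow the earlier F(2,3) sketches and are illustrative only):

```python
import numpy as np

A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)

def accumulate_over_cin(wino_bd_blocks, wino_w_blocks):
    """Innermost loop over Cin: element-wise products accumulate in the
    Winograd domain; the inverse transform (the output) runs only once
    the accumulation along Cin is complete."""
    acc = np.zeros((4, 4))
    for v, u in zip(wino_bd_blocks, wino_w_blocks):   # one block per Cin slice
        acc += u * v                                  # element-wise multiply-add
    return A_T @ acc @ A_T.T                          # single inverse transform

cin = 16
feats = [np.random.rand(4, 4) for _ in range(cin)]    # Wino_bd blocks
wgts = [np.random.rand(4, 4) for _ in range(cin)]     # Wino_w blocks
out = accumulate_over_cin(feats, wgts)                # 2x2 output tile
```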
  • The bottom_data transformation results of the main functional unit do not need to wait until the whole feature data block has been transformed before being sent to the slave functional unit, and the inverse transformation results of the slave functional unit can likewise be output as soon as part of the operation is completed, without waiting for the entire kernel to finish its transformation.
  • The above design reduces the latency of the data operations and increases the operation speed.
  • In summary, this computing device is provided with a master control unit, a slave control unit, a master storage unit, a slave storage unit, a master operation unit, and a slave operation unit. The master control unit sends a first control instruction instructing the master operation unit to perform the forward transformation operation in the Winograd convolution operation and instructing the slave control unit to send a second control instruction, which instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation. The storage units store the data used for the Winograd convolution operation. In response to the first control instruction, the master operation unit extracts data from the master storage unit and performs the forward transformation operation to obtain the forward transformation result; in response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the slave storage unit, and performs the multiply-add operation and the inverse transformation operation. Both the forward transformation operation of the master operation unit and the inverse transformation operation of the slave operation unit are decomposed into summation operations, which can effectively improve the energy efficiency ratio and computation speed of the deep learning network on the hardware architecture and improve the performance of the deep learning network.
  • The present disclosure further provides a neural network computing device. The neural network computing device includes one or more computing devices as shown in FIG. 1 and FIG. 2, and is used to obtain data to be computed and control information from other processing devices, execute the specified neural network operations, and transmit the execution results to other processing devices through the I/O interface. When the neural network computing device includes multiple computing devices, the multiple computing devices may be connected through a specific structure to transmit data; specifically, they are interconnected and transmit data through a PCIe bus to support larger-scale neural network operations. The multiple computing devices may share the same control system or have their own control systems, may share memory or have their own memories, and may be interconnected in an arbitrary interconnection topology.
  • FIG. 4 is a schematic flowchart of a computing method provided by another embodiment of this application. As shown in FIG. 4, the method in this embodiment is applied to the computing device shown in FIG. 1 and FIG. 2, and may include:
  • Step S101: the master control unit sends a first control instruction, where the first control instruction instructs the master operation unit to perform the forward transformation operation in the Winograd convolution operation and instructs the slave control unit to send a second control instruction, where the second control instruction instructs the slave operation unit to perform the multiply-add operation and the inverse transformation operation in the Winograd convolution operation.
  • Step S102: in response to the first control instruction, the master operation unit extracts data from the storage unit and performs the forward transformation operation in the Winograd convolution operation to obtain the forward transformation result, where the forward transformation operation is decomposed into a summation operation.
  • Step S103: in response to the second control instruction, the slave operation unit obtains the forward transformation result from the master operation unit, extracts data from the storage unit, and performs the multiply-add operation and the inverse transformation operation in the Winograd convolution operation to obtain the Winograd convolution result, where the inverse transformation operation is decomposed into a summation operation.
  • In a possible implementation, the storage unit includes a master storage unit and a slave storage unit; the master storage unit is used to receive and store the feature data, and the slave storage unit is used to receive and store the forward transformation data of the weight data.
  • In a possible implementation, the master operation unit is specifically used to decompose the feature data into multiple sub-tensors, perform transformation operations on the multiple sub-tensors and sum them, and obtain the forward transformation data of the feature data according to the result of the summation operation. The feature data is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and the non-zero element of each sub-tensor equals the element at the corresponding position in the feature data.
  • In a possible implementation, the master operation unit is specifically used to obtain the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; to multiply the non-zero element value of the sub-tensor, as a coefficient, by the Winograd transformation result of the corresponding meta-sub-tensor to obtain the Winograd transformation result of the sub-tensor; and to add the Winograd transformation results of the multiple sub-tensors to obtain the forward transformation data of the feature data. For each sub-tensor, the meta-sub-tensor is left-multiplied by the left-multiplication matrix and right-multiplied by the right-multiplication matrix, both determined by the scale of the sub-tensor and the Winograd transformation type (forward or inverse).
  • In a possible implementation, the slave operation unit is specifically used to perform element-wise multiplication of the forward transformation data of the feature data and the forward transformation data of the weight data to obtain the multiplication result, decompose the multiplication result into multiple sub-tensors, and transform and sum the multiple sub-tensors to obtain the Winograd convolution result. The multiplication result is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and the non-zero element of each sub-tensor equals the element at the corresponding position in the multiplication result.
  • In a possible implementation, the slave operation unit is specifically used to obtain the Winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; to multiply the non-zero element value, as a coefficient, by the Winograd transformation result of the corresponding meta-sub-tensor; and to add the Winograd transformation results of the multiple sub-tensors to obtain the Winograd convolution result. For each sub-tensor, the meta-sub-tensor is left-multiplied by the left-multiplication matrix and right-multiplied by the right-multiplication matrix to obtain its Winograd transformation result.
  • In a possible implementation, the storage unit includes a master storage unit and a slave storage unit; the master storage unit is used to receive and store the feature data, and the slave storage unit is used to receive and store the weight data. In this case, the slave operation unit is also used to decompose the weight data into multiple sub-tensors, transform and sum the multiple sub-tensors, and obtain the forward transformation data of the weight data according to the result of the summation operation.
  • In a possible implementation, the storage unit includes a master storage unit and a slave storage unit; the master storage unit is used to receive and store the feature data and the weight data, and the slave storage unit is used to receive and store the forward transformation data of the weight data. In this case, the master operation unit is also used to decompose the weight data into multiple sub-tensors, transform and sum the multiple sub-tensors, obtain the forward transformation data of the weight data according to the result of the summation operation, and send the forward transformation data of the weight data to the slave storage unit.
  • In a possible implementation, the master operation unit includes a main processing module and a cache. The main processing module is configured to, in response to the first control instruction, extract data from the storage unit and perform the forward transformation operation in the Winograd convolution operation to obtain the forward transformation result; the cache is used to store the forward transformation results and to send them to the slave operation unit once the stored results accumulate to the preset number.
  • In a possible implementation, the slave operation unit is also used to store the Winograd convolution result in the preset address space of the storage unit.
  • In a possible implementation, the master operation unit is communicatively connected with multiple slave operation units, and different slave operation units are responsible for processing the results of different forward transformation operations.
  • In a possible implementation, the operation processes of the master operation unit and the slave operation unit run in parallel.
  • In a possible implementation, the slave operation unit performs element-wise multiplication between each element position of the forward transformation data of the feature data and the corresponding element position of the forward transformation data of the weight data, until the product for every element position has been calculated, thereby obtaining the multiplication result.
  • In a possible implementation, the feature data stored in the master storage unit is divided into multiple pieces of first data used for the Winograd convolution operation, the size of the first data being determined according to the size of the convolution kernel, and the forward transformation data of the weight data stored in the slave storage unit is divided into multiple pieces of second data, the size of the second data being determined according to the size of the first data.
  • In response to the first control instruction sent by the master control unit, the main processing module of the master operation unit sequentially obtains the first data from the master storage unit, performs the forward transformation operation on the first data to obtain its forward transformation result, and stores that result in the cache; it then sends the forward transformation results of the first data in the cache to the slave operation unit in turn.
  • In response to the second control instruction sent by the slave control unit, the slave operation unit obtains the second data from the slave storage unit, performs element-wise multiplication of the forward transformation result of the first data with the second data to obtain the element-wise multiplication result, and performs the inverse transformation operation on the element-wise multiplication result to obtain the inverse transformation result.
  • The slave operation unit obtains the Winograd convolution result according to the inverse transformation result, and sends the Winograd convolution result to the master storage unit for storage.
  • The computing method according to the embodiments of the present disclosure can be applied to any processor of a processing system (for example, an artificial intelligence chip) that includes multiple processors (multi-core). The processor may be a general-purpose processor, such as a CPU (central processing unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and so on; machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on.
  • The artificial intelligence processor may include, for example, one or a combination of a GPU (graphics processing unit), an NPU (neural-network processing unit), a DSP (digital signal processing unit), and an FPGA (field-programmable gate array) chip.
  • The processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution computing tasks, pooling tasks, or fully connected tasks. The present disclosure does not limit the processing units or the tasks they run.
  • FIG. 5 is a schematic diagram of a processing system for a computing method according to an embodiment of the present disclosure. The processing system 10 includes multiple processors 11 and a memory 12; the multiple processors 11 are used to execute instruction sequences, and the memory 12 is used to store data and may include random access memory (RAM) and a register file. The multiple processors 11 in the processing system 10 can share part of the storage space, for example part of the RAM storage space and the register file, and can also have their own storage spaces at the same time.
  • Although the steps in the flowchart are displayed in sequence according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily executed at the same moment, but can be executed at different moments, and their execution order is not necessarily sequential, but may alternate or interleave with other steps or with at least part of the sub-steps or stages of other steps.
  • The functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together. The above integrated unit/module can be implemented in the form of hardware or in the form of a software program module.
  • When implemented in hardware, the hardware may be a digital circuit, an analog circuit, and so on; the physical realization of the hardware structure includes but is not limited to transistors, memristors, and so on.
  • Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit may be any suitable magnetic or magneto-optical storage medium, such as RRAM (resistive random access memory), DRAM (dynamic random access memory), SRAM (static random-access memory), EDRAM (enhanced dynamic random access memory), HBM (high-bandwidth memory), HMC (hybrid memory cube), and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes a number of instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above computing device.
  • In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip, where the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
  • FIG. 6 is a structural block diagram of a board card according to an embodiment of the present disclosure. In addition to the chip 389 described above, the board card may include other supporting components, including but not limited to a storage device 390, an interface device 391, and a control device 392.
  • The storage device 390 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 393, each group being connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (double data rate synchronous dynamic random access memory).
  • DDR doubles the speed of SDRAM without increasing the clock frequency, because it allows data to be read on both the rising and the falling edge of the clock pulse; the speed of DDR is thus twice that of standard SDRAM.
  • In one embodiment, the storage device may include four groups of storage units, and each group may include multiple DDR4 particles (chips). The artificial intelligence chip may internally include four 72-bit DDR4 controllers, where 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 particles are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
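  • The quoted figure follows directly from the transfer rate and the 64-bit data width (a standard calculation; the patent states only the result):

```latex
% Theoretical per-channel bandwidth of DDR4-3200 with a 64-bit data width
% (the 8 ECC bits carry no payload):
3200\,\mathrm{MT/s} \times \frac{64\,\mathrm{bit}}{8\,\mathrm{bit/byte}}
  = 25600\,\mathrm{MB/s}
```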
  • In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice in one clock cycle. A controller for controlling the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to realize data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be another interface; the present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can realize the transfer function.
  • the calculation result of the artificial intelligence chip is transmitted back to the external device (such as a server) by the interface device.
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • an artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads; the artificial intelligence chip can therefore be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
  • an electronic device which includes the aforementioned artificial intelligence chip.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • Vehicles include airplanes, ships, and/or cars; household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; medical equipment includes nuclear magnetic resonance instruments, B-ultrasound scanners, and/or electrocardiographs.
  • An arithmetic device for performing winograd convolution operations, including: a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit;
  • the master control unit is used to send a first control instruction; the first control instruction is used to instruct the master arithmetic unit to perform the forward transformation operation in the winograd convolution operation and to instruct the slave control unit to send a second control instruction; the second control instruction is used to instruct the slave arithmetic unit to perform the multiplication-addition operation and the inverse transformation operation in the winograd convolution operation;
  • the storage unit is used to store data for the winograd convolution operation;
  • the master arithmetic unit is used to respond to the first control instruction, extract data from the storage unit, and perform the forward transformation operation in the winograd convolution operation to obtain the forward transformation result; wherein the forward transformation operation is disassembled into a summation operation;
  • the slave arithmetic unit is used to respond to the second control instruction, obtain the forward transformation result from the master arithmetic unit, extract data from the storage unit, and perform the multiplication-addition operation and the inverse transformation operation in the winograd convolution operation to obtain the winograd convolution operation result; wherein the inverse transformation operation is disassembled into a summation operation.
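  • For orientation, the following is a minimal numpy sketch of the winograd convolution algorithm that these units cooperate to execute: forward transformation of the feature tile and the weight, element-wise multiplication, and inverse transformation. It assumes the standard F(2×2, 3×3) transform matrices from the literature; it is an illustration under those assumptions, not the disclosed hardware implementation, and all names in it are ours.

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices (textbook values,
# not taken from the disclosure).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f2x2_3x3(d, g):
    """One 4x4 feature tile d and one 3x3 kernel g -> one 2x2 output tile."""
    V = B_T @ d @ B_T.T     # forward transformation of the feature tile
    U = G @ g @ G.T         # forward transformation of the weight
    M = U * V               # element-wise multiplication
    return A_T @ M @ A_T.T  # inverse transformation

# Check against direct (cross-correlation style) convolution on one tile.
rng = np.random.default_rng(0)
d = rng.standard_normal((4, 4))
g = rng.standard_normal((3, 3))
direct = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                   for i in range(2)])
assert np.allclose(winograd_f2x2_3x3(d, g), direct)
```

  • In this form the element-wise multiplication needs only 16 multiplications per 2×2 output tile instead of 36 for the direct method, which is the usual motivation for winograd convolution; the transforms themselves, however, are still matrix multiplications, and that is exactly the overhead the summation-based disassembly described in the clauses below removes.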
  • the storage unit includes: a master storage unit and a slave storage unit;
  • the master storage unit is used to receive and store feature data;
  • the slave storage unit is used to receive and store the forward transformation data of the weight data.
  • the master arithmetic unit is specifically used to disassemble the feature data into multiple sub-tensors, perform transformation operations on the multiple sub-tensors and sum them, and obtain the forward transformation data of the feature data according to the result of the summation operation.
  • the master arithmetic unit is specifically used to parse the feature data to obtain multiple sub-tensors, where the feature data is the sum of the multiple sub-tensors, the number of sub-tensors is the same as the number of non-zero elements in the feature data, each sub-tensor contains a single non-zero element, and the non-zero element in each sub-tensor is identical to the non-zero element at the corresponding position in the feature data.
  • the master arithmetic unit is specifically used to obtain the winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where a meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; the non-zero element value of the sub-tensor is used as a coefficient and multiplied by the winograd transformation result of the corresponding meta-sub-tensor to obtain the winograd transformation result of the sub-tensor; the winograd transformation results of the multiple sub-tensors are then added to obtain the forward transformation data of the feature data.
  • the master arithmetic unit is specifically used, for each sub-tensor, to left-multiply the meta-sub-tensor corresponding to the sub-tensor by the left-multiplication matrix and right-multiply it by the right-multiplication matrix, obtaining the winograd transformation result of the meta-sub-tensor; the left-multiplication matrix and the right-multiplication matrix are both determined by the size of the sub-tensor and the winograd transformation type, where the winograd transformation type includes the winograd transformation type of the forward transformation and the winograd transformation type of the inverse transformation.
  • the slave arithmetic unit is specifically used to perform an element-wise multiplication operation on the forward transformation data of the feature data and the forward transformation data of the weight data to obtain the multiplication result;
  • the multiplication result is disassembled into multiple sub-tensors; the multiple sub-tensors are transformed and summed to obtain the winograd convolution operation result.
  • the slave arithmetic unit is specifically used to parse the multiplication result to obtain multiple sub-tensors, where the multiplication result is the sum of the multiple sub-tensors, the number of sub-tensors is the same as the number of non-zero elements in the multiplication result, each sub-tensor contains a single non-zero element, and the non-zero element in each sub-tensor is identical to the non-zero element at the corresponding position in the multiplication result.
  • the slave arithmetic unit is specifically used to obtain the winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where a meta-sub-tensor is the tensor obtained by setting the non-zero element of the sub-tensor to 1; the non-zero element value of the sub-tensor is used as a coefficient and multiplied by the winograd transformation result of the corresponding meta-sub-tensor to obtain the winograd transformation result of the sub-tensor; the winograd transformation results of the multiple sub-tensors are then added to obtain the winograd convolution operation result.
  • Clause 10. The arithmetic device according to Clause 9, wherein the slave arithmetic unit is specifically used, for each sub-tensor, to left-multiply the meta-sub-tensor corresponding to the sub-tensor by the left-multiplication matrix and right-multiply it by the right-multiplication matrix, obtaining the winograd transformation result of the meta-sub-tensor.
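  • A minimal numpy sketch of the sub-tensor disassembly described in the clauses above (illustrative, assuming a 4×4 tile with the standard F(2×2, 3×3) forward-transform matrix; the function and variable names are ours). Each meta-sub-tensor's winograd transformation result can be precomputed once, so the per-tile forward transformation reduces to scaling those precomputed patterns by the element values and summing:

```python
import numpy as np

B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)

# Precompute the winograd transformation result of every meta-sub-tensor:
# a 4x4 tensor whose single non-zero element, set to 1, sits at (i, j).
meta_transforms = {}
for i in range(4):
    for j in range(4):
        e = np.zeros((4, 4))
        e[i, j] = 1.0
        meta_transforms[(i, j)] = B_T @ e @ B_T.T

def forward_transform_by_summation(d):
    """Disassemble d into sub-tensors with a single non-zero element each,
    then sum the scaled, precomputed meta-sub-tensor transforms."""
    result = np.zeros((4, 4))
    for (i, j), val in np.ndenumerate(d):
        if val != 0:  # one sub-tensor per non-zero element of d
            result += val * meta_transforms[(i, j)]
    return result

d = np.random.default_rng(1).standard_normal((4, 4))
assert np.allclose(forward_transform_by_summation(d), B_T @ d @ B_T.T)
```

  • Because the transform matrix contains only 0 and ±1, every precomputed pattern also contains only 0 and ±1, so in hardware the accumulation can be realized with additions and subtractions alone; the same disassembly applies to the inverse transformation, with the inverse-transform matrices in place of the forward-transform matrix.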
  • the storage unit includes: a master storage unit and a slave storage unit;
  • the master storage unit is used to receive and store feature data;
  • the slave storage unit is used to receive weight data and store it.
  • the slave arithmetic unit is also used to disassemble the weight data into multiple sub-tensors, perform transformation operations on the multiple sub-tensors and sum them, and obtain the forward transformation data of the weight data according to the result of the summation operation.
  • the storage unit includes: a master storage unit and a slave storage unit;
  • the master storage unit is used to receive and store feature data and weight data;
  • the slave storage unit is used to receive and store the forward transformation data of the weight data.
  • the master arithmetic unit is also used to disassemble the weight data into multiple sub-tensors, perform transformation operations on the multiple sub-tensors and sum them, and obtain the forward transformation data of the weight data according to the result of the summation operation;
  • the forward transformation data of the weight data is then sent to the slave storage unit.
  • the master arithmetic unit includes: a main processing module and a buffer;
  • the main processing module is used to respond to the first control instruction, extract data from the storage unit, and perform the forward transformation operation in the winograd convolution operation to obtain the forward transformation result;
  • the buffer is used to store the forward transformation results and to send them to the slave arithmetic unit when the stored forward transformation results have accumulated to a preset number.
  • the slave arithmetic unit is also used to store the result of the winograd convolution operation in the preset address space of the storage unit.
  • the master arithmetic unit is communicatively connected with multiple slave arithmetic units, and different slave arithmetic units are responsible for operating on the results of different forward transformation operations.
  • the master arithmetic unit and the slave arithmetic units operate in parallel.
  • the slave arithmetic unit performs element-wise multiplication between each element position of the forward transformation data of the feature data and the corresponding element position of the forward transformation data of the weight data, until the product for every element position has been calculated, thereby obtaining the multiplication result.
  • the feature data stored in the master storage unit is divided into a plurality of first data used for the winograd convolution operation, and the size of the first data is determined according to the size of the convolution kernel;
  • the forward transformation data of the weight data stored in the slave storage unit is divided into a plurality of second data used for the winograd convolution operation, and the size of the second data is determined according to the size of the first data;
  • the main processing module of the master arithmetic unit responds to the first control instruction sent by the master control unit, obtains the first data in sequence from the master storage unit, performs the forward transformation operation on the first data to obtain the forward transformation result of the first data, and stores the forward transformation result of the first data in the buffer;
  • the main processing module of the master arithmetic unit sends the forward transformation results of the first data in the buffer to the slave arithmetic unit in turn;
  • the slave arithmetic unit responds to the second control instruction sent by the slave control unit, obtains the second data from the slave storage unit, performs an element-wise multiplication operation on the forward transformation result of the first data and the second data to obtain the element-wise multiplication result, and performs the inverse transformation operation on the element-wise multiplication result to obtain the inverse transformation result;
  • the slave arithmetic unit obtains the winograd convolution operation result according to the inverse transformation result, and sends the winograd convolution operation result to the master storage unit for storage.
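  • The data flow of the clauses above can be summarized in a short sequential sketch (illustrative names and Python control flow; in the device the master and slave units run in parallel, as stated earlier):

```python
import numpy as np

# Standard F(2x2, 3x3) matrices again (textbook values, not from the disclosure).
B_T = np.array([[1, 0, -1, 0], [0, 1, 1, 0],
                [0, -1, 1, 0], [0, 1, 0, -1]], dtype=float)
A_T = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], dtype=float)

def forward(tile):  # master: forward transformation of one 4x4 first-data tile
    return B_T @ tile @ B_T.T

def inverse(m):     # slave: inverse transformation down to a 2x2 output tile
    return A_T @ m @ A_T.T

def run_pipeline(first_data_tiles, transformed_weight, preset_number=4):
    """first_data_tiles: 4x4 tiles cut from the feature data;
    transformed_weight: forward transformation data of one 3x3 weight (4x4)."""
    results, buffer = [], []
    for tile in first_data_tiles:
        buffer.append(forward(tile))       # master fills the buffer...
        if len(buffer) == preset_number:   # ...up to the preset number,
            for v in buffer:               # then sends the results to the slave
                results.append(inverse(v * transformed_weight))
            buffer.clear()
    for v in buffer:                       # flush any remaining results
        results.append(inverse(v * transformed_weight))
    return results
```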
  • Clause 21. An artificial intelligence chip, the chip including the arithmetic device of any one of Clauses 1-20.
  • Clause 22. An electronic device, the electronic device including the artificial intelligence chip of Clause 21.
  • a board card which includes: a storage device, an interface device, a control device, and an artificial intelligence chip as in Clause 21;
  • the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively;
  • the storage device is used to store data;
  • the interface device is used to realize data transmission between the artificial intelligence chip and external equipment;
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the storage device includes: multiple groups of storage units, each group of storage units is connected to the artificial intelligence chip through a bus, and the storage unit is: DDR SDRAM;
  • the chip includes: a DDR controller used to control the data transmission and data storage of each storage unit;
  • the interface device is: a standard PCIE interface.
  • the arithmetic device includes: a master control unit, a slave control unit, a storage unit, a master arithmetic unit, and a slave arithmetic unit; the method includes:
  • the master control unit sends a first control instruction; the first control instruction is used to instruct the master arithmetic unit to perform the forward transformation operation in the winograd convolution operation and to instruct the slave control unit to send a second control instruction; the second control instruction is used to instruct the slave arithmetic unit to perform the multiplication-addition operation and the inverse transformation operation in the winograd convolution operation;
  • the storage unit stores data used for the winograd convolution operation;
  • the master arithmetic unit responds to the first control instruction and extracts data from the storage unit to perform the forward transformation operation in the winograd convolution operation to obtain the forward transformation result; wherein the forward transformation operation is disassembled into a summation operation;
  • the slave arithmetic unit responds to the second control instruction, obtains the forward transformation result from the master arithmetic unit, extracts data from the storage unit, and performs the multiplication-addition operation and the inverse transformation operation in the winograd convolution operation to obtain the winograd convolution operation result; wherein the inverse transformation operation is disassembled into a summation operation.
  • the above scheme can effectively improve the processing efficiency of the chip.
  • Since matrix multiplication is still performed in the winograd convolution forward transformation and the winograd convolution inverse transformation, there is still a large overhead in the hardware implementation process. Therefore, in order to further improve processing efficiency, the embodiments of the present application also propose a winograd convolution operation method.
  • the winograd convolution operation method can be applied in the hardware implementation process of the convolutional neural network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to a computing device and method, and a related product. The device comprises a master control unit, a slave control unit, a storage unit, a master computing unit, and a slave computing unit. The invention effectively improves the energy-efficiency ratio and computation speed of deep learning networks at the hardware-architecture level, thereby improving the performance of deep learning networks.
PCT/CN2020/113160 2019-11-01 2020-09-03 Dispositif et procédé de calcul, et produit associé WO2021082722A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911061078.3 2019-11-01
CN201911061078.3A CN112765539B (zh) 2019-11-01 2019-11-01 运算装置、方法及相关产品

Publications (1)

Publication Number Publication Date
WO2021082722A1 true WO2021082722A1 (fr) 2021-05-06

Family

ID=75692126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113160 WO2021082722A1 (fr) 2019-11-01 2020-09-03 Dispositif et procédé de calcul, et produit associé

Country Status (2)

Country Link
CN (1) CN112765539B (fr)
WO (1) WO2021082722A1 (fr)


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11055063B2 (en) * 2016-05-02 2021-07-06 Marvell Asia Pte, Ltd. Systems and methods for deep learning processor
WO2018107383A1 (fr) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Procédé et dispositif de calcul de convolution d'un réseau de neurones artificiels, et support d'enregistrement lisible par ordinateur
CN108229656A (zh) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 神经网络运算装置及方法
WO2018108126A1 (fr) * 2016-12-14 2018-06-21 上海寒武纪信息科技有限公司 Dispositif et procédé pour opération de convolution de réseau neuronal
US10482155B2 (en) * 2016-12-30 2019-11-19 Intel Corporation Winograd algorithm on a matrix processing architecture
US10990648B2 (en) * 2017-08-07 2021-04-27 Intel Corporation System and method for an optimized winograd convolution accelerator
US10372787B2 (en) * 2017-12-12 2019-08-06 Facebook, Inc. Hardware accelerator pre-configured with coefficients for matrix-transform operations
CN110163349B (zh) * 2018-02-12 2021-03-23 上海寒武纪信息科技有限公司 一种网络模型的计算方法及装置
CN110147249B (zh) * 2018-02-12 2021-02-09 上海寒武纪信息科技有限公司 一种网络模型的计算方法及装置
US11586907B2 (en) * 2018-02-27 2023-02-21 Stmicroelectronics S.R.L. Arithmetic unit for deep learning acceleration

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325591A (zh) * 2018-09-26 2019-02-12 中国科学院计算技术研究所 面向Winograd卷积的神经网络处理器
CN109359730A (zh) * 2018-09-26 2019-02-19 中国科学院计算技术研究所 面向固定输出范式Winograd卷积的神经网络处理器

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FENG SHI, HAOCHEN LI, YUHE GAO, BENJAMIN KUSCHNER, SONG-CHUN ZHU: "Sparse Winograd Convolutional neural networks on small-scale systolic arrays", COMPUTER SCIENCE, 3 October 2018 (2018-10-03), pages 1 - 7, XP080933823 *

Also Published As

Publication number Publication date
CN112765539B (zh) 2024-02-02
CN112765539A (zh) 2021-05-07

Similar Documents

Publication Publication Date Title
CN109522052B (zh) 一种计算装置及板卡
CN109543832B (zh) 一种计算装置及板卡
TWI795519B (zh) 計算裝置、機器學習運算裝置、組合處理裝置、神經網絡芯片、電子設備、板卡及執行機器學習計算的方法
CN109670581B (zh) 一种计算装置及板卡
CN110059797B (zh) 一种计算装置及相关产品
WO2021082725A1 (fr) Procédé d'opération de convolution winograd et produit associé
WO2021083101A1 (fr) Procédé et appareil de traitement de données, et produit connexe
CN115221102B (zh) 用于优化片上系统的卷积运算操作的方法和相关产品
WO2021185262A1 (fr) Appareil de calcul et procédé, carte de panneau et support de stockage lisible par ordinateur
WO2021082723A1 (fr) Appareil d'execution
WO2021082722A1 (fr) Dispositif et procédé de calcul, et produit associé
WO2021082746A1 (fr) Appareil d'exploitation et produit associé
WO2021082747A1 (fr) Appareil d'exploitation et produit associé
WO2021223642A1 (fr) Procédé et appareil de traitement de données, et produit associé
WO2021082721A1 (fr) Procédé, appareil et dispositif de fonctionnement de convolution de winograd, et support de stockage
WO2021082724A1 (fr) Procédé d'opération et produit associé
CN111382852B (zh) 数据处理装置、方法、芯片及电子设备
CN111047030A (zh) 运算方法、装置、计算机设备和存储介质
CN111061507A (zh) 运算方法、装置、计算机设备和存储介质
WO2021223644A1 (fr) Procédé et dispositif de traitement de données, et produit associé
CN111222632B (zh) 计算装置、计算方法及相关产品
WO2021169914A1 (fr) Procédé et appareil de traitement par quantification de données, dispositif électronique et support de stockage
WO2021223645A1 (fr) Procédé et appareil de traitement de données, et produit associé
WO2021212972A1 (fr) Procédé de fonctionnement, processeur et produit associé
WO2021223638A1 (fr) Procédé et dispositif de traitement de données, et produit associé

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20880593

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20880593

Country of ref document: EP

Kind code of ref document: A1