WO2021082747A1 - Operational apparatus and related product - Google Patents

Operational apparatus and related product

Publication number
WO2021082747A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
winograd
tensor
result
input data
Application number
PCT/CN2020/114057
Other languages
French (fr)
Chinese (zh)
Inventor
张英男
江广
刘少礼
高钰峰
于涌
周徐达
Original Assignee
中科寒武纪科技股份有限公司
Application filed by 中科寒武纪科技股份有限公司
Publication of WO2021082747A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/30036 Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • The present disclosure relates to the field of data processing technology, and in particular to an arithmetic device and related products.
  • Neural network algorithms are a widely used class of machine learning algorithms and have achieved very good results in fields such as image recognition, speech recognition, and natural language processing.
  • The complexity of these algorithms keeps increasing, and the scale of the models is gradually increasing as well.
  • Processing these large-scale models on GPUs and CPUs requires a lot of computing time and consumes a lot of power.
  • An arithmetic device is provided, including: a master instruction processing unit, a master functional unit, a slave instruction processing unit, and a slave functional unit.
  • The master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction.
  • The master functional unit is configured to decompose the winograd forward transformation of the input data into a summation operation according to the first control signal, perform the calculation to obtain the winograd forward transformation result of the input data, and send the winograd forward transformation result of the input data to the slave functional unit, where the winograd forward transformation result of the input data includes the winograd forward transformation result of the input neuron.
  • The master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit.
  • The slave functional unit is configured to perform element-wise multiplication on the winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight according to the second control signal to obtain an element-wise multiplication result, decompose the winograd inverse transformation of the element-wise multiplication result into a summation operation, and perform the calculation to obtain the winograd convolution result of the input data.
  • An artificial intelligence chip is provided, including the arithmetic device described above.
  • An electronic device is provided, including the artificial intelligence chip described above.
  • According to the arithmetic device of the present disclosure, the winograd forward transformation of the input data is decomposed into a summation operation and calculated to obtain the winograd forward transformation result of the input data, which includes the winograd forward transformation result of the input neuron. The winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight are then multiplied element-wise to obtain the element-wise multiplication result.
  • The winograd inverse transformation of the element-wise multiplication result is likewise decomposed into a summation operation and calculated to obtain the winograd convolution result of the input data. Decomposing the multiplication operations into summation operations in this way can save calculation time and reduce energy consumption.
  • Fig. 1 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure.
  • Fig. 2 shows a flowchart of calculating the winograd forward transformation result of the input data from the first sub-tensors according to an embodiment of the present disclosure.
  • Fig. 3 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure.
  • Fig. 4 shows a flowchart of performing the winograd inverse transformation on the element-wise multiplication result according to an embodiment of the present disclosure.
  • Fig. 5 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • Fig. 6 shows a structural block diagram of a board according to an embodiment of the present disclosure.
  • The term “if” can be interpreted as “when”, “once”, “in response to determination”, or “in response to detection”, depending on the context.
  • Similarly, the phrase “if determined” or “if [described condition or event] is detected” can be interpreted as “once determined”, “in response to determination”, “once [described condition or event] is detected”, or “in response to detection of [described condition or event]”, depending on the context.
  • Winograd convolution is a convolution acceleration method based on a polynomial interpolation algorithm. It splits the two inputs of the convolution operation, the neurons and the weights, into tiles of a certain scale, applies a linear transformation to each (the winograd forward transformation), multiplies the transformed neurons and weights element-wise, and finally applies another linear transformation to the element-wise multiplication result (the winograd inverse transformation) to obtain a convolution result equivalent to that of the original convolution operation.
  • $g$ represents the weight.
  • $G$ represents the left-multiplication forward transformation matrix corresponding to the weight.
  • $G^{T}$ represents the right-multiplication forward transformation matrix corresponding to the weight.
  • $d$ represents the input neuron.
  • $B$ represents the right-multiplication forward transformation matrix corresponding to the input neuron.
  • $B^{T}$ represents the left-multiplication forward transformation matrix corresponding to the input neuron.
  • $A$ represents the right-multiplication inverse transformation matrix.
  • $A^{T}$ represents the left-multiplication inverse transformation matrix.
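The symbols listed above match the commonly cited form of two-dimensional winograd convolution. The expression itself did not survive in this text, so the following is a reconstruction under that assumption, where $\odot$ denotes element-wise multiplication and $S$ (a name introduced here) denotes the convolution result:

```latex
S = A^{T}\left[\left(G\,g\,G^{T}\right)\odot\left(B^{T}\,d\,B\right)\right]A
```

Reading from the inside out: $G g G^{T}$ and $B^{T} d B$ are the two forward transformations, $\odot$ is the element-wise multiplication, and the outer $A^{T}(\cdot)A$ is the inverse transformation.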
  • The present disclosure provides an arithmetic device that can decompose the multiplication operations in the winograd convolution process into addition operations, thereby saving calculation time and reducing energy consumption.
  • Fig. 1 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure.
  • The arithmetic device provided by the present disclosure may include a master instruction processing unit, a master functional unit, a master memory unit, a slave instruction processing unit, a slave functional unit, and a slave memory unit.
  • The master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction.
  • The master functional unit is configured to decompose the winograd forward transformation of the input data into a summation operation according to the first control signal, perform the calculation to obtain the winograd forward transformation result of the input data, and send the winograd forward transformation result of the input data to the slave functional unit, where the winograd forward transformation result of the input data includes the winograd forward transformation result of the input neuron.
  • The master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit.
  • The slave functional unit is configured to perform element-wise multiplication on the winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight according to the second control signal to obtain an element-wise multiplication result, decompose the winograd inverse transformation of the element-wise multiplication result into a summation operation, and perform the calculation to obtain the winograd convolution result of the input data.
  • According to the arithmetic device of the present disclosure, the winograd forward transformation of the input data is decomposed into a summation operation and calculated to obtain the winograd forward transformation result of the input data, which includes the winograd forward transformation result of the input neuron. The winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight are then multiplied element-wise to obtain the element-wise multiplication result.
  • The winograd inverse transformation of the element-wise multiplication result is likewise decomposed into a summation operation and calculated to obtain the winograd convolution result of the input data. Decomposing the multiplication operations into summation operations in this way can save calculation time and reduce energy consumption.
  • The above-mentioned input instruction may be the "WINO_CONV" instruction, which covers the processes of winograd forward transformation, element-wise multiplication, and winograd inverse transformation.
  • The input instruction may carry information about the operation and the operands corresponding to the input instruction. The operand information may include the address information of each operand, the size of each operand, and so on.
  • For example, the operand information can include the address information of the input neuron, the address information of the weight, the address information at which the resulting output operand (the input operand, or input neuron, of the next layer) is to be stored, and so on.
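As a purely illustrative sketch of the operand information described above, the following models a decoded instruction as a small record. Every field name, the dictionary-based encoding, and the `parse` helper are assumptions made for this example; the text does not specify an instruction format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WinoConvOperands:
    """Hypothetical operand information carried by a WINO_CONV instruction."""
    neuron_addr: int               # address of the input neurons
    neuron_shape: Tuple[int, ...]  # size of the input-neuron operand
    weight_addr: int               # address of the weights
    weight_shape: Tuple[int, ...]  # size of the weight operand
    output_addr: int               # where the output operand is to be stored


@dataclass(frozen=True)
class Instruction:
    opcode: str                    # e.g. "WINO_CONV"
    operands: WinoConvOperands


def parse(opcode: str, fields: dict) -> Instruction:
    """Decode raw fields (an assumed encoding) into an Instruction record."""
    operands = WinoConvOperands(
        neuron_addr=fields["neuron_addr"],
        neuron_shape=tuple(fields["neuron_shape"]),
        weight_addr=fields["weight_addr"],
        weight_shape=tuple(fields["weight_shape"]),
        output_addr=fields["output_addr"],
    )
    return Instruction(opcode=opcode, operands=operands)


inst = parse("WINO_CONV", {
    "neuron_addr": 0x1000, "neuron_shape": [1, 4, 4, 1],
    "weight_addr": 0x2000, "weight_shape": [3, 3],
    "output_addr": 0x3000,
})
```

The output address plays the role of the next layer's input-neuron address, which is why it sits alongside the two input operands.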
  • For any layer of a neural network, the input data of the layer can include input neurons, and the input neurons of the layer can be the output results of the previous layer.
  • For the first layer of the neural network, the input neurons can be the initial input data.
  • The initial input data can be image data, sound data, or video data.
  • The processes of winograd forward transformation, element-wise multiplication, and winograd inverse transformation are described below.
  • The master instruction processing unit is further configured to, after receiving the input instruction, send the first control signal to the master memory unit according to the input instruction; the master memory unit is configured to send the input data to the master functional unit according to the first control signal.
  • The master instruction processing unit can decode and parse the input instruction to obtain the operation and the address information of the operands. It may then send the first control signal to the master memory unit and the master functional unit according to the parsed information.
  • The master memory unit may obtain the input data according to the first control signal.
  • The input data may include input neurons and may be represented in the form of a tensor.
  • For example, the input data can be expressed in NHWC (batch, height, width, channels) form, where N represents the number of images, H and W represent the number of pixels in the height and width directions of the image respectively, and C represents the number of channels; for example, C can represent the three RGB (Red, Green, Blue) channels. This representation is only an example, and the present disclosure is not limited to it.
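To make the NHWC layout concrete, the sketch below computes the flat offset of one element in a contiguous NHWC buffer. The function name and the assumption of a densely packed, row-major buffer are illustrative; the text does not say how the memory units store tensors.

```python
def nhwc_offset(n, h, w, c, shape):
    """Flat index of element (n, h, w, c) in a densely packed NHWC buffer."""
    _, H, W, C = shape  # the batch size N is not needed for the offset
    return ((n * H + h) * W + w) * C + c


shape = (2, 4, 4, 3)  # two 4x4 RGB images
first = nhwc_offset(0, 0, 0, 0, shape)
next_pixel = nhwc_offset(0, 0, 1, 0, shape)
second_image = nhwc_offset(1, 0, 0, 0, shape)
last = nhwc_offset(1, 3, 3, 2, shape)
```

In this layout the channel index varies fastest, so the three channel values of one pixel are adjacent in memory.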
  • After obtaining the input data, the master memory unit can send it to the master functional unit.
  • After receiving the first control signal and the input data, the master functional unit can perform the operation indicated by the first control signal on the input data: it decomposes the winograd forward transformation of the input data into a summation operation and performs the calculation to obtain the winograd forward transformation result of the input data.
  • Specifically, the master functional unit is configured to decompose the input data into a plurality of first sub-tensors according to the first control signal, perform the winograd forward transformation on the plurality of first sub-tensors, and sum the results to obtain the winograd forward transformation result of the input data.
  • In one possible implementation, the number of first sub-tensors is the same as the number of elements of the input data; one element in each first sub-tensor is the same as the element at the corresponding position in the input data, and all other elements are 0.
  • In another possible implementation, the number of first sub-tensors is the same as the number of non-zero elements of the input data; one element in each first sub-tensor is the same as the element at the corresponding position in the input data, and all other elements are 0.
  • For example, suppose the input neuron is a 4×4 matrix with 16 elements. The input data can then be decomposed into 16 first sub-tensors.
  • With the input neuron written as $d = (d_{ij})$, $i, j = 0, \dots, 3$, the 16 first sub-tensors are the 4×4 matrices that each keep a single element $d_{ij}$ at position $(i, j)$. That is, each first sub-tensor has one element that is the same as the element at the corresponding position in the input data, and all other elements are 0.
  • The above decomposition is only an example of the present disclosure and does not limit it in any way.
  • For example, if the input data contains elements whose value is 0, the number of first sub-tensors obtained by the decomposition can be smaller than the number of elements of the input data; for instance, it can equal the number of non-zero elements of the input data.
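The decomposition described above can be sketched in a few lines. The helper names and the `skip_zeros` flag are assumptions made for this example; the sample matrix is arbitrary.

```python
def decompose(tensor, skip_zeros=False):
    """Split a 2-D tensor into sub-tensors that each keep one element
    at its original position and are 0 everywhere else."""
    rows, cols = len(tensor), len(tensor[0])
    subs = []
    for i in range(rows):
        for j in range(cols):
            if skip_zeros and tensor[i][j] == 0:
                continue  # a zero element contributes nothing to the sum
            sub = [[0] * cols for _ in range(rows)]
            sub[i][j] = tensor[i][j]
            subs.append(sub)
    return subs


def add_tensors(tensors):
    """Element-wise sum of equally shaped 2-D tensors."""
    rows, cols = len(tensors[0]), len(tensors[0][0])
    return [[sum(t[i][j] for t in tensors) for j in range(cols)]
            for i in range(rows)]


d = [[1, 2, 0, 4],
     [5, 6, 7, 8],
     [9, 0, 11, 12],
     [13, 14, 15, 16]]

all_subs = decompose(d)                       # one sub-tensor per element
nonzero_subs = decompose(d, skip_zeros=True)  # one per non-zero element
```

Summing either list of sub-tensors reproduces `d`, which is what allows the forward transformation to be applied to each sub-tensor separately and the results added afterwards.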
  • Fig. 2 shows a flowchart of calculating the winograd forward transformation result of the input data from the first sub-tensors according to an embodiment of the present disclosure.
  • Performing the winograd forward transformation on the plurality of first sub-tensors and summing the results to obtain the winograd forward transformation result of the input data may include the following process:
  • Step S21: obtain the winograd forward transformation result of the first meta-sub-tensor corresponding to each first sub-tensor, where the first meta-sub-tensor corresponding to a first sub-tensor is a tensor in which the element at a first position has the value 1, and the first position in the first meta-sub-tensor is the same as the position of the non-zero element in the first sub-tensor;
  • Step S22: multiply the winograd forward transformation result of the corresponding first meta-sub-tensor by the non-zero element value of the first sub-tensor, used as a coefficient, to obtain the winograd forward transformation result of the first sub-tensor;
  • Step S23: add the winograd forward transformation results of the plurality of first sub-tensors to obtain the winograd forward transformation result of the input data.
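Steps S21 to S23 can be sketched as follows. The matrix `BT` is the forward transformation matrix commonly used for winograd F(2×2, 3×3) and is an assumption here, since the matrices of the disclosure are not reproduced in this text; the function names are likewise illustrative.

```python
# Assumed B^T for F(2x2, 3x3); not taken from the source text.
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def meta_transform(i, j):
    """S21: B^T E B for the meta-sub-tensor E with a single 1 at (i, j)."""
    e = [[0] * 4 for _ in range(4)]
    e[i][j] = 1
    return matmul(matmul(BT, e), transpose(BT))


# Computed once, ahead of time, and then only looked up.
META = {(i, j): meta_transform(i, j) for i in range(4) for j in range(4)}


def forward_by_summation(d):
    out = [[0] * 4 for _ in range(4)]
    for (i, j), table in META.items():
        coeff = d[i][j]          # non-zero element used as the coefficient
        if coeff == 0:
            continue
        for r in range(4):
            for c in range(4):
                # S22 and S23: table entries are 0 or +/-1 for this B^T,
                # so scaling and accumulating need only add or subtract.
                if table[r][c] == 1:
                    out[r][c] += coeff
                elif table[r][c] == -1:
                    out[r][c] -= coeff
    return out


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
reference = matmul(matmul(BT, d), transpose(BT))  # ordinary B^T d B
```

For this choice of `BT`, the summation path performs no multiplications at run time and produces the same result as the ordinary matrix product.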
  • For example, the first meta-sub-tensor corresponding to the first sub-tensor of $d_{00}$ is the 4×4 matrix whose element at position $(0, 0)$ is 1 and whose other elements are 0.
  • That is, the first meta-sub-tensor is obtained by extracting the value of the non-zero element from the first sub-tensor, and that value can be used as the coefficient of the first meta-sub-tensor.
  • The winograd forward transformation result of the first meta-sub-tensor corresponding to a first sub-tensor can be obtained in advance through the following process: for each first sub-tensor, left-multiply the corresponding first meta-sub-tensor by the forward-transformation left-multiplication matrix and right-multiply it by the forward-transformation right-multiplication matrix to obtain the winograd forward transformation result of the first meta-sub-tensor.
  • For a matrix of a given scale, the form of the corresponding first meta-sub-tensor is determined, and the corresponding forward-transformation left-multiplication and right-multiplication matrices are also determined.
  • Therefore, the winograd forward transformation result of the first meta-sub-tensor can be calculated in advance by the process described above.
  • For example, writing $E_{ij}$ (notation introduced here) for the first meta-sub-tensor with a 1 at position $(i, j)$, its winograd forward transformation result is $B^{T} E_{ij} B$.
  • In this way, the matrix multiplication operation can be broken down into an addition operation.
  • Calculating the winograd forward transformation result of a first meta-sub-tensor involves many multiplication operations.
  • With the approach of the present disclosure, the pre-calculated winograd forward transformation results of first meta-sub-tensors of various scales can be stored in the arithmetic device, so that during actual computation they can be obtained directly without repeated calculation, thereby shortening computing time and saving computing resources.
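A pre-calculated table of this kind can be sketched as below. The sketch relies on the identity that left- and right-multiplying a tensor with a single 1 at position (i, j) gives the outer product of columns i and j of the left matrix. The matrix `BT` is the commonly used F(2×2, 3×3) choice and is an assumption, as are the names.

```python
# Assumed B^T for F(2x2, 3x3); not taken from the source text.
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]


def column(m, j):
    return [row[j] for row in m]


def meta_forward(i, j):
    """B^T E_ij B, computed as the outer product of columns i and j of B^T."""
    ci, cj = column(BT, i), column(BT, j)
    return [[a * b for b in cj] for a in ci]


# Lookup table built once; later steps only read from it.
TABLE = {(i, j): meta_forward(i, j) for i in range(4) for j in range(4)}

entry_00 = TABLE[(0, 0)]
entry_11 = TABLE[(1, 1)]
```

With this `BT`, every entry of every table is 0, 1, or -1, so the table is compact to store and cheap to apply.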
  • After the winograd forward transformation result of the first meta-sub-tensor corresponding to a first sub-tensor is obtained, it can be multiplied by the value of the non-zero element in the first sub-tensor to obtain the winograd forward transformation result of the first sub-tensor.
  • For example, for the first sub-tensor of $d_{00}$, the corresponding winograd forward transformation result is $d_{00}$ times the winograd forward transformation result of its first meta-sub-tensor.
  • The winograd forward transformation results of all the first sub-tensors are calculated through the above process, and these results are added to obtain the winograd forward transformation result of the input data.
  • In this embodiment, the input data is decomposed into a plurality of first sub-tensors, and the winograd forward transformation result of the input data can be obtained by a summation operation over the pre-obtained winograd forward transformation results of the corresponding first meta-sub-tensors and the non-zero element values of the first sub-tensors.
  • According to the arithmetic device of the present disclosure, decomposing the multiplication operation into a summation operation can save calculation time and reduce energy consumption.
  • In one possible implementation, the master functional unit includes a cache module, and the master functional unit stores the winograd forward transformation result of the input data in the cache module.
  • Fig. 3 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure.
  • The master functional unit may include a main processing module and a cache module. The main processing module may be used to carry out the process, described in the embodiments above, in which the master functional unit decomposes the winograd forward transformation of the input data into a summation operation according to the first control signal and performs the calculation to obtain the winograd forward transformation result of the input data.
  • The master functional unit can store the winograd forward transformation result of the input data in the cache module, and the cache module sends it to the slave functional unit, so that the slave functional unit of the arithmetic device can perform element-wise multiplication and winograd inverse transformation on the winograd forward transformation result.
  • The master instruction processing unit is also configured to send a second control signal to the slave instruction processing unit.
  • For example, the master instruction processing unit can send the second control signal to the slave instruction processing unit according to the WINO_MIT instruction, and the master functional unit (specifically, the cache module in the master functional unit) is also used to send the winograd forward transformation result of the input data to the slave functional unit.
  • The second control signal may be used to instruct the slave instruction processing unit to control the slave functional unit to further process the winograd forward transformation result of the input data.
  • The slave instruction processing unit is configured to receive the second control signal sent by the master instruction processing unit, and send the second control signal to the slave functional unit and the slave memory unit.
  • The slave memory unit is configured to send the winograd forward transformation result of the weight to the slave functional unit according to the second control signal.
  • The slave functional unit is configured to receive the winograd forward transformation result of the input data sent by the master functional unit, which includes the winograd forward transformation result of the input neuron.
  • The winograd forward transformation result of the weight can be calculated in advance, either by a traditional matrix multiplication operation or by the decomposition into a summation operation described above.
  • For example, the weight is decomposed into a plurality of first sub-tensors, the winograd forward transformation is performed on the plurality of first sub-tensors, and the results are summed to obtain the winograd forward transformation result of the weight.
  • For example, a 3×3 weight matrix includes 9 elements. Therefore, the weight can be decomposed into 9 first sub-tensors.
  • With the weight written as $g = (g_{ij})$, $i, j = 0, 1, 2$, the 9 first sub-tensors are the 3×3 matrices that each keep a single element $g_{ij}$ at position $(i, j)$. That is, one element in each first sub-tensor is the same as the element at the corresponding position in the weight, and all other elements are 0.
  • By following steps S21 to S23, the winograd forward transformation result of the weight can be calculated, which is not repeated here.
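The two ways of obtaining the weight transformation mentioned above, ordinary matrix multiplication and summation over the 9 sub-tensors, can be compared in a short sketch. The matrix `G` is the one commonly used for F(2×2, 3×3) and is an assumption here; exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction as Fr

# Assumed G for F(2x2, 3x3); not taken from the source text.
G = [[Fr(1), Fr(0), Fr(0)],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [Fr(0), Fr(0), Fr(1)]]


def weight_forward_by_summation(g):
    """Sum over the 9 sub-tensors: g[i][j] times (G E_ij G^T)."""
    out = [[Fr(0)] * 4 for _ in range(4)]
    for i in range(3):
        for j in range(3):
            if g[i][j] == 0:
                continue
            for r in range(4):
                for c in range(4):
                    # (G E_ij G^T)[r][c] equals G[r][i] * G[c][j]
                    out[r][c] += g[i][j] * G[r][i] * G[c][j]
    return out


def weight_forward_by_matmul(g):
    """Reference: ordinary matrix product G g G^T."""
    gg = [[sum(G[r][k] * g[k][j] for k in range(3)) for j in range(3)]
          for r in range(4)]
    return [[sum(gg[r][k] * G[c][k] for k in range(3)) for c in range(4)]
            for r in range(4)]


g = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
u_sum = weight_forward_by_summation(g)
u_ref = weight_forward_by_matmul(g)
```

Because the weights are fixed after training, either path can be run once ahead of time and the 4×4 result stored.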
  • The slave functional unit is configured to perform element-wise multiplication on the winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight according to the second control signal to obtain the element-wise multiplication result, decompose the winograd inverse transformation of the element-wise multiplication result into a summation operation, and perform the calculation to obtain the winograd convolution result of the input data.
  • Element-wise multiplication means that the values at corresponding positions of two tensors are multiplied, and each product is used as the value at the corresponding position of the element-wise multiplication result.
  • The winograd forward transformation result of the input neuron can be expressed as $B^{T} d B$, and the winograd forward transformation result of the weight can be expressed as $G g G^{T}$.
  • The element-wise multiplication result can then be expressed as $(G g G^{T}) \odot (B^{T} d B)$, where $\odot$ denotes element-wise multiplication.
  • The winograd convolution result of the input data can be expressed as $S = A^{T}\left[(G g G^{T}) \odot (B^{T} d B)\right]A$.
  • The slave functional unit of the present disclosure can decompose the inverse transformation $A^{T}(\cdot)A$ into a summation operation and perform the calculation to obtain the winograd convolution result of the input data, which can further save calculation time and reduce energy consumption.
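As a sanity check on the overall pipeline (forward transformations, element-wise multiplication, inverse transformation), the sketch below compares it with a direct sliding-window computation on one 4×4 tile and one 3×3 weight. The matrices are the commonly used F(2×2, 3×3) set and are assumptions, not the matrices of the disclosure; the sample weight is an arbitrary edge filter.

```python
from fractions import Fraction as Fr

# Assumed F(2x2, 3x3) matrices; not taken from the source text.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[Fr(1), 0, 0],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [0, 0, Fr(1)]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_conv(d, g):
    u = matmul(matmul(G, g), transpose(G))      # G g G^T
    v = matmul(matmul(BT, d), transpose(BT))    # B^T d B
    m = [[x * y for x, y in zip(ru, rv)] for ru, rv in zip(u, v)]
    return matmul(matmul(AT, m), transpose(AT))  # A^T m A


def direct_conv(d, g):
    """Sliding-window sum of products, giving a 2x2 output."""
    return [[sum(d[r + i][c + j] * g[i][j]
                 for i in range(3) for j in range(3))
             for c in range(2)] for r in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
s_winograd = winograd_conv(d, g)
s_direct = direct_conv(d, g)
```

The element-wise stage uses 16 multiplications for the 2×2 output, compared with 36 in the direct computation, which is the saving that winograd convolution trades for the extra transformations.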
  • In one possible implementation, the slave functional unit is configured to decompose the element-wise multiplication result into a plurality of second sub-tensors, perform the winograd inverse transformation on the plurality of second sub-tensors, and sum the results to obtain the winograd convolution result of the input data.
  • In one possible implementation, the number of second sub-tensors is the same as the number of elements of the element-wise multiplication result; one element in each second sub-tensor is the same as the element at the corresponding position in the element-wise multiplication result, and all other elements are 0.
  • In another possible implementation, the number of second sub-tensors is the same as the number of non-zero elements of the element-wise multiplication result; one element in each second sub-tensor is the same as the element at the corresponding position in the element-wise multiplication result, and all other elements are 0.
  • For example, the element-wise multiplication result can be decomposed into 16 second sub-tensors, each of which keeps one element of the element-wise multiplication result at its original position and is 0 everywhere else.
  • After the decomposition, the winograd inverse transformation can be performed on the plurality of second sub-tensors and the results summed to obtain the winograd convolution result of the input data.
  • Fig. 4 shows a flowchart of performing the winograd inverse transformation on the element-wise multiplication result according to an embodiment of the present disclosure.
  • Performing the winograd inverse transformation on the plurality of second sub-tensors and summing the results to obtain the winograd convolution result of the input data may include the following process:
  • Step S41: obtain the winograd inverse transformation result of the second meta-sub-tensor corresponding to each second sub-tensor, where the second meta-sub-tensor corresponding to a second sub-tensor is a tensor in which the element at a second position has the value 1, and the second position in the second meta-sub-tensor is the same as the position of the non-zero element in the second sub-tensor;
  • Step S42: multiply the winograd inverse transformation result of the corresponding second meta-sub-tensor by the non-zero element value of the second sub-tensor, used as a coefficient, to obtain the winograd inverse transformation result of the second sub-tensor;
  • Step S43: add the winograd inverse transformation results of the plurality of second sub-tensors to obtain the winograd convolution result of the input data.
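Steps S41 to S43 can be sketched in the same way as the forward case. The matrix `AT` below is the inverse transformation matrix commonly used for F(2×2, 3×3) and is an assumption. With this particular choice the pre-calculated tables contain only 0 and ±1, whereas the text describes a configuration in which fractions also appear, so the sketch accumulates with a general scale-and-add.

```python
# Assumed A^T for F(2x2, 3x3); not taken from the source text.
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def meta_inverse(i, j):
    """S41: A^T E_ij A, the outer product of columns i and j of A^T."""
    return [[AT[r][i] * AT[c][j] for c in range(2)] for r in range(2)]


# Pre-calculated 2x2 tables, one per position of the 4x4 input.
META_INV = {(i, j): meta_inverse(i, j) for i in range(4) for j in range(4)}


def inverse_by_summation(m):
    out = [[0, 0], [0, 0]]
    for (i, j), table in META_INV.items():
        coeff = m[i][j]  # non-zero element used as the coefficient
        if coeff == 0:
            continue
        for r in range(2):
            for c in range(2):
                out[r][c] += coeff * table[r][c]  # S42 scaling, S43 summing
    return out


def inverse_by_matmul(m):
    """Reference: ordinary matrix product A^T m A."""
    am = [[sum(AT[r][k] * m[k][j] for k in range(4)) for j in range(4)]
          for r in range(2)]
    return [[sum(am[r][k] * AT[c][k] for k in range(4)) for c in range(2)]
            for r in range(2)]


m = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
s_sum = inverse_by_summation(m)
s_ref = inverse_by_matmul(m)
```

The table lookup replaces the two matrix products of the reference path; only the scaling by each coefficient remains.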
  • The method for determining the second meta-sub-tensor corresponding to a second sub-tensor is the same as the method for determining the first meta-sub-tensor above, and is not repeated here.
  • The winograd inverse transformation result of the second meta-sub-tensor is obtained in advance through the following process: for each second sub-tensor, left-multiply the corresponding second meta-sub-tensor by the inverse-transformation left-multiplication matrix and right-multiply it by the inverse-transformation right-multiplication matrix to obtain the winograd inverse transformation result of the second meta-sub-tensor.
  • For a matrix of a given scale, the form of the corresponding second meta-sub-tensor is determined, and the corresponding inverse-transformation left-multiplication and right-multiplication matrices are also determined. Therefore, the winograd inverse transformation result of the second meta-sub-tensor can be calculated in advance by the process described above.
  • For example, the inverse-transformation left-multiplication matrix can be a 2×4 matrix, and the inverse-transformation right-multiplication matrix can be a 4×2 matrix.
  • The dimensions of the inverse transformation matrices can be determined according to the dimensions of the input neuron, the dimensions of the weight, and the convolution stride.
  • The above is only an example and does not limit the present disclosure in any way.
  • Given the element values of the inverse transformation matrices, the matrix multiplication operation of the inverse transformation can be broken down into addition and shift operations. Multiplying the second meta-sub-tensor by the inverse transformation matrices yields the winograd inverse transformation result of the second meta-sub-tensor.
  • The element values in the winograd inverse transformation result of the second meta-sub-tensor may include fractions; such fractions can be calculated by simple shift operations, which still saves calculation time compared with multiplication operations.
  • For the specific process of steps S42 and S43, refer to steps S22 and S23 above. The difference is that the elements of the winograd inverse transformation result of the second meta-sub-tensor are not all 0 or ±1; however, the fractions can be calculated by simple shift operations, so compared with multiplication operations, the present disclosure can still save calculation time and reduce energy consumption after decomposing the ordinary inverse transformation process.
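The remark that fractions can be handled by shifts can be illustrated with fixed-point integers, where multiplying by 1/2 or 1/4 is a right shift by one or two bits. The 8-bit fractional format and the function names are assumptions for this example; the text does not state a number format.

```python
FRAC_BITS = 8  # assumed fixed-point format with 8 fractional bits


def to_fixed(x):
    """Convert a real value to a fixed-point integer."""
    return int(round(x * (1 << FRAC_BITS)))


def to_float(q):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << FRAC_BITS)


def scale_by_inverse_power_of_two(q, k):
    """Multiply fixed-point q by 1 / 2**k using an arithmetic right shift."""
    return q >> k


q = to_fixed(3.0)
half = scale_by_inverse_power_of_two(q, 1)
quarter = scale_by_inverse_power_of_two(q, 2)
negative_half = scale_by_inverse_power_of_two(to_fixed(-3.0), 1)
```

A shift costs far less in hardware than a general multiplier, which is why table entries that are powers of two keep the decomposition cheap.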
  • In this embodiment, the element-wise multiplication result is decomposed into a plurality of second sub-tensors, and the winograd convolution result of the input data can be obtained by a summation operation over the pre-obtained winograd inverse transformation results of the corresponding second meta-sub-tensors and the non-zero element values of the second sub-tensors.
  • According to the arithmetic device of the present disclosure, decomposing the multiplication operation into a summation operation can save calculation time and reduce energy consumption.
  • the slave function unit may include an element-wise multiplication module and an inverse transformation module, wherein the element-wise multiplication module performs the element-wise multiplication described above to obtain the element-wise multiplication result.
  • the element-wise multiplication result can be sent to the inverse transformation module, which disassembles the winograd inverse transformation of the element-wise multiplication result into a summation operation and performs the calculation to obtain the winograd convolution result of the input data.
  • the slave function unit is further configured to perform post-processing on the winograd convolution result of the input data; the post-processing includes a bitwise rounding operation and a revolution (data rearrangement) operation.
  • the rounding operation may round the winograd convolution result of the input data according to a set number of rounding bits.
  • the revolution operation may refer to rearranging the layout of the winograd convolution result of the input data.
  • the arrangement of the winograd convolution result of the input data can be changed according to storage requirements. Post-processing the winograd convolution results of the input data is more conducive to subsequent operations and calculations.
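The two post-processing steps can be illustrated with a short sketch. The source specifies neither the rounding mode nor the target layout, so everything here is an assumption: integer results are rounded with an add-and-shift scheme (ties toward positive infinity), and the rearrangement is shown as a channel-first (CHW) to channel-last (HWC) conversion. The function names `round_shift` and `chw_to_hwc` are hypothetical.

```python
def round_shift(value, bits):
    """Drop `bits` low-order bits of an integer, rounding to nearest.

    Assumed scheme: add half of the dropped range, then shift right.
    """
    if bits == 0:
        return value
    return (value + (1 << (bits - 1))) >> bits


def chw_to_hwc(t):
    """Rearrange a [C][H][W] nested list into [H][W][C] (assumed layouts)."""
    channels, height, width = len(t), len(t[0]), len(t[0][0])
    return [[[t[k][i][j] for k in range(channels)]
             for j in range(width)]
            for i in range(height)]


def post_process(result_chw, rounding_bits):
    """Round every element, then change the storage arrangement."""
    rounded = [[[round_shift(v, rounding_bits) for v in row]
                for row in plane] for plane in result_chw]
    return chw_to_hwc(rounded)
```

Which layout is "more conducive to subsequent operations" depends on the consumer of the data, which is why the sketch keeps rounding and rearrangement as separate functions.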
  • the slave function unit is further configured to send the winograd convolution result of the input data to the main memory unit as the input neuron of the next layer of convolution operation.
  • the computing device can be applied to a processor, which can be a general-purpose processor such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations.
  • Artificial intelligence operations can include machine learning operations, brain-like operations, and so on. Among them, machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on.
  • the artificial intelligence processor may, for example, include one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and a Field-Programmable Gate Array (FPGA) chip.
  • the processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution computing tasks, pooling tasks, or fully connected tasks.
  • the present disclosure does not limit the processing unit and the tasks run by the processing unit.
  • Fig. 5 shows a schematic diagram of a processor according to an embodiment of the present disclosure.
  • the processor is used to perform machine learning calculations.
  • the processor includes a controller unit 141 and an arithmetic unit 142.
  • the controller unit 141 is connected to the arithmetic unit 142.
  • the arithmetic unit 142 includes a main processing circuit and multiple slave processing circuits;
  • the controller unit 141 is configured to obtain input data and calculation instructions; the calculation instructions obtained by the controller unit 141 may be one or more operators in the first fusion set after the operators are fused by the first processor.
  • the main processing circuit and the multiple slave processing circuits may be arranged in a tree structure, an H-type structure, or a systolic array structure.
  • the present disclosure does not limit the connection between the main processing circuit and the slave processing circuit.
  • the input data and calculation instructions may be obtained through a data input and output unit, and the data input and output unit may specifically be one or more data I/O interfaces or I/O pins.
  • the above calculation instructions include, but are not limited to, forward operation instructions, reverse training instructions, or other neural network operation instructions such as convolution operation instructions; they may also be the “WINO_CONV” instruction mentioned above.
  • the specific implementations of this application do not limit the specific form of the foregoing calculation instruction.
  • the controller unit 141 is further configured to parse the calculation instruction to obtain multiple operation instructions, and send the multiple operation instructions and the input data to the main processing circuit;
  • the main processing circuit 101 is configured to perform pre-processing on the input data and transmit data and operation instructions with the plurality of slave processing circuits;
  • a plurality of slave processing circuits 102 are configured to perform intermediate operations in parallel to obtain a plurality of intermediate results according to the data and operation instructions transmitted from the main processing circuit, and transmit the plurality of intermediate results to the main processing circuit;
  • the main processing circuit 101 is configured to perform subsequent processing on the multiple intermediate results to obtain the calculation result of the calculation instruction.
  • the technical solution provided by this application arranges the arithmetic unit in a one-master, multi-slave structure. For the calculation instructions of the forward operation, it can split the data according to those instructions, so that the multiple slave processing circuits perform the computation-heavy part in parallel, thereby increasing the operation speed, saving operation time, and in turn reducing power consumption.
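The one-master, multi-slave flow above (pre-processing and splitting in the main circuit, parallel intermediate operations in the slaves, subsequent processing in the main circuit) can be sketched in software. This is an analogy under stated assumptions, not the circuit itself: a dot product stands in for the real workload, threads stand in for slave processing circuits, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def master_split(values, num_slaves):
    """Main-circuit pre-processing: cut the data into one chunk per slave."""
    chunk = max(1, (len(values) + num_slaves - 1) // num_slaves)
    return [values[i:i + chunk] for i in range(0, len(values), chunk)]


def slave_intermediate(neurons, weights):
    """Slave-circuit work: a partial dot product (the intermediate result)."""
    return sum(n * w for n, w in zip(neurons, weights))


def master_combine(partials):
    """Main-circuit subsequent processing: merge the intermediate results."""
    return sum(partials)


def dot_product_master_slave(neurons, weights, num_slaves=4):
    neuron_chunks = master_split(neurons, num_slaves)
    weight_chunks = master_split(weights, num_slaves)
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_intermediate,
                                 neuron_chunks, weight_chunks))
    return master_combine(partials)
```

The split is by contiguous chunks only for simplicity; the source leaves the splitting strategy to the calculation instruction.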
  • the above-mentioned machine learning calculation may specifically include: artificial neural network operations
  • the above-mentioned input data may specifically include: input neuron data and weight data.
  • the above calculation result may specifically be the result of the artificial neural network operation, that is, the output neuron data.
  • the operation in the neural network can be the operation of one layer in the neural network.
  • for a multi-layer neural network, the process is as follows. In the forward operation, after the operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer takes the output neuron calculated in the arithmetic unit as the input neuron of the next layer (or performs some operations on the output neuron and then uses it as the input neuron of the next layer), and the weight is likewise replaced with the weight of the next layer. In the reverse operation, after the reverse operation of the previous layer is completed, the operation instruction of the next layer takes the input neuron gradient calculated in the arithmetic unit as the output neuron gradient of the next layer (or performs some operation on the input neuron gradient and then uses it as the output neuron gradient of the next layer), and the weights are likewise replaced with the weights of the next layer.
  • the above-mentioned machine learning calculations may also include support vector machine calculations, k-nearest neighbor (k-nn) calculations, k-means calculations, principal component analysis calculations, and so on.
  • the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the neurons in the output layer of the entire neural network.
  • the neuron in the lower layer of the network forward operation is the input neuron
  • the neuron in the upper layer of the network forward operation is the output neuron.
  • the foregoing processor may further include: the storage unit 140 and the direct memory access unit 50.
  • the storage unit 140 may include one or any combination of a register and a cache. Specifically, the cache is used to store the calculation instruction; the register is used to store the input data and scalars; the cache is a high-speed scratchpad cache.
  • the direct memory access unit 50 is used to read or store data from the storage unit 10.
  • the controller unit includes: an instruction storage unit 410, an instruction processing unit 411, and a storage queue unit 413;
  • the instruction storage unit 410 is configured to store calculation instructions associated with the artificial neural network operation
  • the instruction processing unit 411 is configured to parse the calculation instructions to obtain multiple operation instructions
  • the storage queue unit 413 is configured to store an instruction queue, and the instruction queue includes: a plurality of operation instructions or calculation instructions to be executed in the sequence of the queue.
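The three parts of the controller unit can be pictured with a small software model. This is a hedged sketch only: the decode table that maps the “WINO_CONV” calculation instruction to three operation instructions is hypothetical (the source does not list the operation instructions), as are the class and method names.

```python
from collections import deque

# Hypothetical decode table; the operation-instruction names are invented
# purely for illustration.
DECODE_TABLE = {
    "WINO_CONV": ["FORWARD_TRANSFORM", "ELEMENTWISE_MULTIPLY",
                  "INVERSE_TRANSFORM"],
}


class ControllerUnitModel:
    def __init__(self):
        self.instruction_storage = []  # models the instruction storage unit
        self.instruction_queue = deque()  # models the storage queue unit

    def store(self, calculation_instruction):
        self.instruction_storage.append(calculation_instruction)

    def parse_all(self):
        """Models the instruction processing unit: one calculation
        instruction becomes one or more operation instructions."""
        for calc in self.instruction_storage:
            for op in DECODE_TABLE.get(calc, [calc]):
                self.instruction_queue.append(op)
        self.instruction_storage.clear()

    def issue(self):
        """Return the next instruction in queue order, or None if empty."""
        return self.instruction_queue.popleft() if self.instruction_queue else None
```

Instructions without an entry in the table are passed through unchanged, which mirrors the statement that the queue may hold either operation instructions or calculation instructions.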
  • the main arithmetic processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically configured to decode instructions into micro instructions.
  • the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically configured to receive and process micro instructions.
  • the above-mentioned micro-instructions may be the next-level instructions of the instructions.
  • the micro-instructions can be obtained by splitting or decoding the instructions, and can be further decoded into control signals of various components, units or processing circuits.
  • the structure of the calculation instruction may be as shown in the following table.
  • the ellipsis in the above table indicates that multiple registers or immediate data can be included.
  • the calculation instruction may include: one or more operation domains and an operation code.
  • the calculation instruction may include a neural network operation instruction. Taking the neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 can be operation domains, and each of them may be the number of one or more registers.
  • the above-mentioned register can be an off-chip memory. Of course, in practical applications, it can also be an on-chip memory for storing data.
  • controller unit may further include:
  • the dependency processing unit 412 is configured to, when there are multiple operation instructions, determine whether a first operation instruction has an association relationship with a zeroth operation instruction that precedes it; if the first operation instruction has an association relationship with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has been executed, the first operation instruction is fetched from the instruction storage unit and transmitted to the arithmetic unit;
  • determining whether there is an association relationship between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
  • extracting, according to the first operation instruction, a first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval overlap, it is determined that the first operation instruction and the zeroth operation instruction have an association relationship; if they do not overlap, it is determined that the first operation instruction and the zeroth operation instruction do not have an association relationship.
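The association check reduces to an interval-overlap test on storage addresses. The sketch below assumes closed intervals given as `(start, end)` pairs; the function names and the string return values are hypothetical and only illustrate the decision.

```python
def has_association(first_interval, zeroth_interval):
    """True if the two closed address intervals share at least one address."""
    first_start, first_end = first_interval
    zeroth_start, zeroth_end = zeroth_interval
    return first_start <= zeroth_end and zeroth_start <= first_end


def dispatch_decision(first_interval, zeroth_interval, zeroth_done):
    """Decide what to do with the first operation instruction.

    Returns "cache" while a dependent zeroth instruction is still running,
    and "issue" once it is safe to send the instruction to the arithmetic unit.
    """
    if has_association(first_interval, zeroth_interval) and not zeroth_done:
        return "cache"
    return "issue"
```

For example, intervals `(0x100, 0x1FF)` and `(0x180, 0x27F)` overlap, so the later instruction is held until the earlier one finishes, whereas `(0x100, 0x1FF)` and `(0x200, 0x2FF)` do not overlap and the later instruction can be issued immediately.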
  • although the steps in the flowchart are displayed in sequence according to the directions of the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless specifically stated herein, there is no strict order for the execution of these steps, and they can be executed in other orders. Moreover, at least part of the steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times; nor are they necessarily executed sequentially, but may be executed in turn or alternately with at least a part of other steps, or of the sub-steps or stages of other steps.
  • the foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, or each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory RRAM (Resistive Random Access Memory), dynamic random access memory DRAM (Dynamic Random Access Memory), static random access memory SRAM (Static Random-Access Memory), enhanced dynamic random access memory EDRAM (Enhanced Dynamic Random Access Memory), high-bandwidth memory HBM (High-Bandwidth Memory), hybrid memory cube HMC (Hybrid Memory Cube), and so on.
  • if the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned memory includes media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
  • an artificial intelligence chip is also disclosed, which includes the aforementioned computing device.
  • a board card is also disclosed, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip; the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
  • Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure.
  • the board card may include other supporting components in addition to the chip 389 described above.
  • the supporting components include, but are not limited to, a storage device 390, an interface device 391, and a control device 392;
  • the storage device 390 is connected to the artificial intelligence chip through a bus for storing data.
  • the storage device may include multiple groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
  • the storage device may include 4 groups of the storage units. Each group of storage units may include a plurality of DDR4 chips.
  • the artificial intelligence chip may include four 72-bit DDR4 controllers. In each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC verification. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data transmission bandwidth can reach 25600 MB/s.
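The 25600 MB/s figure follows from the transfer rate and the data width. As a quick check, assuming DDR4-3200 means 3200 mega-transfers per second and that only the 64 data bits (not the 8 ECC bits) carry payload, the arithmetic can be written out as below; the function name is hypothetical.

```python
def ddr_peak_bandwidth_mb_per_s(mega_transfers_per_s, controller_bits, ecc_bits):
    """Theoretical peak payload bandwidth of one DDR controller in MB/s."""
    data_bits = controller_bits - ecc_bits  # ECC bits carry no payload
    bytes_per_transfer = data_bits // 8
    return mega_transfers_per_s * bytes_per_transfer


# One 72-bit controller with 8 ECC bits at 3200 MT/s:
# 3200 * (72 - 8) / 8 = 3200 * 8 = 25600 MB/s
per_group = ddr_peak_bandwidth_mb_per_s(3200, 72, 8)
```

This is a theoretical upper bound per group of storage units; sustained bandwidth in practice is lower.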
  • each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • the interface device may also be other interfaces. The present disclosure does not limit the specific manifestations of the other interfaces mentioned above, as long as the interface unit can realize the switching function.
  • the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load.
  • the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
  • an electronic device which includes the aforementioned artificial intelligence chip.
  • electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, means of transportation, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.
  • the embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides an electronic device, including: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to call the instructions stored in the memory to execute the above method.
  • Clause A1 An arithmetic device comprising: a master instruction processing unit, a master functional unit, a slave instruction processing unit, and a slave functional unit,
  • the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction;
  • the master functional unit is configured to disassemble the winograd forward transformation of the input data into a summation operation according to the first control signal, perform the calculation to obtain the winograd forward transformation result of the input data, and send the winograd forward transformation result of the input data to the slave functional unit,
  • wherein the winograd forward transformation result of the input data includes the winograd forward transformation result of the input neuron;
  • the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit;
  • the slave functional unit is configured to perform element-wise multiplication on the winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight according to the second control signal to obtain an element-wise multiplication result, disassemble the winograd inverse transformation of the element-wise multiplication result into a summation operation, and perform the calculation to obtain the winograd convolution result of the input data.
  • Clause A2 The computing device according to clause A1, wherein the main function unit is used to disassemble the input data into a plurality of first sub-tensors according to the first control signal, perform winograd forward transformation on the plurality of first sub-tensors, and sum the results to obtain the winograd forward transformation result of the input data.
  • Clause A3 The computing device according to clause A2, wherein the input data is expressed in the form of a tensor, and the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the input data, One element in each first sub-tensor of the plurality of first sub-tensors is the same as the element at the corresponding position in the input data, and other elements are all zero.
  • Clause A4 The computing device according to clause A3, wherein performing winograd forward transformation on the plurality of first sub-tensors and summing them to obtain the winograd forward transformation result of the input data includes:
  • obtaining the winograd forward transformation result of the first-element sub-tensor corresponding to each first sub-tensor, wherein the first-element sub-tensor corresponding to a first sub-tensor is a tensor in which the value of the element at a first position is 1, the first position in the first-element sub-tensor being the same as the position of the non-zero element in the first sub-tensor;
  • adding the winograd forward transformation results of the plurality of first sub-tensors to obtain the winograd forward transformation result of the input data.
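Clauses A2 to A4 can be illustrated by splitting an input tile into first sub-tensors and summing their transforms. This is a sketch under assumptions: the input-transform matrix `B_T` is the commonly used F(2x2, 3x3) one rather than a matrix quoted from the source, and the helper names are hypothetical. Because every entry of a transformed first-element sub-tensor is 0 or ±1 for this matrix, the `value * unit[i][j]` term below stands for an add, a subtract, or a skip.

```python
# Assumption: standard F(2x2, 3x3) input-transform matrix.
B_T = [[1, 0, -1, 0],
       [0, 1, 1, 0],
       [0, -1, 1, 0],
       [0, 1, 0, -1]]


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def transpose(x):
    return [list(row) for row in zip(*x)]


def split_into_first_sub_tensors(d):
    """One sub-tensor per non-zero element; all other entries are zero."""
    subs = []
    for r, row in enumerate(d):
        for c, value in enumerate(row):
            if value != 0:
                sub = [[0] * len(row) for _ in d]
                sub[r][c] = value
                subs.append((r, c, value, sub))
    return subs


def forward_of_first_element_sub_tensor(r, c):
    """B_T * E_rc * B, where E_rc has a single 1 at (r, c).
    In hardware this result would be obtained in advance."""
    unit = [[0] * 4 for _ in range(4)]
    unit[r][c] = 1
    return matmul(matmul(B_T, unit), transpose(B_T))


def forward_by_summation(d):
    result = [[0] * 4 for _ in range(4)]
    for r, c, value, _sub in split_into_first_sub_tensors(d):
        unit = forward_of_first_element_sub_tensor(r, c)
        for i in range(4):
            for j in range(4):
                result[i][j] += value * unit[i][j]  # unit[i][j] is 0 or +/-1
    return result
```

The number of first sub-tensors equals the number of non-zero elements of the tile, so sparse tiles need proportionally fewer additions.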
  • Clause A6 The computing device according to clause A2, wherein the main functional unit includes a cache module, the main functional unit stores the winograd forward transformation result of the input data in the cache module, and the cache module is also used to send the winograd forward transformation result of the input data to the slave functional unit.
  • Clause A7 The computing device according to clause A1, the computing device comprising: a main memory unit,
  • the main instruction processing unit is configured to send a first control signal to the main memory unit according to the input instruction after receiving the input instruction;
  • the main memory unit is configured to send the input data to the main function unit according to the first control signal.
  • Clause A9 The arithmetic device according to clause A8, wherein the number of the plurality of second sub-tensors is the same as the number of non-zero elements of the element-wise multiplication result, and one element of each second sub-tensor of the plurality of second sub-tensors is the same as the element at the corresponding position in the element-wise multiplication result, while the other elements are all 0.
  • Clause A10 The computing device according to clause A9, wherein performing winograd inverse transformation on the plurality of second sub-tensors and summing them to obtain the winograd convolution result of the input data includes:
  • wherein the second-element sub-tensor corresponding to a second sub-tensor is a tensor in which the value of the element at a second position is 1, the second position in the second-element sub-tensor being the same as the position of the non-zero element in the second sub-tensor;
  • the winograd inverse transform results of the multiple second subtensors are added to obtain the winograd convolution result of the input data.
  • the slave instruction processing unit is also used to send the second control signal to the slave memory unit;
  • the slave memory unit is configured to send the winograd forward transformation result of the weight to the slave functional unit according to the second control signal.
  • Clause A13 The computing device according to any one of clauses A7-A12, wherein the computing device further includes a main memory unit, and the slave function unit is further configured to send the winograd convolution result of the input data to the main memory unit.
  • Clause A14 The arithmetic device according to any one of clauses A7-A12, wherein the slave function unit is also used to perform post-processing on the winograd convolution result of the input data, and the post-processing includes a bitwise rounding operation and a revolution operation.
  • Clause A15 An electronic device including the artificial intelligence chip as described in Clause A14.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Neurology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Image Processing (AREA)
  • Complex Calculations (AREA)

Abstract

An operational apparatus and a related product. The product comprises a control module; the control module comprises: an instruction cache unit, an instruction processing unit and a storage queue unit; the instruction cache unit is used for storing a calculation instruction associated with an artificial neural network operation; the instruction processing unit is used for parsing the calculation instruction to obtain multiple operation instructions; and the storage queue unit is used for storing an instruction queue, the instruction queue comprising multiple operation instructions or calculation instructions to be executed according to the sequence of the queue. The present application can improve the operation efficiency of a related product when carrying out an operation of a neural network model.

Description

Computing device and related products
The present disclosure claims priority to Chinese patent application No. 201911060683.9, filed on November 1, 2019, the entire content of which is incorporated herein by reference.
Technical field
The present disclosure relates to the field of data processing technology, and in particular to a computing device and related products.
Background
In the field of artificial intelligence, neural network algorithms are a recently very popular class of machine learning algorithms and have achieved very good results in various fields, such as image recognition, speech recognition, and natural language processing. As neural network algorithms have developed, their complexity has grown, and model sizes have gradually increased in order to improve recognition accuracy. Processing these large-scale models with GPUs and CPUs takes a great deal of computing time and consumes a great deal of power.
Summary of the invention
In view of this, it is necessary to address the above technical problems by providing a computing device and related products that can reduce the amount of calculation, save calculation time, and save energy.
According to an aspect of the present disclosure, there is provided an arithmetic device including: a master instruction processing unit, a master functional unit, a slave instruction processing unit, and a slave functional unit,
wherein the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction;
the master functional unit is configured to disassemble the winograd forward transformation of the input data into a summation operation according to the first control signal, perform the calculation to obtain the winograd forward transformation result of the input data, and send the winograd forward transformation result of the input data to the slave functional unit,
wherein the winograd forward transformation result of the input data includes the winograd forward transformation result of the input neuron;
the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit;
the slave functional unit is configured to perform element-wise multiplication on the winograd forward transformation result of the input neuron and the winograd forward transformation result of the weight according to the second control signal to obtain an element-wise multiplication result, disassemble the winograd inverse transformation of the element-wise multiplication result into a summation operation, and perform the calculation to obtain the winograd convolution result of the input data.
According to another aspect of the present disclosure, an artificial intelligence chip is provided, the chip including the arithmetic device described above.
According to another aspect of the present disclosure, an electronic device is provided, characterized in that the electronic device includes the artificial intelligence chip described above.
According to the arithmetic device of the present disclosure, the winograd forward transformation of the input data is decomposed into summation operations and computed to obtain the winograd forward transformation result of the input data, which includes the winograd forward transformation result of the input neurons. The winograd forward transformation result of the input neurons and the winograd forward transformation result of the weights are then multiplied element-wise to obtain an element-wise multiplication result, and the winograd inverse transformation of the element-wise multiplication result is decomposed into summation operations and computed to obtain the winograd convolution result of the input data. By decomposing multiplication operations into summation operations, the arithmetic device of the present disclosure can save computing time and reduce energy consumption.
Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the present disclosure together with the specification, and serve to explain the principles of the present disclosure.
Fig. 1 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure;
Fig. 2 shows a flowchart of computing the winograd forward transformation result of input data from first sub-tensors according to an embodiment of the present disclosure;
Fig. 3 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure;
Fig. 4 shows a flowchart of performing the winograd inverse transformation on an element-wise multiplication result according to an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of a processor according to an embodiment of the present disclosure;
Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure.
Detailed description
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art on the basis of the embodiments in the present disclosure without creative effort fall within the scope of protection of the present disclosure.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, specification, and drawings of the present disclosure are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the specification and claims of the present disclosure indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in this specification and the claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as "once it is determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
Winograd convolution is a convolution acceleration method based on a polynomial interpolation algorithm. It splits the two inputs of the convolution operation, the neurons and the weights, into tiles of a certain size and applies a linear transformation (the winograd forward transformation) to each. It then multiplies the transformed neurons and weights element-wise, and finally applies another linear transformation (the winograd inverse transformation) to the element-wise multiplication result to obtain a convolution result equivalent to that of the original convolution operation.
The expressions of the winograd transformation are as follows:
For one-dimensional neurons and weights:
$$S = A^T\left((G g) \odot (B^T d)\right)$$
For two-dimensional neurons and weights:
$$S = A^T\left((G g G^T) \odot (B^T d B)\right) A$$
where $g$ denotes the weights, $G$ denotes the left-multiply forward transformation matrix for the weights, $G^T$ denotes the right-multiply forward transformation matrix for the weights, $d$ denotes the input neurons, $B$ denotes the right-multiply forward transformation matrix for the input neurons, $B^T$ denotes the left-multiply forward transformation matrix for the input neurons, $\odot$ denotes element-wise multiplication, $A$ denotes the right-multiply inverse transformation matrix, and $A^T$ denotes the left-multiply inverse transformation matrix. For input neurons of different dimensions there are corresponding $B$ and $B^T$; likewise, for weights of different dimensions there are corresponding $G$ and $G^T$.
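To make the one-dimensional case concrete, the following is a minimal sketch that checks a winograd convolution against a direct sliding-window computation. It assumes the expression has the standard form $A^T((Gg) \odot (B^T d))$ and uses the commonly published F(2,3) matrices; neither the matrix values nor the function names `winograd_1d` and `direct_1d` come from the patent text.

```python
from fractions import Fraction

# Commonly used F(2,3) transformation matrices (an assumption; the patent
# text does not list concrete matrix values).
BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[Fraction(1), Fraction(0), Fraction(0)],
     [Fraction(1, 2), Fraction(1, 2), Fraction(1, 2)],
     [Fraction(1, 2), Fraction(-1, 2), Fraction(1, 2)],
     [Fraction(0), Fraction(0), Fraction(1)]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matvec(m, v):
    """Multiply matrix m by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def winograd_1d(d, g):
    """Hypothetical helper: 2 outputs from 4 inputs and a 3-tap filter."""
    u = matvec(G, g)                    # forward transformation of the weights
    v = matvec(BT, d)                   # forward transformation of the neurons
    m = [a * b for a, b in zip(u, v)]   # element-wise multiplication (4 products)
    return matvec(AT, m)                # inverse transformation


def direct_1d(d, g):
    """Reference sliding-window computation (6 products)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


d = [1, 2, 3, 4]
g = [5, 6, 7]
assert winograd_1d(d, g) == direct_1d(d, g)
```

`Fraction` is used only so that the comparison is exact; the element-wise stage needs 4 multiplications where the direct form needs 6.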
Replacing the original convolution operation with winograd convolution can bring considerable gains in hardware energy efficiency and computing time, and can also achieve higher neural network performance with little or no additional hardware overhead. However, winograd convolution still has a clear drawback: its large number of multiplication operations still consumes considerable computing time.
To solve the above technical problem, the present disclosure provides an arithmetic device that can decompose the multiplication operations in the winograd convolution process into addition operations, thereby saving computing time and reducing energy consumption.
Fig. 1 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure. As shown in Fig. 1, the arithmetic device provided by the present disclosure may include a master instruction processing unit, a master functional unit, and a master memory unit, as well as a slave instruction processing unit, a slave functional unit, and a slave memory unit.
The master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction;
the master functional unit is configured to decompose the winograd forward transformation of input data into summation operations according to the first control signal, perform the computation to obtain the winograd forward transformation result of the input data, and send the winograd forward transformation result of the input data to the slave functional unit,
wherein the winograd forward transformation result of the input data includes the winograd forward transformation result of input neurons;
the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit;
the slave functional unit is configured to perform, according to the second control signal, element-wise multiplication of the winograd forward transformation result of the input neurons and the winograd forward transformation result of the weights to obtain an element-wise multiplication result, decompose the winograd inverse transformation of the element-wise multiplication result into summation operations, and perform the computation to obtain the winograd convolution result of the input data.
According to the arithmetic device of the present disclosure, the winograd forward transformation of the input data is decomposed into summation operations and computed to obtain the winograd forward transformation result of the input data, which includes the winograd forward transformation result of the input neurons. The winograd forward transformation result of the input neurons and the winograd forward transformation result of the weights are then multiplied element-wise to obtain an element-wise multiplication result, and the winograd inverse transformation of the element-wise multiplication result is decomposed into summation operations and computed to obtain the winograd convolution result of the input data. By decomposing multiplication operations into summation operations, the arithmetic device of the present disclosure can save computing time and reduce energy consumption.
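The division of work between the master and slave functional units can be sketched in software as follows. This is only an illustration under stated assumptions: it uses the commonly published F(2×2, 3×3) matrices, plain matrix multiplication rather than the summation decomposition described later, and hypothetical function names (`master_forward`, `weight_forward`, `slave_multiply_and_inverse`) that do not appear in the patent.

```python
from fractions import Fraction as F

# Commonly used F(2x2, 3x3) matrices (an assumption, not from the patent text).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[F(1), F(0), F(0)],
     [F(1, 2), F(1, 2), F(1, 2)],
     [F(1, 2), F(-1, 2), F(1, 2)],
     [F(0), F(0), F(1)]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def master_forward(d):
    """Master side: forward transformation of a 4x4 neuron tile, B^T d B."""
    return matmul(matmul(BT, d), transpose(BT))


def weight_forward(g):
    """Forward transformation of a 3x3 weight tile, G g G^T (may be precomputed)."""
    return matmul(matmul(G, g), transpose(G))


def slave_multiply_and_inverse(v, u):
    """Slave side: element-wise multiplication, then inverse transformation A^T m A."""
    m = [[x * y for x, y in zip(rv, ru)] for rv, ru in zip(v, u)]
    return matmul(matmul(AT, m), transpose(AT))


def direct_conv(d, g):
    """Reference 2x2 output of sliding a 3x3 filter over a 4x4 tile."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]


d = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
g = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
y = slave_multiply_and_inverse(master_forward(d), weight_forward(g))
assert y == direct_conv(d, g)
```

The assertion confirms that the two-stage pipeline reproduces the direct result for this tile.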
The input instruction may be a "WINO_CONV" instruction, which covers the winograd forward transformation, element-wise multiplication, and winograd inverse transformation. The input instruction may carry information about the operation and the operands corresponding to the instruction; the operand information may include the address information of the operands, the sizes of the operands, and so on. For example, the operand information may include the address information of the input neurons, the address information of the weights, and the address at which the resulting output operand (the input operand, that is, the input neurons, of the next layer) is to be stored.
For any layer of a neural network, the input data of the layer may include input neurons, and the input neurons of the layer may be the output of the previous layer. For the first layer of the neural network, the input neurons may be the initial input data, which may be image data, sound data, video data, or the like.
The winograd forward transformation, element-wise multiplication, and winograd inverse transformation are described in turn below.
Winograd forward transformation
The master instruction processing unit is further configured to, after receiving the input instruction, send the first control signal to the master memory unit according to the input instruction; the master memory unit is configured to send the input data to the master functional unit according to the first control signal.
In one possible implementation, after receiving the input instruction, the master instruction processing unit may decode and parse it to obtain the operation, the address information of the operands, and so on. The master instruction processing unit may then send the first control signal to the master memory unit and the master functional unit according to the parsed information.
After receiving the first control signal, the master memory unit may obtain the input data according to it. The input data may include input neurons and may be represented as a tensor. Taking image data as an example, the input data may be expressed in NHWC (batch, height, width, channels) form, where N denotes the number of images, H and W denote the number of pixels in the height and width directions of the image respectively, and C denotes the number of channels; for example, C may represent the three RGB (red, green, blue) channels. It should be noted that this representation is only an example of the present disclosure, and the present disclosure is not limited to it.
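As a small illustration of the NHWC representation, the sketch below maps an (n, h, w, c) index to a position in a flat buffer. It assumes a contiguous row-major layout, which the patent does not specify, and `nhwc_offset` is a hypothetical helper name.

```python
def nhwc_offset(n, h, w, c, H, W, C):
    """Flat offset of element (n, h, w, c) in a contiguous NHWC buffer.

    Assumes row-major storage with the channel index varying fastest;
    the patent does not state the memory layout.
    """
    return ((n * H + h) * W + w) * C + c


# Two 4x4 RGB images: N=2, H=4, W=4, C=3.
H, W, C = 4, 4, 3
assert nhwc_offset(0, 0, 0, 0, H, W, C) == 0
assert nhwc_offset(0, 0, 1, 0, H, W, C) == 3    # next pixel in the same row
assert nhwc_offset(0, 1, 0, 0, H, W, C) == 12   # next row
assert nhwc_offset(1, 0, 0, 0, H, W, C) == 48   # next image
```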
After obtaining the input data, the master memory unit may send it to the master functional unit.
After receiving the first control signal and the input data, the master functional unit may operate on the input data according to the operation indicated in the first control signal, decomposing the winograd forward transformation of the input data into summation operations and performing the computation to obtain the winograd forward transformation result of the input data.
Specifically, the master functional unit is configured to decompose the input data into a plurality of first sub-tensors according to the first control signal, perform the winograd forward transformation on the plurality of first sub-tensors, and sum the results to obtain the winograd forward transformation result of the input data.
In one possible implementation, the number of first sub-tensors equals the number of elements of the input data. Each of the first sub-tensors has one element identical to the element at the corresponding position of the input data, and all its other elements are 0.
Alternatively, in another possible implementation, the number of first sub-tensors equals the number of non-zero elements of the input data. Each of the first sub-tensors has one element identical to the element at the corresponding position of the input data, and all its other elements are 0.
For example, suppose the input neurons are represented as:
Figure PCTCN2020114057-appb-000004
The input neurons form a 4×4 matrix with 16 elements, so the input data can be decomposed into 16 first sub-tensors.
Under the decomposition of the present disclosure, the 16 first sub-tensors are:
Figure PCTCN2020114057-appb-000005
That each first sub-tensor has one element identical to the element at the corresponding position of the input data, with all other elements 0, means the following: taking the first sub-tensor $d_{00}$ as an example, its element in the first row and first column is the same as the element of the input neurons in the first row and first column, and all its other elements are 0. The other first sub-tensors have the same property.
It should be noted that the above decompositions are only some examples of the present disclosure and do not limit it in any way. For example, if the input data contains elements whose value is 0, the number of first sub-tensors obtained may be smaller than the number of elements of the input data; for instance, the number of first sub-tensors may equal the number of non-zero elements of the input data.
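The decomposition into first sub-tensors can be sketched as follows. The function name `split_into_sub_tensors` and the `skip_zeros` flag are illustrative assumptions; the flag selects between the two implementations described above (one sub-tensor per element, or one per non-zero element).

```python
def split_into_sub_tensors(d, skip_zeros=False):
    """Split matrix d into sub-tensors that each keep a single element of d.

    With skip_zeros=True, zero-valued elements produce no sub-tensor.
    """
    rows, cols = len(d), len(d[0])
    subs = []
    for i in range(rows):
        for j in range(cols):
            if skip_zeros and d[i][j] == 0:
                continue
            sub = [[0] * cols for _ in range(rows)]
            sub[i][j] = d[i][j]      # the only element copied from d
            subs.append(sub)
    return subs


d = [[1, 2, 0, 4],
     [5, 0, 7, 8],
     [9, 10, 11, 0],
     [13, 14, 15, 16]]
all_subs = split_into_sub_tensors(d)
nonzero_subs = split_into_sub_tensors(d, skip_zeros=True)

# Summing the sub-tensors element by element recovers the original matrix.
total = [[sum(s[i][j] for s in nonzero_subs) for j in range(4)] for i in range(4)]
assert total == d
```

Here `all_subs` holds 16 sub-tensors and `nonzero_subs` holds 13, since three elements of `d` are 0.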
Fig. 2 shows a flowchart of computing the winograd forward transformation result of input data from first sub-tensors according to an embodiment of the present disclosure. As shown in Fig. 2, in one possible implementation, performing the winograd forward transformation on the plurality of first sub-tensors and summing the results to obtain the winograd forward transformation result of the input data may include the following steps:
Step S21: obtain the winograd forward transformation result of the first meta-sub-tensor corresponding to each first sub-tensor, where the first meta-sub-tensor corresponding to a first sub-tensor is one in which the element at a first position has the value 1, the first position in the first meta-sub-tensor being the same as the position of the non-zero element in the first sub-tensor;
Step S22: multiply the winograd forward transformation result of the corresponding first meta-sub-tensor by the non-zero element value of the first sub-tensor, used as a coefficient, to obtain the winograd forward transformation result of the first sub-tensor;
Step S23: add the winograd forward transformation results of the plurality of first sub-tensors to obtain the winograd forward transformation result of the input data.
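Steps S21 to S23 can be sketched as below. The sketch assumes the commonly published 4×4 matrix $B^T$ for F(2×2, 3×3), whose entries are 0 or ±1; the table name `META` and the function names are hypothetical. Because every table entry is 0 or ±1, scaling by the coefficient and accumulating reduce to additions and subtractions.

```python
# Commonly used B^T for F(2x2, 3x3) (an assumption, not from the patent text).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def meta_transform(i, j, n=4):
    """B^T E B for the meta-sub-tensor E that has a 1 at (i, j) and 0 elsewhere."""
    e = [[0] * n for _ in range(n)]
    e[i][j] = 1
    return matmul(matmul(BT, e), transpose(BT))


# Step S21: table of precomputed results, one per element position.
META = {(i, j): meta_transform(i, j) for i in range(4) for j in range(4)}


def forward_by_summation(d):
    """Forward transformation of a 4x4 tile using only additions and subtractions."""
    out = [[0] * 4 for _ in range(4)]
    for (i, j), t in META.items():
        x = d[i][j]
        if x == 0:
            continue                      # zero elements contribute nothing
        for r in range(4):
            for c in range(4):
                # Steps S22 and S23: a table entry of +1 or -1 means the
                # coefficient x is added to or subtracted from the accumulator.
                if t[r][c] == 1:
                    out[r][c] += x
                elif t[r][c] == -1:
                    out[r][c] -= x
    return out


d = [[1, 2, 0, 4], [5, 6, 7, 8], [9, 0, 11, 12], [13, 14, 15, 16]]
assert forward_by_summation(d) == matmul(matmul(BT, d), transpose(BT))
```

The final assertion compares the summation-only result with the plain matrix product $B^T d B$.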
For step S21, again taking the first sub-tensor $d_{00}$ as an example, the first meta-sub-tensor corresponding to $d_{00}$ may be
Figure PCTCN2020114057-appb-000006
That is, the first meta-sub-tensor is obtained by factoring out the value of the non-zero element of the first sub-tensor, and that value can serve as the coefficient of the first meta-sub-tensor.
The winograd forward transformation result of the first meta-sub-tensor corresponding to a first sub-tensor may be obtained in advance through the following process: for each first sub-tensor, left-multiply the corresponding first meta-sub-tensor by the forward-transformation left-multiply matrix and right-multiply it by the forward-transformation right-multiply matrix to obtain the winograd forward transformation result of the first meta-sub-tensor.
For a matrix of any given size, the form of the corresponding first meta-sub-tensors is fixed, and so are the corresponding forward-transformation left-multiply and right-multiply matrices.
Therefore, the winograd forward transformation result of a first meta-sub-tensor can be computed in advance, as described above. For example, again taking $d_{00}$, the winograd forward transformation result of its corresponding first meta-sub-tensor is:
Figure PCTCN2020114057-appb-000007
As another example, taking $d_{01}$, the winograd forward transformation result of its corresponding first meta-sub-tensor is:
Figure PCTCN2020114057-appb-000008
Because the elements of the forward-transformation left-multiply and right-multiply matrices are all 0 or ±1 and the elements of a first meta-sub-tensor are 0 or 1, the elements of the winograd forward transformation result of the first meta-sub-tensor are also 0 or ±1. The matrix multiplication can therefore be decomposed into additions.
Computing the winograd forward transformation result of a first meta-sub-tensor involves many multiplication operations. With the approach of the present disclosure, the precomputed winograd forward transformation results of first meta-sub-tensors of various sizes can be stored in the arithmetic device, so that they can be fetched directly during actual computation without being recomputed, which shortens computing time and saves computing resources.
After the winograd forward transformation result of the first meta-sub-tensor corresponding to a first sub-tensor has been obtained, it can be multiplied by the non-zero element value of the first sub-tensor to obtain the winograd forward transformation result of the first sub-tensor. For example, again taking $d_{00}$, its winograd forward transformation result is:
Figure PCTCN2020114057-appb-000009
As another example, taking $d_{01}$, the winograd forward transformation result of $d_{01}$ is
Figure PCTCN2020114057-appb-000010
The winograd forward transformation results of all the first sub-tensors are computed through the above process, and adding them together yields the winograd forward transformation result of the input data.
Figure PCTCN2020114057-appb-000011
Figure PCTCN2020114057-appb-000012
Because the elements of the winograd forward transformation results of the first meta-sub-tensors are also 0 or ±1, the right-hand sides of equations (1) and (2) above involve only summation operations.
As the above embodiments show, the input data is decomposed into a plurality of first sub-tensors, and the winograd forward transformation result of the input data can then be obtained by summation from the precomputed winograd forward transformation results of the first meta-sub-tensors corresponding to the first sub-tensors together with the non-zero element values of the first sub-tensors. By decomposing multiplication operations into summation operations, the arithmetic device of the present disclosure can save computing time and reduce energy consumption.
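The decomposition is valid because the forward transformation is linear. As a short derivation, write $E_{ij}$ (notation introduced here for illustration, not used in the original) for the first meta-sub-tensor with a 1 in row $i$, column $j$, and $d_{ij}$ for the corresponding element of the input neurons:

```latex
d = \sum_{i,j} d_{ij}\, E_{ij}
\quad\Longrightarrow\quad
B^T d B = \sum_{i,j} d_{ij}\,\bigl(B^T E_{ij} B\bigr),
\qquad
\bigl(B^T E_{ij} B\bigr)_{rc} = (B^T)_{ri}\,(B^T)_{cj} \in \{0, \pm 1\}.
```

The last relation relies on the statement above that the elements of the forward-transformation matrices are all 0 or ±1. Each term of the sum therefore contributes $+d_{ij}$, $-d_{ij}$, or nothing to a given output element, so no general multiplication is needed.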
In one possible implementation, the master functional unit includes a cache module, and the master functional unit stores the winograd forward transformation result of the input data in the cache module.
Fig. 3 shows a block diagram of an arithmetic device according to an embodiment of the present disclosure. As shown in Fig. 3, the master functional unit may include a master processing module and a cache module. The master processing module may carry out the process, described in all of the embodiments above, in which the master functional unit decomposes the winograd forward transformation of the input data into summation operations according to the first control signal and computes the winograd forward transformation result of the input data. The master functional unit may store the winograd forward transformation result of the input data in the cache module, and the cache module sends it to the slave functional unit, which makes it convenient for the slave functional unit of the arithmetic device to perform element-wise multiplication and the winograd inverse transformation on the forward transformation result.
As shown in Fig. 3, the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit. For example, once the winograd forward transformation is complete and the winograd forward transformation result of the input data is ready, the master instruction processing unit may send the second control signal to the slave instruction processing unit according to the WINO_MIT instruction, and the master functional unit (specifically, the cache module in the master functional unit) is further configured to send the winograd forward transformation result of the input data to the slave functional unit. The second control signal may instruct the slave instruction processing unit to control the slave functional unit to further process the winograd forward transformation result of the input data.
Element-wise multiplication and winograd inverse transformation
Specifically, the slave instruction processing unit is configured to receive the second control signal sent by the master instruction processing unit and send it to the slave functional unit and the slave memory unit. The slave memory unit is configured to send the winograd forward transformation result of the weights to the slave functional unit according to the second control signal. The slave functional unit is configured to receive the winograd forward transformation result of the input data sent by the master functional unit, which includes the winograd forward transformation result of the input neurons.
The winograd forward transformation result of the weights may be computed in advance, either by conventional matrix multiplication or by the decomposition into summation operations described above.
For example, the weights are decomposed into a plurality of first sub-tensors, the winograd forward transformation is performed on the plurality of first sub-tensors, and the results are summed to obtain the winograd forward transformation result of the weights.
Assume that the weight can be expressed as:
Figure PCTCN2020114057-appb-000013
The weight is a 3×3 matrix with 9 elements, so the weight can be disassembled into 9 first sub-tensors.
Then, according to the disassembly method of the present disclosure, the 9 first sub-tensors are:
Figure PCTCN2020114057-appb-000014
Likewise, each first sub-tensor has one element that is the same as the element at the corresponding position in the weight, and all other elements are 0.
The winograd forward transform result of the weight can be computed by following steps S21 to S23, which is not repeated here.
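The disassembly of the weight transform into a sum over sub-tensors can be sketched as follows. This is a minimal illustration, not the apparatus itself: the matrix `G` is the weight-transform matrix of the commonly used F(2×2, 3×3) winograd scheme and is an assumption here, since the disclosure's own matrices appear only in figures; the helper names `matmul`, `transpose`, `meta_sub_tensor` and `forward_transform_by_sum` are hypothetical.

```python
from fractions import Fraction

# Assumption: weight-transform matrix G (4x3) of the common F(2x2, 3x3) scheme.
HALF = Fraction(1, 2)
G = [[1, 0, 0],
     [HALF, HALF, HALF],
     [HALF, -HALF, HALF],
     [0, 0, 1]]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(g):
    """Direct weight transform G * g * G^T (conventional matrix multiplication)."""
    return matmul(matmul(G, g), transpose(G))


def meta_sub_tensor(r, c):
    """3x3 tensor with a single 1 at (r, c) and 0 elsewhere."""
    return [[1 if (i, j) == (r, c) else 0 for j in range(3)] for i in range(3)]


# Computed once in advance: the transform of each of the 9 meta sub-tensors.
META = {(r, c): forward_transform(meta_sub_tensor(r, c))
        for r in range(3) for c in range(3)}


def forward_transform_by_sum(g):
    """Scale each precomputed meta result by the weight element and accumulate."""
    out = [[Fraction(0)] * 4 for _ in range(4)]
    for (r, c), meta_result in META.items():
        coeff = g[r][c]
        if coeff == 0:
            continue  # an all-zero sub-tensor contributes nothing
        for i in range(4):
            for j in range(4):
                out[i][j] += coeff * meta_result[i][j]
    return out


weight = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Because the transform is linear, `forward_transform_by_sum(weight)` gives the same 4×4 result as `forward_transform(weight)`; only the precomputed table `META` depends on the transform matrices.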
The slave functional unit is configured to perform, according to the second control signal, element-wise multiplication of the winograd forward transform result of the input neuron and the winograd forward transform result of the weight to obtain an element-wise multiplication result, to disassemble the winograd inverse transform of the element-wise multiplication result into summation operations, and to compute them to obtain the winograd convolution result of the input data.
Here, element-wise multiplication means multiplying the data at corresponding positions of two tensors and using each product as the value at the corresponding position of the element-wise multiplication result.
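As a small illustration of the definition above, the following sketch multiplies two equally shaped tensors position by position. The function name `elementwise_multiply` and the sample values are hypothetical.

```python
def elementwise_multiply(a, b):
    """Multiply two equally shaped 2-D tensors position by position."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("tensors must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


transformed_neuron = [[1, 2], [3, 4]]   # stand-in for a transformed input tile
transformed_weight = [[5, 6], [7, 8]]   # stand-in for a transformed weight
product = elementwise_multiply(transformed_neuron, transformed_weight)
```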
Assume that the winograd forward transform result of the input neuron, $B^{T} d_{4\times 4} B$, can be expressed as:
Figure PCTCN2020114057-appb-000015
The winograd forward transform result of the weight,
Figure PCTCN2020114057-appb-000016
can be expressed as:
Figure PCTCN2020114057-appb-000017
The element-wise multiplication result can be:
Figure PCTCN2020114057-appb-000018
The winograd convolution result of the input data can be expressed as
Figure PCTCN2020114057-appb-000019
The slave functional unit of the present disclosure can disassemble
Figure PCTCN2020114057-appb-000020
into summation operations and compute them to obtain the winograd convolution result of the input data, which can further save computation time and reduce energy consumption.
The specific process is similar to the disassembly of the winograd forward transform described above. In a possible implementation, the slave functional unit is configured to disassemble the element-wise multiplication result into a plurality of second sub-tensors, perform winograd inverse transform on the plurality of second sub-tensors, and sum the results to obtain the winograd convolution result of the input data.
In a possible implementation, the number of second sub-tensors is the same as the number of elements of the element-wise multiplication result, and each second sub-tensor has one element that is the same as the element at the corresponding position in the element-wise multiplication result, with all other elements being 0.
In a possible implementation, the number of second sub-tensors is the same as the number of non-zero elements of the element-wise multiplication result, and each second sub-tensor has one element that is the same as the element at the corresponding position in the element-wise multiplication result, with all other elements being 0.
Assume that the element-wise multiplication result is:
Figure PCTCN2020114057-appb-000021
The element-wise multiplication result is disassembled into a plurality of second sub-tensors, for example 16 of them. The 16 second sub-tensors are:
Figure PCTCN2020114057-appb-000022
After the disassembly, winograd inverse transform can be performed on the plurality of second sub-tensors and the results summed to obtain the winograd convolution result of the input data.
Fig. 4 shows a flowchart of performing winograd inverse transform on an element-wise multiplication result according to an embodiment of the present disclosure. As shown in Fig. 4, in a possible implementation, performing winograd inverse transform on the plurality of second sub-tensors and summing the results to obtain the winograd convolution result of the input data may include the following process:
Step S41: obtain the winograd inverse transform result of the second meta sub-tensor corresponding to each second sub-tensor, where the second meta sub-tensor corresponding to a second sub-tensor is a tensor in which the element at a second position has the value 1, and the second position in the second meta sub-tensor is the same as the position of the non-zero element in the second sub-tensor;
Step S42: multiply the non-zero element value of the second sub-tensor, as a coefficient, by the winograd inverse transform result of the corresponding second meta sub-tensor to obtain the winograd inverse transform result of the second sub-tensor;
Step S43: add the winograd inverse transform results of the plurality of second sub-tensors to obtain the winograd convolution result of the input data.
The second meta sub-tensor corresponding to a second sub-tensor is determined in the same way as the first meta sub-tensor above, which is not repeated here. The winograd inverse transform result of the second meta sub-tensor is obtained in advance as follows: for each second sub-tensor, the corresponding second meta sub-tensor is left-multiplied by the inverse-transform left-multiplication matrix and right-multiplied by the inverse-transform right-multiplication matrix to obtain the winograd inverse transform result of the second meta sub-tensor.
For matrices of a given size, the form of the corresponding second meta sub-tensors is fixed, and so are the corresponding inverse-transform left-multiplication matrix and inverse-transform right-multiplication matrix. The winograd inverse transform results of the second meta sub-tensors can therefore be computed in advance by the process described above. For the example given above, the inverse-transform left-multiplication matrix is a 2×4 matrix, which may for example be:
Figure PCTCN2020114057-appb-000023
The inverse-transform right-multiplication matrix is a 4×2 matrix, which may for example be:
Figure PCTCN2020114057-appb-000024
The dimensions of the inverse transformation matrices can be determined from the dimensions of the input neuron, the dimensions of the weight, and the convolution stride. The above is only an example and does not limit the present disclosure in any way.
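Steps S41 to S43 can be sketched as follows. This is a minimal illustration under stated assumptions: the 2×4 matrix `AT` is the inverse-transform matrix of the commonly used F(2×2, 3×3) winograd scheme, which may differ from the matrices shown in the figures of this disclosure, and all function names are hypothetical.

```python
# Assumption: inverse-transform left matrix A^T (2x4) of the common
# F(2x2, 3x3) scheme; the right matrix A (4x2) is its transpose.
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]
A = [list(col) for col in zip(*AT)]


def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_transform(m):
    """Direct inverse transform A^T * m * A of a 4x4 tensor."""
    return matmul(matmul(AT, m), A)


def meta_sub_tensor(r, c):
    """4x4 tensor with a single 1 at (r, c) and 0 elsewhere."""
    return [[1 if (i, j) == (r, c) else 0 for j in range(4)] for i in range(4)]


# Step S41: inverse transform results of the 16 meta sub-tensors,
# computed once in advance.
META_INV = {(r, c): inverse_transform(meta_sub_tensor(r, c))
            for r in range(4) for c in range(4)}


def inverse_transform_by_sum(product):
    out = [[0, 0], [0, 0]]
    for (r, c), meta_result in META_INV.items():
        coeff = product[r][c]
        if coeff == 0:
            continue  # no sub-tensor is needed for a zero element
        for i in range(2):
            for j in range(2):
                # Step S42 scales the meta result by the non-zero element;
                # step S43 accumulates the scaled results.
                out[i][j] += coeff * meta_result[i][j]
    return out


product = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
```

With the assumed `AT`, every entry of `META_INV` is 0 or ±1, so step S42 reduces to adding or subtracting the element value. Matrices that contain fractions would make step S42 a scaling by those fractions instead.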
The inverse transformation matrix is composed of
Figure PCTCN2020114057-appb-000025
so the matrix multiplication of the inverse transform can be disassembled into addition and shift operations. Multiplying the inverse transformation matrix by a second meta sub-tensor yields the winograd inverse transform result of that second meta sub-tensor, whose element values are composed of
Figure PCTCN2020114057-appb-000026
and the like. Fractions can be computed with simple shift operations, which still saves computation time compared with multiplication.
For the specific process of steps S42 and S43, refer to steps S22 and S23 above. The only difference is that the winograd inverse transform results of the second meta sub-tensors do not consist solely of 0 and ±1. Because the fractions can be computed with simple shift operations, disassembling the ordinary inverse transform as in the present disclosure still saves computation time and reduces energy consumption compared with multiplication.
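The replacement of fractional coefficients by shift operations can be illustrated as follows. This is a sketch assuming integer (fixed-point) data and coefficients that are signed powers of two; the function names and the sample terms are hypothetical, and the exact fractions used by the disclosure appear only in its figures.

```python
def scale_by_power_of_two(value, exponent):
    """Multiply an integer by 2**exponent using shifts only.

    A negative exponent is an arithmetic right shift, which rounds
    toward negative infinity when the value is not evenly divisible.
    """
    if exponent >= 0:
        return value << exponent
    return value >> -exponent


def accumulate(terms):
    """Sum terms of the form sign * value * 2**exponent without multiplying."""
    total = 0
    for value, sign, exponent in terms:
        shifted = scale_by_power_of_two(value, exponent)
        if sign < 0:
            total -= shifted
        else:
            total += shifted
    return total


# 12 * (1/2) - 8 * (1/4) + 5 * 1
total = accumulate([(12, +1, -1), (8, -1, -2), (5, +1, 0)])
```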
According to the above embodiments of the present disclosure, the element-wise multiplication result is disassembled into a plurality of second sub-tensors, and the winograd convolution result of the input data is obtained by a summation over the precomputed winograd inverse transform results of the second meta sub-tensors corresponding to the second sub-tensors, weighted by the non-zero element values of the second sub-tensors. With the operational apparatus of the present disclosure, disassembling the multiplication into summation operations can save computation time and reduce energy consumption.
In a possible implementation, as shown in Fig. 3, the slave functional unit may include an element-wise multiplication module and an inverse transform module. The element-wise multiplication module may perform the element-wise multiplication described above to obtain the element-wise multiplication result and send it to the inverse transform module, and the inverse transform module disassembles the winograd inverse transform of the element-wise multiplication result into summation operations as described above and computes them to obtain the winograd convolution result of the input data.
In a possible implementation, the slave functional unit is further configured to post-process the winograd convolution result of the input data, where the post-processing includes a bitwise rounding operation and a data rearrangement operation.
The rounding operation may round the winograd convolution result of the input data according to a configured number of rounding bits. The data rearrangement operation may refer to processing the layout of the winograd convolution result of the input data; for example, the layout may be changed according to storage requirements. Post-processing the winograd convolution result of the input data facilitates subsequent operations and calculations.
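One possible reading of this post-processing is sketched below. It assumes that the number of rounding bits means the number of fractional binary bits that are kept, and that the rearrangement chooses between row-major and column-major storage order. Both interpretations, as well as all names and sample values, are assumptions for illustration only.

```python
import math


def round_to_fractional_bits(value, bits):
    """Round to the nearest multiple of 2**-bits (halves round upward)."""
    scale = 1 << bits
    return math.floor(value * scale + 0.5) / scale


def rearrange(matrix, order):
    """Flatten a 2-D result in row-major or column-major order."""
    if order == "row":
        return [x for row in matrix for x in row]
    if order == "column":
        return [x for col in zip(*matrix) for x in col]
    raise ValueError("order must be 'row' or 'column'")


def post_process(matrix, bits, order):
    rounded = [[round_to_fractional_bits(x, bits) for x in row] for row in matrix]
    return rearrange(rounded, order)


result = post_process([[1.37, -0.6], [2.0, 0.126]], bits=2, order="column")
```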
In a possible implementation, the slave functional unit is further configured to send the winograd convolution result of the input data to the master memory unit as the input neuron of the next layer of convolution operation.
The operational apparatus according to embodiments of the present disclosure can be applied in a processor. The processor may be a general-purpose processor, such as a CPU (Central Processing Unit), or an artificial intelligence processor (IPU) for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations, and so on. Machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on. The artificial intelligence processor may, for example, include one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), a DSP (Digital Signal Processor), and a Field-Programmable Gate Array (FPGA) chip. The present disclosure does not limit the specific type of the processor.
In a possible implementation, the processor mentioned in the present disclosure may include multiple processing units, each of which can independently run the tasks assigned to it, such as convolution tasks, pooling tasks, or fully connected tasks. The present disclosure does not limit the processing units or the tasks they run.
Fig. 5 shows a schematic diagram of a processor according to an embodiment of the present disclosure. Referring to Fig. 5, the processor is configured to perform machine learning calculations and includes a controller unit 141 and an operation unit 142, where the controller unit 141 is connected to the operation unit 142, and the operation unit 142 includes one master processing circuit and multiple slave processing circuits;
The controller unit 141 is configured to obtain input data and a calculation instruction; the calculation instruction obtained by the controller unit 141 may be one or more operators in a first fusion set obtained after a first processor fuses operators.
In an optional solution, the one master processing circuit and the multiple slave processing circuits may be arranged in a tree structure, an H-type structure, or a systolic array structure. The present disclosure does not limit the manner of connection between the master processing circuit and the slave processing circuits.
In an optional solution, the input data and the calculation instruction may be obtained through a data input/output unit, which may specifically be one or more data I/O interfaces or I/O pins.
The above calculation instructions include but are not limited to forward operation instructions, backward training instructions, or other neural network operation instructions such as convolution operation instructions; they may also be the "WINO_CONV" instruction described above. The specific embodiments of the present application do not limit the specific form of the calculation instructions.
The controller unit 141 is further configured to parse the calculation instruction to obtain multiple operation instructions, and to send the multiple operation instructions and the input data to the master processing circuit;
The master processing circuit 101 is configured to perform pre-processing on the input data and to transfer data and operation instructions to and from the multiple slave processing circuits;
The multiple slave processing circuits 102 are configured to perform intermediate operations in parallel according to the data and operation instructions transferred from the master processing circuit to obtain multiple intermediate results, and to transfer the multiple intermediate results to the master processing circuit;
The master processing circuit 101 is configured to perform subsequent processing on the multiple intermediate results to obtain the calculation result of the calculation instruction.
In the technical solution provided by the present application, the operation unit is arranged in a one-master, multi-slave structure. For a calculation instruction of a forward operation, the data can be split according to that instruction, so that the multiple slave processing circuits can compute the more computation-intensive part in parallel, which increases operation speed, saves operation time, and in turn reduces power consumption.
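The one-master, multi-slave flow described above can be mimicked in software as follows. This is only an analogy under stated assumptions: threads stand in for the slave processing circuits, a dot product stands in for the computation-intensive part, and the names `slave_task` and `master_compute` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_task(chunk):
    """Intermediate operation run by one slave: a partial dot product."""
    neurons, weights = chunk
    return sum(n * w for n, w in zip(neurons, weights))


def master_compute(neurons, weights, num_slaves=4):
    # Pre-processing by the master: split the data into one chunk per slave.
    size = max(1, -(-len(neurons) // num_slaves))  # ceiling division
    chunks = [(neurons[i:i + size], weights[i:i + size])
              for i in range(0, len(neurons), size)]
    # The slaves compute their intermediate results in parallel.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        intermediate = list(pool.map(slave_task, chunks))
    # Subsequent processing by the master: combine the intermediate results.
    return sum(intermediate)


result = master_compute([1, 2, 3, 4, 5], [2, 0, 1, 3, 1], num_slaves=2)
```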
Optionally, the above machine learning calculation may specifically include an artificial neural network operation, and the above input data may specifically include input neuron data and weight data. The above calculation result may specifically be the result of the artificial neural network operation, that is, output neuron data.
An operation in a neural network may be the operation of one layer of the neural network. For a multi-layer neural network, the process is as follows. In the forward operation, after the operation of the previous layer of the artificial neural network has completed, the operation instruction of the next layer takes the output neurons computed in the operation unit as the input neurons of the next layer (or performs certain operations on those output neurons before using them as the input neurons of the next layer), and at the same time replaces the weights with the weights of the next layer. In the backward operation, after the backward operation of the previous layer of the artificial neural network has completed, the operation instruction of the next layer takes the input neuron gradients computed in the operation unit as the output neuron gradients of the next layer (or performs certain operations on those input neuron gradients before using them as the output neuron gradients of the next layer), and at the same time replaces the weights with the weights of the next layer.
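The layer-by-layer forward operation can be sketched as a loop in which the output neurons of one layer become the input neurons of the next and the weights are swapped at each step. The fully connected layer used here is only a stand-in chosen for brevity, and the function names and values are hypothetical.

```python
def dense_layer(input_neurons, weights):
    """One layer: each output neuron is a weighted sum of the input neurons."""
    return [sum(n * w for n, w in zip(input_neurons, row)) for row in weights]


def forward(input_neurons, layer_weights):
    neurons = input_neurons
    for weights in layer_weights:
        # The output neurons of this layer are the input neurons of the next,
        # and the weights are replaced by those of the next layer.
        neurons = dense_layer(neurons, weights)
    return neurons


layer_weights = [
    [[1, 1], [1, -1]],   # weights of the first layer
    [[2, 0], [0, 3]],    # weights of the second layer
]
output_neurons = forward([3, 5], layer_weights)
```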
The above machine learning calculation may also include support vector machine operations, k-nearest neighbor (k-nn) operations, k-means operations, principal component analysis operations, and so on. For ease of description, the artificial neural network operation is used below as an example to explain the specific scheme of machine learning calculation.
For an artificial neural network operation with multiple layers, the input neurons and output neurons of the multi-layer operation do not refer to the neurons in the input layer and the output layer of the entire neural network. Instead, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons. Taking a convolutional neural network as an example, suppose a convolutional neural network has L layers and K = 1, 2, ..., L-1. For the K-th layer and the (K+1)-th layer, the K-th layer is called the input layer and its neurons are the input neurons, and the (K+1)-th layer is called the output layer and its neurons are the output neurons. That is, except for the topmost layer, every layer can serve as an input layer, and the layer after it is the corresponding output layer.
Optionally, the above processor may further include a storage unit 140 and a direct memory access unit 50. The storage unit 140 may include one of or any combination of a register and a cache; specifically, the cache is configured to store the calculation instruction, the register is configured to store the input data and scalars, and the cache is a scratchpad cache. The direct memory access unit 50 is configured to read data from or store data to the storage unit 10.
Optionally, the controller unit includes an instruction storage unit 410, an instruction processing unit 411, and a storage queue unit 413;
The instruction storage unit 410 is configured to store calculation instructions associated with the artificial neural network operation;
The instruction processing unit 411 is configured to parse the calculation instruction to obtain multiple operation instructions;
The storage queue unit 413 is configured to store an instruction queue, where the instruction queue includes multiple operation instructions or calculation instructions to be executed in the order of the queue.
For example, in an optional technical solution, the master operation processing circuit may also include a controller unit, which may include a master instruction processing unit specifically configured to decode instructions into micro-instructions. In another optional solution, the slave operation processing circuit may also include another controller unit, which includes a slave instruction processing unit specifically configured to receive and process micro-instructions. A micro-instruction may be the next-level instruction of an instruction; it can be obtained by splitting or decoding the instruction, and can be further decoded into control signals for the components, units, or processing circuits.
In an optional solution, the structure of the calculation instruction may be as shown in the following table.
Opcode | Register or immediate | Register/immediate | ...
The ellipsis in the table above indicates that multiple registers or immediates may be included.
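The instruction layout in the table above, an opcode followed by any number of register or immediate operands, can be modelled as follows. The class names and the operand values are hypothetical; only the opcode name "WINO_CONV" is taken from this document.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Register:
    number: int


@dataclass(frozen=True)
class Immediate:
    value: int


Operand = Union[Register, Immediate]


@dataclass(frozen=True)
class ComputeInstruction:
    opcode: str
    operands: Tuple[Operand, ...]  # any number of registers or immediates


# Hypothetical operands chosen only to show the shape of an instruction.
instruction = ComputeInstruction(
    opcode="WINO_CONV",
    operands=(Register(0), Register(1), Immediate(4)),
)
```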
In another optional solution, the calculation instruction may include one or more operation fields and one opcode. The calculation instruction may include a neural network operation instruction. Taking a neural network operation instruction as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation fields. Each of register number 0, register number 1, register number 2, register number 3, and register number 4 may be the number of one or more registers.
Figure PCTCN2020114057-appb-000027
The above register may be an off-chip memory; in practical applications it may also be an on-chip memory. It is used to store data, which may specifically be n-dimensional data, where n is an integer greater than or equal to 1. For example, when n = 1 the data is 1-dimensional, that is, a vector; when n = 2 the data is 2-dimensional, that is, a matrix; when n is 3 or more the data is a multi-dimensional tensor.
Optionally, the controller unit may further include:
A dependency processing unit 412, configured to determine, when there are multiple operation instructions, whether a first operation instruction is associated with a zeroth operation instruction that precedes the first operation instruction; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the zeroth operation instruction has finished executing, the first operation instruction is fetched from the instruction storage unit and transferred to the operation unit;
Determining whether the first operation instruction is associated with the zeroth operation instruction that precedes it includes:
Extracting, according to the first operation instruction, a first storage address interval of the data (for example, a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, a zeroth storage address interval of the matrix required by the zeroth operation instruction. If the first storage address interval and the zeroth storage address interval overlap, it is determined that the first operation instruction is associated with the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first operation instruction is not associated with the zeroth operation instruction.
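The association check described above amounts to an interval overlap test, which can be sketched as follows. The sketch assumes half-open address intervals `[start, end)` and represents each instruction as a dictionary with a hypothetical `addr` field; neither detail is specified by the disclosure.

```python
def intervals_overlap(first, zeroth):
    """True if two half-open address intervals [start, end) share any address."""
    first_start, first_end = first
    zeroth_start, zeroth_end = zeroth
    return first_start < zeroth_end and zeroth_start < first_end


def next_action(first_instr, zeroth_instr, zeroth_finished):
    """Decide whether the first instruction must wait for the zeroth one."""
    associated = intervals_overlap(first_instr["addr"], zeroth_instr["addr"])
    if associated and not zeroth_finished:
        return "cache"  # hold it in the instruction storage unit
    return "issue"      # transfer it to the operation unit


zeroth = {"name": "op0", "addr": (0, 16)}
overlapping = {"name": "op1", "addr": (8, 24)}
disjoint = {"name": "op2", "addr": (16, 32)}
```

Here `overlapping` is cached until `zeroth` finishes, while `disjoint` can be issued immediately because its interval only touches the end of the zeroth interval.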
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本公开所必须的。It should be noted that for the foregoing method embodiments, for the sake of simple description, they are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described sequence of actions. Because according to the present disclosure, certain steps can be performed in other order or at the same time. Secondly, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the involved actions and modules are not necessarily required by the present disclosure.
进一步需要说明的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限 制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be further noted that although the various steps in the flowchart are displayed in sequence according to the directions of the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless specifically stated in this article, there is no strict order for the execution of these steps, and these steps can be executed in other orders. Moreover, at least part of the steps in the flowchart may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but can be executed at different times. The execution of these sub-steps or stages The sequence is not necessarily performed sequentially, but may be performed alternately or alternately with at least a part of other steps or sub-steps or stages of other steps.
It should be understood that the foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways. For example, the division of units/modules in the above embodiments is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units, modules, or components may be combined or integrated into another system, or some features may be omitted or not executed.
In addition, unless otherwise specified, the functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist physically on its own, or two or more units/modules may be integrated together. The integrated unit/module may be implemented in the form of hardware or in the form of a software program module.
If the integrated unit/module is implemented in the form of hardware, the hardware may be a digital circuit, an analog circuit, and so on. Physical implementations of the hardware structure include, but are not limited to, transistors, memristors, and so on. Unless otherwise specified, the artificial intelligence processor may be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, or ASIC. Unless otherwise specified, the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, for example, resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), and so on.
If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it may be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above computing device.
In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device, and the above artificial intelligence chip; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 6, in addition to the above chip 389, the board card may include other supporting components, including but not limited to: a storage device 390, an interface device 391, and a control device 392.
The storage device 390 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
DDR doubles the speed of SDRAM without raising the clock frequency, because it allows data to be read on both the rising and falling edges of the clock pulse. DDR is therefore twice as fast as standard SDRAM. In one embodiment, the storage device may include 4 groups of storage units. Each group of storage units may include multiple DDR4 chips. In one embodiment, the artificial intelligence chip may internally include four 72-bit DDR4 controllers, in each of which 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data transmission bandwidth can reach 25600 MB/s.
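The 25600 MB/s figure follows from the transfer rate and the data width. The sketch below reproduces that arithmetic; the function name is illustrative, and the four-controller total assumes the controllers operate independently, which the text does not state.

```python
def ddr_bandwidth_mb_s(mega_transfers_per_s: int, data_bits: int) -> float:
    """Theoretical peak bandwidth in MB/s: transfers per second times bytes per transfer."""
    return mega_transfers_per_s * data_bits / 8

# DDR4-3200 performs 3200 mega-transfers per second.
# Only the 64 data bits count; the 8 ECC bits carry no payload.
per_controller = ddr_bandwidth_mb_s(3200, 64)

# Assumption: the four controllers run independently and in parallel.
all_four_controllers = 4 * per_controller

print(per_controller, all_four_controllers)
```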
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for the DDR is provided in the chip to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the artificial intelligence chip. The interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIe interface: the data to be processed is transferred from the server to the chip through the standard PCIe interface. Preferably, when a PCIe 3.0 x16 interface is used, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may be another kind of interface. The present disclosure does not limit the specific form of such other interfaces, as long as the interface unit can perform the transfer function. In addition, the calculation result of the artificial intelligence chip is likewise transmitted back to the external device (such as a server) by the interface device.
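The 16000 MB/s figure corresponds to the raw signalling rate of a PCIe 3.0 x16 link (8 GT/s per lane across 16 lanes). As a rough sketch of the arithmetic, the code below also shows the slightly lower rate after the 128b/130b line encoding that PCIe 3.0 uses. The function name is illustrative, and protocol overhead above the line encoding is ignored.

```python
def pcie_bandwidth_mb_s(gt_per_s: float, lanes: int,
                        payload_bits: int = 1, coded_bits: int = 1) -> float:
    """One-direction link bandwidth in MB/s for a given line-encoding ratio."""
    return gt_per_s * 1000 * lanes * payload_bits / coded_bits / 8

# Raw rate: 8 GT/s per lane, 16 lanes, one bit per transfer per lane.
raw = pcie_bandwidth_mb_s(8.0, 16)

# After 128b/130b encoding, 128 of every 130 transferred bits are payload.
effective = pcie_bandwidth_mb_s(8.0, 16, payload_bits=128, coded_bits=130)

print(raw, round(effective, 1))
```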
The control device is electrically connected to the artificial intelligence chip. The control device is used to monitor the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). The artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. The artificial intelligence chip can therefore be in different working states, such as heavy load and light load. The control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
In a possible implementation, an electronic device is disclosed, which includes the above artificial intelligence chip. The electronic device includes a data processing apparatus, robot, computer, printer, scanner, tablet computer, smart terminal, mobile phone, dashboard camera, navigator, sensor, webcam, server, cloud server, camera, video camera, projector, watch, headset, mobile storage, wearable device, vehicle, household appliance, and/or medical device. The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, air conditioner, microwave oven, refrigerator, rice cooker, humidifier, washing machine, electric lamp, gas stove, and range hood; the medical device includes a nuclear magnetic resonance scanner, a B-mode ultrasound scanner, and/or an electrocardiograph.
An embodiment of the present disclosure also provides a computer-readable storage medium on which computer program instructions are stored, the computer program instructions implementing the above method when executed by a processor. The computer-readable storage medium may be a non-volatile computer-readable storage medium.
An embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory to execute the above method.
The foregoing may be better understood in light of the following clauses:
Clause A1. A computing device, comprising: a master instruction processing unit, a master functional unit, a slave instruction processing unit, and a slave functional unit,
wherein the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction;
the master functional unit is configured to, according to the first control signal, decompose the Winograd forward transform of input data into summation operations, perform the calculation to obtain the Winograd forward transform result of the input data, and send the Winograd forward transform result of the input data to the slave functional unit,
wherein the Winograd forward transform result of the input data includes the Winograd forward transform result of an input neuron;
the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit;
the slave functional unit is configured to, according to the second control signal, perform element-wise multiplication of the Winograd forward transform result of the input neuron and the Winograd forward transform result of a weight to obtain an element-wise multiplication result, decompose the Winograd inverse transform of the element-wise multiplication result into summation operations, and perform the calculation to obtain the Winograd convolution result of the input data.
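The data flow of clause A1 (forward transform, element-wise multiplication, inverse transform) can be sketched for a single tile as follows. This is a minimal illustration, not the disclosed hardware: the tile size F(2x2, 3x3) and the commonly used transform matrices `BT`, `G`, and `AT` are assumptions, since this section does not specify them, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Assumed F(2x2, 3x3) transform matrices (not taken from this section).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

def winograd_conv_tile(d, g):
    """4x4 input tile d and 3x3 weight g -> 2x2 output tile."""
    v = matmul(matmul(BT, d), transpose(BT))   # forward transform of the input neuron
    u = matmul(matmul(G, g), transpose(G))     # forward transform of the weight
    m = [[u[i][j] * v[i][j] for j in range(4)] for i in range(4)]  # element-wise multiplication
    return matmul(matmul(AT, m), transpose(AT))  # inverse transform

def direct_conv_tile(d, g):
    """Reference: sliding-window sum of products, as used in CNN layers."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]

d = [[1, 2, 0, 4], [0, 6, 7, 8], [9, 0, 11, 12], [13, 14, 15, 0]]
g = [[1, 0, -1], [2, 1, 0], [0, -1, 3]]
y_winograd = winograd_conv_tile(d, g)
y_direct = direct_conv_tile(d, g)
print(y_winograd, y_direct)
```

In this sketch the master functional unit would correspond to the computation of `v`, and the slave functional unit to the element-wise product and the final inverse transform.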
Clause A2. The computing device according to clause A1, wherein the master functional unit is configured to, according to the first control signal, decompose the input data into a plurality of first sub-tensors, perform the Winograd forward transform on the plurality of first sub-tensors, and sum the results to obtain the Winograd forward transform result of the input data.
Clause A3. The computing device according to clause A2, wherein the input data is expressed in tensor form, the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the input data, and each first sub-tensor of the plurality of first sub-tensors has one element that is the same as the element at the corresponding position in the input data, with all other elements being 0.
Clause A4. The computing device according to clause A3, wherein performing the Winograd forward transform on the plurality of first sub-tensors and summing to obtain the Winograd forward transform result of the input data includes:
obtaining the Winograd forward transform result of the first meta sub-tensor corresponding to each first sub-tensor; wherein the first meta sub-tensor corresponding to a first sub-tensor is a tensor in which the element at a first position has the value 1, the first position in the first meta sub-tensor being the same as the position of the non-zero element in the first sub-tensor;
multiplying the Winograd forward transform result of the corresponding first meta sub-tensor by the non-zero element value of the first sub-tensor, used as a coefficient, to obtain the Winograd forward transform result of the first sub-tensor;
adding the Winograd forward transform results of the plurality of first sub-tensors to obtain the Winograd forward transform result of the input data.
Clause A5. The computing device according to clause A4, wherein the Winograd forward transform result of the first meta sub-tensor corresponding to a first sub-tensor is obtained in advance through the following process:
for each first sub-tensor, multiplying the first meta sub-tensor corresponding to that first sub-tensor on the left by the forward transform left-multiplication matrix and on the right by the forward transform right-multiplication matrix to obtain the Winograd forward transform result of the first meta sub-tensor.
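Clauses A2 to A5 can be illustrated with a short sketch. It assumes the commonly used F(2x2, 3x3) forward transform matrix `BT` (not given in this section) as the left-multiplication matrix and its transpose as the right-multiplication matrix. With that choice, every precomputed meta sub-tensor transform contains only 0, 1, and -1, so scaling by the coefficient and accumulating reduces to additions and subtractions. All names are hypothetical.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

# Assumed forward transform matrix for F(2x2, 3x3).
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]

def unit_tensor(i, j, n=4):
    """Meta sub-tensor: 1 at position (i, j), 0 elsewhere."""
    return [[1 if (r, c) == (i, j) else 0 for c in range(n)] for r in range(n)]

# Clause A5: transforms of all meta sub-tensors, computed once in advance.
META_FWD = {(i, j): matmul(matmul(BT, unit_tensor(i, j)), transpose(BT))
            for i in range(4) for j in range(4)}

def forward_by_summation(d):
    """Clause A4: accumulate coefficient * meta transform using only add/subtract."""
    out = [[0] * 4 for _ in range(4)]
    for (i, j), meta in META_FWD.items():
        coeff = d[i][j]
        if coeff == 0:
            continue  # clause A3: one sub-tensor per non-zero element only
        for r in range(4):
            for s in range(4):
                if meta[r][s] == 1:
                    out[r][s] += coeff
                elif meta[r][s] == -1:
                    out[r][s] -= coeff
    return out

d = [[1, 2, 0, 4], [0, 6, 7, 8], [9, 0, 11, 12], [13, 14, 15, 0]]
by_sum = forward_by_summation(d)
by_matmul = matmul(matmul(BT, d), transpose(BT))
print(by_sum == by_matmul)
```

The lookup table trades a small amount of storage for the removal of all multiplications from the forward transform, which is the point of decomposing the transform into summation operations.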
Clause A6. The computing device according to clause A2, wherein the master functional unit includes a cache module, the master functional unit stores the Winograd forward transform result of the input data in the cache module, and the cache module is further configured to send the Winograd forward transform result of the input data to the slave functional unit.
Clause A7. The computing device according to clause A1, wherein the computing device includes: a master memory unit,
the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master memory unit according to the input instruction;
the master memory unit is configured to send the input data to the master functional unit according to the first control signal.
Clause A8. The computing device according to clause A1, wherein the slave functional unit is configured to decompose the element-wise multiplication result into a plurality of second sub-tensors, perform the Winograd inverse transform on the plurality of second sub-tensors, and sum the results to obtain the Winograd convolution result of the input data.
Clause A9. The computing device according to clause A8, wherein the number of the plurality of second sub-tensors is the same as the number of non-zero elements of the element-wise multiplication result, and each second sub-tensor of the plurality of second sub-tensors has one element that is the same as the element at the corresponding position in the element-wise multiplication result, with all other elements being 0.
Clause A10. The computing device according to clause A9, wherein performing the Winograd inverse transform on the plurality of second sub-tensors and summing to obtain the Winograd convolution result of the input data includes:
obtaining the Winograd inverse transform result of the second meta sub-tensor corresponding to each second sub-tensor; wherein the second meta sub-tensor corresponding to a second sub-tensor is a tensor in which the element at a second position has the value 1, the second position in the second meta sub-tensor being the same as the position of the non-zero element in the second sub-tensor;
multiplying the Winograd inverse transform result of the corresponding second meta sub-tensor by the non-zero element value of the second sub-tensor, used as a coefficient, to obtain the Winograd inverse transform result of the second sub-tensor;
adding the Winograd inverse transform results of the plurality of second sub-tensors to obtain the Winograd convolution result of the input data.
Clause A11. The computing device according to clause A10, wherein the Winograd inverse transform result of the second meta sub-tensor is obtained in advance through the following process:
for each second sub-tensor, multiplying the second meta sub-tensor corresponding to that second sub-tensor on the left by the inverse transform left-multiplication matrix and on the right by the inverse transform right-multiplication matrix to obtain the Winograd inverse transform result of the second meta sub-tensor.
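The inverse transform of clauses A8 to A11 follows the same pattern on the slave side. The sketch below assumes the commonly used F(2x2, 3x3) inverse transform matrix `AT` (not given in this section) as the left-multiplication matrix and its transpose as the right-multiplication matrix. For a unit tensor with a 1 at position (i, j), the inverse transform has the closed form `AT[r][i] * AT[s][j]`, which is used here to build the table. All names are hypothetical.

```python
# Assumed inverse transform matrix for F(2x2, 3x3).
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]

# Clause A11: inverse transforms of all 4x4 meta sub-tensors, computed in advance.
# Entry (r, s) of the transform of the unit tensor at (i, j) is AT[r][i] * AT[s][j].
META_INV = {(i, j): [[AT[r][i] * AT[s][j] for s in range(2)] for r in range(2)]
            for i in range(4) for j in range(4)}

def inverse_by_summation(m):
    """Clause A10: accumulate coefficient * meta transform using only add/subtract."""
    out = [[0, 0], [0, 0]]
    for (i, j), meta in META_INV.items():
        coeff = m[i][j]
        if coeff == 0:
            continue  # clause A9: one sub-tensor per non-zero element only
        for r in range(2):
            for s in range(2):
                if meta[r][s] == 1:
                    out[r][s] += coeff
                elif meta[r][s] == -1:
                    out[r][s] -= coeff
    return out

def inverse_by_matmul(m):
    """Reference: AT * m * A written out as a double sum."""
    return [[sum(AT[r][i] * m[i][j] * AT[s][j] for i in range(4) for j in range(4))
             for s in range(2)] for r in range(2)]

# Example element-wise multiplication result (values chosen arbitrarily).
m = [[1.5, 0, -2, 4], [0, 3, 0.25, -1], [7, 0, 0, 2], [-3, 5, 1, 0]]
inv_sum = inverse_by_summation(m)
inv_ref = inverse_by_matmul(m)
print(inv_sum, inv_ref)
```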
Clause A12. The computing device according to clause A1, wherein the computing device further includes: a slave memory unit,
the slave instruction processing unit is further configured to send the second control signal to the slave memory unit;
the slave memory unit is configured to send the Winograd forward transform result of the weight to the slave functional unit according to the second control signal.
Clause A13. The computing device according to any one of clauses A7-A12, wherein the computing device further includes a master memory unit, and the slave functional unit is further configured to send the Winograd convolution result of the input data to the master memory unit.
Clause A14. The computing device according to any one of clauses A7-A12, wherein the slave functional unit is further configured to perform post-processing on the Winograd convolution result of the input data, the post-processing including a bitwise rounding operation and a number conversion operation.
Clause A15. An artificial intelligence chip, comprising the computing device according to any one of clauses A1-A14.
Clause A16. An electronic device, comprising the artificial intelligence chip according to clause A15.
The embodiments of the present disclosure have been described in detail above, and specific examples have been used to explain the principles and implementations of the present disclosure. The description of the above embodiments is intended only to help in understanding the method of the present disclosure and its core ideas. Changes or modifications made by those skilled in the art to the specific implementations and scope of application, based on the ideas of the present disclosure, all fall within the scope of protection of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.

Claims (16)

  1. A computing device, characterized in that the computing device comprises: a master instruction processing unit, a master functional unit, a slave instruction processing unit, and a slave functional unit,
    wherein the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master functional unit according to the input instruction;
    the master functional unit is configured to, according to the first control signal, decompose the Winograd forward transform of input data into summation operations, perform the calculation to obtain the Winograd forward transform result of the input data, and send the Winograd forward transform result of the input data to the slave functional unit,
    wherein the Winograd forward transform result of the input data includes the Winograd forward transform result of an input neuron;
    the master instruction processing unit is further configured to send a second control signal to the slave instruction processing unit, and the slave instruction processing unit is configured to send the second control signal to the slave functional unit;
    the slave functional unit is configured to, according to the second control signal, perform element-wise multiplication of the Winograd forward transform result of the input neuron and the Winograd forward transform result of a weight to obtain an element-wise multiplication result, decompose the Winograd inverse transform of the element-wise multiplication result into summation operations, and perform the calculation to obtain the Winograd convolution result of the input data.
  2. The computing device according to claim 1, characterized in that the master functional unit is configured to, according to the first control signal, decompose the input data into a plurality of first sub-tensors, perform the Winograd forward transform on the plurality of first sub-tensors, and sum the results to obtain the Winograd forward transform result of the input data.
  3. The computing device according to claim 2, characterized in that the input data is expressed in tensor form, the number of the plurality of first sub-tensors is the same as the number of non-zero elements of the input data, and each first sub-tensor of the plurality of first sub-tensors has one element that is the same as the element at the corresponding position in the input data, with all other elements being 0.
  4. The computing device according to claim 3, characterized in that performing the Winograd forward transform on the plurality of first sub-tensors and summing to obtain the Winograd forward transform result of the input data includes:
    obtaining the Winograd forward transform result of the first meta sub-tensor corresponding to each first sub-tensor; wherein the first meta sub-tensor corresponding to a first sub-tensor is a tensor in which the element at a first position has the value 1, the first position in the first meta sub-tensor being the same as the position of the non-zero element in the first sub-tensor;
    multiplying the Winograd forward transform result of the corresponding first meta sub-tensor by the non-zero element value of the first sub-tensor, used as a coefficient, to obtain the Winograd forward transform result of the first sub-tensor;
    adding the Winograd forward transform results of the plurality of first sub-tensors to obtain the Winograd forward transform result of the input data.
  5. The computing device according to claim 4, characterized in that the Winograd forward transform result of the first meta sub-tensor corresponding to a first sub-tensor is obtained in advance through the following process:
    for each first sub-tensor, multiplying the first meta sub-tensor corresponding to that first sub-tensor on the left by the forward transform left-multiplication matrix and on the right by the forward transform right-multiplication matrix to obtain the Winograd forward transform result of the first meta sub-tensor.
  6. The computing device according to claim 2, characterized in that the master functional unit includes a cache module, the master functional unit stores the Winograd forward transform result of the input data in the cache module, and the cache module is further configured to send the Winograd forward transform result of the input data to the slave functional unit.
  7. The computing device according to claim 1, characterized in that the computing device includes: a master memory unit,
    the master instruction processing unit is configured to, after receiving an input instruction, send a first control signal to the master memory unit according to the input instruction;
    the master memory unit is configured to send the input data to the master functional unit according to the first control signal.
  8. The computing device according to claim 1, characterized in that the slave functional unit is configured to decompose the element-wise multiplication result into a plurality of second sub-tensors, perform the Winograd inverse transform on the plurality of second sub-tensors, and sum the results to obtain the Winograd convolution result of the input data.
  9. The computing device according to claim 8, characterized in that the number of the plurality of second sub-tensors is the same as the number of non-zero elements of the element-wise multiplication result, and each second sub-tensor of the plurality of second sub-tensors has one element that is the same as the element at the corresponding position in the element-wise multiplication result, with all other elements being 0.
  10. The computing device according to claim 9, characterized in that performing the Winograd inverse transform on the plurality of second sub-tensors and summing to obtain the Winograd convolution result of the input data includes:
    obtaining the Winograd inverse transform result of the second meta sub-tensor corresponding to each second sub-tensor; wherein the second meta sub-tensor corresponding to a second sub-tensor is a tensor in which the element at a second position has the value 1, the second position in the second meta sub-tensor being the same as the position of the non-zero element in the second sub-tensor;
    multiplying the Winograd inverse transform result of the corresponding second meta sub-tensor by the non-zero element value of the second sub-tensor, used as a coefficient, to obtain the Winograd inverse transform result of the second sub-tensor;
    adding the Winograd inverse transform results of the plurality of second sub-tensors to obtain the Winograd convolution result of the input data.
  11. The computing device according to claim 10, characterized in that the Winograd inverse transform result of the second meta sub-tensor is obtained in advance through the following process:
    for each second sub-tensor, multiplying the second meta sub-tensor corresponding to that second sub-tensor on the left by the inverse transform left-multiplication matrix and on the right by the inverse transform right-multiplication matrix to obtain the Winograd inverse transform result of the second meta sub-tensor.
  12. The computing device according to claim 1, characterized in that the computing device further includes: a slave memory unit,
    the slave instruction processing unit is further configured to send the second control signal to the slave memory unit;
    the slave memory unit is configured to send the Winograd forward transform result of the weight to the slave functional unit according to the second control signal.
  13. The computing device according to any one of claims 7-12, characterized in that the computing device further includes a master memory unit, and the slave functional unit is further configured to send the Winograd convolution result of the input data to the master memory unit.
  14. The computing device according to any one of claims 7-12, characterized in that the slave functional unit is further configured to perform post-processing on the Winograd convolution result of the input data, the post-processing including a bitwise rounding operation and a number conversion operation.
  15. An artificial intelligence chip, characterized in that the chip includes the computing device according to any one of claims 1-14.
  16. An electronic device, characterized in that the electronic device includes the artificial intelligence chip according to claim 15.
PCT/CN2020/114057 2019-11-01 2020-09-08 Operational apparatus and related product WO2021082747A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911060683.9A CN112766473B (en) 2019-11-01 2019-11-01 Computing device and related product
CN201911060683.9 2019-11-01

Publications (1)

Publication Number Publication Date
WO2021082747A1 true WO2021082747A1 (en) 2021-05-06

Family

ID=75692128

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/114057 WO2021082747A1 (en) 2019-11-01 2020-09-08 Operational apparatus and related product

Country Status (2)

Country Link
CN (1) CN112766473B (en)
WO (1) WO2021082747A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117852600A (en) * 2024-03-06 2024-04-09 北京壁仞科技开发有限公司 Artificial intelligence chip, method of operating the same, and machine-readable storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168928A (en) * 2017-05-03 2017-09-15 荣成市鼎通电子信息科技有限公司 The eight point Winograd Fourier transformers without rearrangement
CN108229654A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network convolution algorithm device and method
CN108229656A (en) * 2016-12-14 2018-06-29 上海寒武纪信息科技有限公司 Neural network computing device and method
CN109117187A (en) * 2018-08-27 2019-01-01 郑州云海信息技术有限公司 Convolutional neural networks accelerated method and relevant device
US20190042923A1 (en) * 2017-08-07 2019-02-07 Intel Corporation System and method for an optimized winograd convolution accelerator
CN110096309A (en) * 2018-11-14 2019-08-06 上海寒武纪信息科技有限公司 Operation method, device, computer equipment and storage medium
CN110147249A (en) * 2018-02-12 2019-08-20 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110163349A (en) * 2018-02-12 2019-08-23 上海寒武纪信息科技有限公司 A kind of calculation method and device of network model
CN110288086A (en) * 2019-06-13 2019-09-27 天津大学 A kind of configurable convolution array accelerator structure based on Winograd

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325591B (en) * 2018-09-26 2020-12-29 中国科学院计算技术研究所 Winograd convolution-oriented neural network processor

Also Published As

Publication number Publication date
CN112766473B (en) 2023-12-05
CN112766473A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN109543832B (en) Computing device and board card
TWI795519B (en) Computing apparatus, machine learning computing apparatus, combined processing device, neural network chip, electronic device, board, and method for performing machine learning calculation
US20200097792A1 (en) Processing apparatus and processing method
WO2019148781A1 (en) Operation module and method
CN109685201B (en) Operation method, device and related product
CN110163357B (en) Computing device and method
CN111047022A (en) Computing device and related product
CN110059797B (en) Computing device and related product
WO2021036362A1 (en) Method and apparatus for processing data, and related product
WO2021083101A1 (en) Data processing method and apparatus, and related product
WO2021082725A1 (en) Winograd convolution operation method and related product
WO2021082747A1 (en) Operational apparatus and related product
WO2021082746A1 (en) Operation apparatus and related product
WO2021114903A1 (en) Data processing method and apparatus, computer device, and storage medium
CN109740730B (en) Operation method, device and related product
CN109711538B (en) Operation method, device and related product
WO2021082723A1 (en) Operation apparatus
WO2022001500A1 (en) Computing apparatus, integrated circuit chip, board card, electronic device, and computing method
CN111382852B (en) Data processing device, method, chip and electronic equipment
WO2021082724A1 (en) Operation method and related product
CN112784206A (en) Winograd convolution operation method, device, equipment and storage medium
WO2021082722A1 (en) Computing device and method, and related product
WO2022134688A1 (en) Data processing circuit, data processing method, and related products
WO2021223644A1 (en) Data processing method and device, and related product
JP7368512B2 (en) Computing equipment, integrated circuit chips, board cards, electronic devices and computing methods

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20880916; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20880916; Country of ref document: EP; Kind code of ref document: A1)