WO2021082723A1 - Computing device - Google Patents

Computing device

Info

Publication number
WO2021082723A1
WO2021082723A1 (PCT/CN2020/113162)
Authority
WO
WIPO (PCT)
Prior art keywords
transformation
result
data
multiplication
unit
Prior art date
Application number
PCT/CN2020/113162
Other languages
English (en)
French (fr)
Inventor
张英男
曾洪博
张尧
刘少礼
黄迪
周诗怡
张曦珊
刘畅
郭家明
高钰峰
Original Assignee
中科寒武纪科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科寒武纪科技股份有限公司 filed Critical 中科寒武纪科技股份有限公司
Priority to US17/773,446, published as US20230039892A1
Publication of WO2021082723A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/14 Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve transforms
    • G06F17/141 Discrete Fourier transforms
    • G06F17/144 Prime factor Fourier transforms, e.g. Winograd transforms, number theoretic transforms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/52 Multiplying; Dividing
    • G06F7/523 Multiplying only
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a computing device.
  • With the development of artificial intelligence technology, networks that rely on computing devices are growing ever larger and must perform an increasing volume of operations, especially convolution operations.
  • Existing computing devices have high power consumption and long computation times, which limits their application in the field of artificial intelligence technology.
  • the first aspect of the embodiments of the present application provides a computing device for performing a winograd convolution operation; the device includes a control unit, a storage unit, and an arithmetic unit;
  • the control unit is configured to send a control instruction, and the control instruction is used to instruct the arithmetic unit to perform a winograd convolution operation,
  • the storage unit is used to store data used for winograd convolution operation
  • the arithmetic unit is configured to extract data from the storage unit in response to the control instruction and perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation and completes the winograd transformation of the data according to the summation operation.
  • the second aspect of the embodiments of the present application provides an artificial intelligence chip, and the chip includes the computing device according to any one of the first aspect of the present application.
  • the third aspect of the embodiments of the present application provides an electronic device, and the electronic device includes the artificial intelligence chip as described in the second aspect of the present application.
  • the fourth aspect of the embodiments of the present application provides a board card, the board card includes: a storage device, an interface device, a control device, and the artificial intelligence chip as described in the second aspect of the present application;
  • the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
  • the storage device is used to store data
  • the interface device is used to implement data transmission between the artificial intelligence chip and external equipment
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the fifth aspect of the embodiments of the present application provides an operation method, which is applied to an operation device, and the operation device includes: a control unit, a storage unit, and an operation unit; wherein,
  • the control unit sends a control instruction, and the control instruction is used to instruct the arithmetic unit to perform a winograd convolution operation,
  • the storage unit stores data used for winograd convolution operation
  • the arithmetic unit extracts data from the storage unit to perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation and completes the winograd transformation of the data according to the summation operation.
  • In this scheme, the control unit sends a control instruction instructing the arithmetic unit to perform the winograd convolution operation, and the storage unit stores the data used in that operation; in response to the control instruction, the arithmetic unit extracts the data from the storage unit, disassembles the transformation operation on the data into a summation operation, and completes the winograd transformation of the data according to the summation operation. Replacing the large number of multiplications in the transformation operation with additions accelerates the winograd convolution operation and saves computing resources.
  • the solution provided in this application can reduce the resource consumption of convolution operations, increase the speed of convolution operations, and reduce computing time.
  • Figure 1 is a structural block diagram of an arithmetic device in an embodiment
  • Figure 2 is a structural block diagram of an arithmetic device in another embodiment
  • Fig. 3 is a structural block diagram of an arithmetic device in still another embodiment
  • Fig. 4 is a structural block diagram of an arithmetic device in another embodiment
  • Fig. 5 is a schematic flow chart of an operation method in an embodiment
  • Fig. 6 is a structural block diagram of a board in an embodiment.
  • A convolution operation opens an active window, the same size as the convolution kernel, starting from the upper-left corner of the image.
  • The active window frames a window image, i.e., the region of the image currently covered by the convolution kernel.
  • The window image is multiplied element-by-element with the convolution kernel and the products are summed; the result is the first pixel value of the new image produced by the convolution operation.
  • The active window then moves one column to the right, the new window image is again multiplied element-wise with the kernel and summed, and the result is the second pixel value of the new image; this continues until the whole image is traversed.
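The sliding-window procedure described above can be sketched as follows (an illustrative NumPy sketch, not part of the publication; the function name direct_conv2d is hypothetical):

```python
import numpy as np

def direct_conv2d(image, kernel):
    """Slide an active window the size of the kernel over the image;
    multiply the window image element-wise with the kernel and sum the
    products to produce one pixel of the new image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):           # window moves one column at a time
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out
```

Each output pixel costs one full multiply-and-accumulate of the window image with the kernel; this is the multiplication load that the winograd method described below reduces.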
  • Winograd convolution is a convolution acceleration method based on a polynomial interpolation algorithm. The two inputs of the convolution operation, a first target matrix and a second target matrix, are each subjected to a winograd forward transformation; the forward-transformed matrices are then multiplied element-wise (bitwise multiplication); finally, a winograd inverse transformation is applied to the result of the bitwise multiplication, yielding a convolution result equivalent to that of the original convolution operation.
  • Existing artificial intelligence technology usually relies on processor-based convolution operations for feature extraction, for example, when processing feature data in a neural network.
  • A convolutional layer in the neural network convolves the input feature data with a preset convolution kernel and outputs the operation result.
  • The network may contain multiple adjacently arranged convolutional layers, the operation result of each layer serving as the feature data input to the next layer.
  • During the operation, the convolution kernel containing the weight data "slides" as a window over the matrix of the input feature data and performs a matrix multiplication with each local matrix, and the final operation result is assembled from these matrix multiplications. The existing convolution operation therefore requires a large number of matrix multiplications, each of which multiplies every row of matrix elements of the feature data with every column of matrix elements of the convolution kernel and then accumulates the products. The computation load is huge, the processing efficiency is low, and the energy consumption of the computing device is high.
  • an embodiment of the present application provides an arithmetic device for performing a winograd convolution operation.
  • the transformation operation of the data in the winograd convolution operation is further disassembled into a summation operation, and the winograd transformation of the data is completed according to the summation operation to accelerate the operation process.
  • Fig. 1 is a structural block diagram of an arithmetic device in an embodiment.
  • The computing device 10 shown in FIG. 1 includes a control unit 11, a storage unit 12, and an arithmetic unit 13. The control unit 11 controls the storage unit 12 and the arithmetic unit 13 by issuing instructions, and finally obtains the operation result.
  • In the process of forward-transforming the feature data, the feature data is the target matrix of that process; in the process of forward-transforming the weight data, the weight data is the target matrix; and in the process of inversely transforming the result of the bitwise multiplication, that multiplication result is the target matrix.
  • control unit 11 is configured to send a control instruction, and the control instruction is used to instruct the arithmetic unit to perform a winograd convolution operation.
  • control instruction may include a first instruction and a second instruction, wherein the first instruction includes a forward transformation instruction, and the second instruction includes a bitwise multiplication instruction and an inverse transformation instruction.
  • the control unit 11 is configured to send a first instruction and a second instruction to control the arithmetic unit to extract data from the storage unit and perform corresponding winograd transformation.
  • the storage unit 12 is used to store data used for winograd convolution operations.
  • the data includes, for example, at least one of feature data and weight data.
  • the winograd transformation includes a forward transformation and/or an inverse transformation.
  • the arithmetic unit 13 is configured to extract data from the storage unit in response to the control instruction and perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation and completes the winograd transformation of the data according to the summation operation.
  • the winograd convolution operation can be understood as the calculation using the following formula: S = A^T [ (G g G^T) ⊙ (B^T d B) ] A, where ⊙ denotes the bitwise (element-wise) multiplication;
  • S represents the convolution matrix, that is, the result matrix obtained by convolving the feature data with the weight data;
  • d represents the feature data;
  • g represents the weight data;
  • B represents the feature transformation matrix that implements the forward transformation of the feature data;
  • B^T denotes the transpose of B;
  • G represents the weight transformation matrix that implements the forward transformation of the weight data;
  • G^T denotes the transpose of G;
  • A represents the transformation matrix that implements the inverse transformation of the bitwise multiplication result;
  • A^T denotes the transpose of A.
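As an illustration of this formula, the standard F(2x2, 3x3) winograd transformation matrices from the literature (assumed here; the publication does not list concrete matrices) can be used to check that S = A^T[(GgG^T) ⊙ (B^T dB)]A reproduces the direct convolution of a 4x4 feature tile with a 3x3 kernel:

```python
import numpy as np

# Standard F(2x2, 3x3) winograd matrices (assumed from the literature).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_conv(d, g):
    """S = A^T [ (G g G^T) ⊙ (B^T d B) ] A for a 4x4 tile d and 3x3 kernel g."""
    U = G @ g @ G.T          # forward transformation of the weight data
    V = B_T @ d @ B_T.T      # forward transformation of the feature data
    M = U * V                # bitwise (element-wise) multiplication
    return A_T @ M @ A_T.T   # inverse transformation
```

Note that B^T, A^T, and B contain only 0 and ±1, so the feature forward transform and the inverse transform need no true multiplications — which is what makes the disassembly into summation operations described below possible.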
  • The winograd transformations include forward transformations and inverse transformations.
  • This application disassembles the transformation operation on the data (such as the feature data) in the winograd convolution operation into a summation operation and completes the winograd transformation of the data according to that summation operation, replacing the large number of multiplications in the transformation operation with additions, which accelerates the winograd convolution operation and saves computing resources.
  • the solution provided in this application can reduce the resource consumption of convolution operation, increase the speed of convolution operation, and reduce operation time.
  • the arithmetic unit 13 is specifically configured to disassemble the data into multiple sub-tensors, perform transformation operations on the multiple sub-tensors and sum them, and obtain the winograd transformation result of the data from the result of the summation operation.
  • the process of disassembling the data into multiple sub-tensors can be understood as follows: the arithmetic unit parses the data to obtain multiple sub-tensors, where the data is the sum of the multiple sub-tensors, the number of sub-tensors equals the number of non-zero elements in the data, each sub-tensor has a single non-zero element, and that non-zero element is identical to the element at the corresponding position in the data.
  • Feature data d of scale X × Y is taken as an example to describe the process by which the arithmetic unit parses the data into multiple sub-tensors.
  • Of the X × Y sub-tensors, take the first as an example: it has a single non-zero element d 00 and all other elements are zero, and this non-zero element is identical to the element at the corresponding position in the feature data, namely d 00 .
  • After disassembling the data into multiple sub-tensors, the arithmetic unit performs transformation operations on them and sums the results to obtain the winograd transformation result of the data. Specifically, the arithmetic unit obtains the winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where a meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; it then multiplies the winograd transformation result of each meta-sub-tensor by the value of the corresponding non-zero element, used as a coefficient, to obtain the winograd transformation result of that sub-tensor; finally, the winograd transformation results of the multiple sub-tensors are added to obtain the winograd transformation result of the data.
  • the arithmetic unit is specifically configured to multiply the meta-sub-tensor corresponding to the sub-tensor on the left by the left-multiplication matrix and on the right by the right-multiplication matrix (for example, B of scale Y × Y) to obtain the winograd transformation result of the meta-sub-tensor.
  • the left multiplication matrix and the right multiplication matrix are both determined by the scale of the sub-tensor and the winograd transformation type, wherein the winograd transformation type includes the winograd transformation type of the forward transformation and the winograd transformation type of the inverse transformation .
  • the transformation matrices (such as B, G, A) used to implement the forward or inverse transformation of data are determined by the scale of the data, and the transformation matrices corresponding to data of different scales are preset, known matrices. The winograd transformation result of the meta-sub-tensor of the first sub-tensor can therefore be understood as a constant matrix. The X × Y winograd transformation results corresponding to the X × Y sub-tensors are then added to obtain the winograd transformation result of the feature data.
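The disassembly described above can be sketched as follows (illustrative Python; B_T is the standard 4x4 F(2x2, 3x3) forward-transform matrix assumed for a 4x4 tile, and the function name is hypothetical). The coefficient-weighted sum of the constant per-element transforms equals the full transform B^T d B:

```python
import numpy as np

# Assumed 4x4 forward-transform matrix B^T (standard F(2x2, 3x3) matrix
# from the literature; not listed in this publication).
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)

def transform_by_subtensors(d):
    """Disassemble d into sub-tensors with a single non-zero element each.
    For every non-zero element, the meta-sub-tensor has a 1 at that
    position; its transform B^T e B is a constant matrix known in advance,
    so the transform of d is a coefficient-weighted sum of constants."""
    result = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if d[i, j] == 0:
                continue                # zero elements contribute no sub-tensor
            meta = np.zeros((4, 4))
            meta[i, j] = 1.0            # meta-sub-tensor: non-zero element set to 1
            const = B_T @ meta @ B_T.T  # precomputable constant matrix
            result += d[i, j] * const   # element value used as a coefficient
    return result
```

Because the entries of B are only 0 and ±1, each weighted constant matrix merely adds or subtracts the element's value at fixed positions, so the whole transform reduces to summation operations.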
  • the winograd transformation includes forward transformation and/or inverse transformation.
  • Above, the winograd transformation of the feature data is used to exemplify the disassembly of the transformation operation into a summation operation; the same disassembly method also applies to the forward transformation operation of the weight data (GgG^T) and to the inverse transformation operation of the result of the bitwise multiplication of (GgG^T) and (B^T dB), which will not be repeated here.
  • the control instruction includes a first instruction and a second instruction, wherein the first instruction includes a forward transformation instruction, and the second instruction includes a bitwise multiplication instruction and an inverse transformation instruction.
  • the control unit 11 is used to issue the first instruction and the second instruction.
  • the first instruction and the second instruction issued by the control unit 11 may be extracted from the storage unit 12 in advance, or may be written in advance from the outside and stored in the control unit 11.
  • both the first instruction and the second instruction include an operation code and an operation address.
  • the first instruction includes a forward transformation operation code and an operation address corresponding to the forward transformation instruction.
  • the second instruction includes a bitwise multiplication operation code and operation address corresponding to the bitwise multiplication instruction, and an inverse transformation operation code and operation address corresponding to the inverse transformation instruction.
  • Each instruction can include an operation code and one or more operation addresses.
  • the operation address may specifically be a register address.
  • the storage unit 12 shown in FIG. 1 stores data.
  • the data stored in the storage unit 12 may be, for example, data needed by the computing device 10 in the winograd convolution operation.
  • the data stored in the storage unit 12 includes the first instruction and the second instruction.
  • the structure of the storage unit 12 may include, for example, a register, a buffer, and a data input/output unit.
  • the arithmetic unit 13 is configured to respond to the first instruction, extract the feature data from the storage unit 12, and perform a forward transformation on the feature data, wherein the arithmetic unit 13 disassembles the forward transformation of the feature data into a summation operation and completes the forward transformation according to the summation operation to obtain a feature transformation result.
  • the arithmetic unit 13 parses the first instruction to obtain an instruction for performing a forward transformation on the characteristic data.
  • the arithmetic unit 13 reads the characteristic data from the storage unit 12, performs a forward transformation on the characteristic data, and obtains the result of the characteristic transformation.
  • the arithmetic unit 13 may also obtain a feature transformation matrix corresponding to the feature data from the storage unit 12 according to the scale of the feature data.
  • the arithmetic unit 13 is further configured to obtain the forward-transformed weight transformation result in response to the second instruction, perform a bitwise multiplication of the weight transformation result and the feature transformation result to obtain the multiplication operation result, and inversely transform the multiplication operation result, wherein the arithmetic unit disassembles the inverse transformation of the multiplication operation result into a summation operation and completes the inverse transformation according to the summation operation to obtain the operation result.
  • when the arithmetic unit 13 obtains the second instruction sent by the control unit 11, it obtains the bitwise multiplication instruction and the inverse transformation instruction. According to the bitwise multiplication instruction, the arithmetic unit 13 obtains the weight transformation result and the feature transformation result and multiplies them bitwise. The weight transformation result may be pre-stored in the storage unit 12 and fetched once the feature transformation result is obtained, or the weight transformation result and the feature transformation result may be computed at the same time and then multiplied bitwise. Bitwise multiplication multiplies the elements at the same row-and-column positions of two matrices one-to-one; it does not change the matrix size.
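A minimal numeric illustration of the bitwise multiplication (illustrative values, not from the publication):

```python
import numpy as np

U = np.array([[1.0, 2.0],
              [3.0, 4.0]])   # weight transformation result (illustrative)
V = np.array([[5.0, 6.0],
              [7.0, 8.0]])   # feature transformation result (illustrative)

# Bitwise (element-wise) multiplication: same-position elements are
# multiplied one-to-one, and the matrix size is unchanged.
M = U * V
# M is [[5, 12], [21, 32]], still a 2x2 matrix
```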
  • after the bitwise multiplication, the arithmetic unit 13 obtains the inverse transformation matrix (for example, A) corresponding to the multiplication operation result according to the inverse transformation instruction, and uses the inverse transformation matrix to inversely transform the multiplication operation result to obtain the operation result.
  • the calculation result obtained by the computing unit 13 can be understood as the feature extraction result of the image to be processed.
  • the inverse transformation of the multiplication operation result can be disassembled into a summation operation, and the inverse transformation of the multiplication operation result is completed according to the summation operation to obtain the operation result,
  • the disassembly method is the same as the disassembly method for the positive transformation of the feature data in the foregoing embodiment, and will not be repeated here.
  • the above computing device obtains the feature transformation result through the forward transformation of the feature data, performs bitwise multiplication and inverse transformation on the weight transformation result and the feature transformation result, and disassembles the inverse transformation of the multiplication operation result into a summation operation, replacing the large number of multiplications in the existing convolution process with additions; reducing the multiplications accelerates the operation and lowers resource consumption.
  • the weight transformation result may be calculated at the same time as the feature transformation result, or it may be obtained before the feature transformation result and stored in advance.
  • the storage unit 12 is specifically configured to store weight data.
  • the arithmetic unit 13 shown in FIG. 1 is specifically configured to extract the weight data from the storage unit 12 and perform a forward transformation on it, wherein the arithmetic unit disassembles the forward transformation of the weight data into a summation operation and completes the forward transformation according to the summation operation to obtain a weight transformation result.
  • the storage unit 12 is specifically configured to receive and store the weight conversion result.
  • the arithmetic unit 13 is specifically configured to obtain the weight conversion result from the storage unit 12 in response to the second instruction.
  • the arithmetic unit 13 may perform a forward transformation on the weight data in advance to obtain the weight transformation result, and store the weight transformation result in the storage unit 12.
  • the arithmetic unit 13 performs a forward transformation on the feature data in response to the first instruction, and obtains the result of the feature transformation.
  • the result of the weight transformation can be directly extracted, reducing the operation time of the winograd convolution operation.
  • the arithmetic unit 13 extracts the weight transformation result and the feature transformation result from the storage unit 12 and multiplies them bitwise to obtain the multiplication result; it then inversely transforms the multiplication result, wherein the arithmetic unit 13 disassembles the inverse transformation of the multiplication result into a summation operation and completes the inverse transformation according to the summation operation to obtain the operation result.
  • the process in which the arithmetic unit 13 performs the forward transformation on the feature data to obtain the feature transformation result and the process in which the arithmetic unit 13 performs the bitwise multiplication on the weight transformation result and the feature transformation result can be executed simultaneously, to improve calculation speed and efficiency.
  • the arithmetic unit 13 obtains the weight data and the feature data from the storage unit 12 and first forward-transforms each of them to obtain the weight transformation result and the feature transformation result; then, in response to the second instruction, the arithmetic unit 13 performs bitwise multiplication and inverse transformation on the weight transformation result and the feature transformation result.
  • the arithmetic unit 13 may have multiple specific structures; for example, one type of arithmetic unit 13 may include a first arithmetic unit and a second arithmetic unit, while another type may include an addition operation unit and a multiplication operation unit. The two possible structures are illustrated below with reference to the drawings.
  • Fig. 2 is a structural block diagram of an arithmetic device in another embodiment.
  • the arithmetic unit 13 may include a first arithmetic unit 131 and a second arithmetic unit 132.
  • the first arithmetic unit 131 is configured to respond to the first instruction, extract the feature data from the storage unit, and perform a forward transformation on it, wherein the first arithmetic unit 131 disassembles the forward transformation of the feature data into a summation operation and completes the forward transformation according to the summation operation to obtain a feature transformation result.
  • the first arithmetic unit 131 when the first arithmetic unit 131 performs a positive transformation on the characteristic data, it may also simultaneously obtain weight data from the storage unit 12, and perform a forward transformation on the weight data to obtain a weight transformation result. Then, both the obtained feature transformation result and weight transformation result are sent to the second computing unit 132.
  • since the transmission bandwidth between the first arithmetic unit 131 and the second arithmetic unit 132 is limited, the weight data can be forward-transformed before the first instruction is received, and the resulting weight transformation result stored in the storage unit 12, in order to reduce bandwidth occupation. When the first arithmetic unit 131 sends the feature transformation result to the second arithmetic unit 132, the second arithmetic unit 132 can then obtain the weight transformation result directly from the storage unit 12.
  • the second arithmetic unit 132 is configured to obtain the weight transformation result in response to the second instruction, perform bitwise multiplication of the weight transformation result and the feature transformation result to obtain the multiplication operation result, and inversely transform the multiplication operation result, wherein the second arithmetic unit 132 disassembles the inverse transformation of the multiplication operation result into a summation operation and completes the inverse transformation according to the summation operation to obtain the operation result.
  • Fig. 3 is a structural block diagram of an arithmetic device in another embodiment.
  • the second arithmetic unit 132 may specifically include a multiplication unit 1321 and an inverse transform unit 1322.
  • the multiplication unit 1321 is configured to respond to the second instruction to obtain the weight transformation result, and perform bitwise multiplication on the weight transformation result and the feature transformation result to obtain the multiplication operation result.
  • The multiplication unit 1321 shown in FIG. 3 may obtain the pre-stored weight transformation result from the storage unit 12, or obtain the computed weight transformation result from the first arithmetic unit 131; this is not limited herein.
  • The inverse transformation unit 1322 is configured to perform an inverse transformation on the multiplication result, wherein the inverse transformation unit 1322 disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • The inverse transformation unit 1322 may specifically obtain an inverse transformation matrix (for example, A) from the storage unit 12 for inverse-transforming the multiplication result, and use the inverse transformation matrix to inverse-transform the multiplication result to obtain the operation result.
  • Fig. 4 is a structural block diagram of an arithmetic device in another embodiment.
  • the arithmetic unit 13 includes an addition unit 141 and a multiplication unit 142.
  • During the operation, the steps that can be completed by addition are executed by the addition unit 141, and the element-wise multiplication is executed by the dedicated multiplication operation unit 142.
  • The addition unit 141 is configured to obtain feature data from the storage unit 12 in response to the first instruction and perform a forward transformation on the feature data, wherein the addition unit 141 disassembles the forward transformation of the feature data into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result.
  • The addition unit 141 may perform a forward transformation on the weight data before receiving the first instruction, obtain the weight transformation result, and store it in the storage unit 12. Therefore, the weight transformation result need not be transmitted between the addition unit 141 and the multiplication operation unit 142, which reduces the requirement on transmission bandwidth and improves transmission speed.
  • The addition unit 141 may, in response to the first instruction, perform a forward transformation on the feature data and simultaneously perform a forward transformation on the weight data, obtaining the weight transformation result and transmitting it to the multiplication operation unit 142 together with the feature transformation result.
  • the weight data may be data stored in the storage unit 12.
  • The multiplication operation unit 142 is configured to obtain the weight transformation result in response to the second instruction, and to perform element-wise multiplication on the weight transformation result and the feature transformation result to obtain the multiplication result.
  • The multiplication operation unit 142 multiplies corresponding elements of the weight transformation result and the feature transformation result one by one to obtain the multiplication result. For example, for a 4×4 forward-transformed weight matrix and a 4×4 feature data transformation result, a total of 16 multiplications are performed to obtain a 4×4 multiplication result.
  • The addition unit 141 is further configured to perform an inverse transformation on the multiplication result in response to the second instruction, wherein the addition unit 141 disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • The addition unit 141 obtains the multiplication result from the multiplication operation unit 142, may obtain the inverse transformation matrix from the storage unit 12, and performs the inverse transformation on the multiplication result with the inverse transformation matrix to obtain the operation result.
  • FIG. 5 is a schematic flowchart of an operation method in an embodiment.
  • the arithmetic method shown in FIG. 5 is applied to the arithmetic device in the above-mentioned embodiment, and the arithmetic device includes: a control unit, a storage unit, and an arithmetic unit.
  • the calculation method shown in Figure 5 includes:
  • control unit sends a control instruction, where the control instruction is used to instruct the arithmetic unit to perform a winograd convolution operation.
  • the storage unit stores data used for winograd convolution operation.
  • The arithmetic unit extracts data from the storage unit to perform a winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation, and completes the winograd transformation of the data according to the summation operation.
  • A control instruction is sent through a control unit to instruct the arithmetic unit to perform a winograd convolution operation, and the storage unit stores the data used for the winograd convolution operation. The arithmetic unit, in response to the control instruction, extracts data from the storage unit to perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution into a summation operation and completes the winograd transformation of the data according to the summation operation. Replacing the large number of multiplications in the transformation with additions accelerates the winograd convolution operation and saves computing resources.
  • the solution provided in this application can reduce the resource consumption of convolution operations, increase the speed of convolution operations, and reduce computing time.
  • the arithmetic unit disassembles the data into multiple sub-tensors; performs a transformation operation on the multiple sub-tensors and sums them, and obtains the winograd transformation result of the data according to the result of the summation operation .
  • The arithmetic unit parses the data to obtain multiple sub-tensors, where the data is the sum of the multiple sub-tensors, the number of the multiple sub-tensors is the same as the number of non-zero elements in the data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is identical to the non-zero element at the corresponding position in the data.
  • The arithmetic unit obtains the winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; multiplies the non-zero element value of the sub-tensor, as a coefficient, by the winograd transformation result of the corresponding meta-sub-tensor to obtain the winograd transformation result of the sub-tensor; and adds the winograd transformation results of the multiple sub-tensors to obtain the winograd transformation result of the data.
  • The arithmetic unit multiplies the meta-sub-tensor corresponding to the sub-tensor by the left-multiplication matrix on the left and by the right-multiplication matrix on the right to obtain the winograd transformation result of the meta-sub-tensor, wherein the left-multiplication matrix and the right-multiplication matrix are both determined by the scale of the sub-tensor and the winograd transformation type, and the winograd transformation type includes the winograd transformation type of the forward transformation and the winograd transformation type of the inverse transformation.
  • the data includes at least one of feature data and weight data
  • the winograd transformation includes a forward transformation and/or an inverse transformation.
  • The control instruction includes a first instruction and a second instruction, wherein the first instruction includes a forward transformation instruction, and the second instruction includes an element-wise multiplication instruction and an inverse transformation instruction;
  • The arithmetic unit extracts the feature data from the storage unit and performs a winograd convolution operation on the feature data, wherein the arithmetic unit disassembles the transformation operation on the feature data in the winograd convolution into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The arithmetic unit also, in response to the second instruction, obtains the forward-transformed weight transformation result, performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result, and performs an inverse transformation on the multiplication result, wherein the arithmetic unit disassembles the inverse transformation of the multiplication result into a summation operation and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • the storage unit receives and stores the weight conversion result
  • The arithmetic unit, in response to the second instruction, extracts the weight transformation result from the storage unit.
  • the storage unit stores weight data
  • The arithmetic unit extracts the weight data from the storage unit and performs a forward transformation on the weight data, wherein the arithmetic unit disassembles the forward transformation of the weight data into a summation operation, and completes the forward transformation of the weight data according to the summation operation to obtain the weight transformation result.
  • the arithmetic unit includes: a first arithmetic unit and a second arithmetic unit;
  • The first arithmetic unit, in response to the first instruction, extracts feature data from the storage unit and performs a forward transformation on the feature data, wherein the first arithmetic unit disassembles the forward transformation of the feature data into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The second arithmetic unit obtains the weight transformation result in response to the second instruction, performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result, and performs an inverse transformation on the multiplication result, wherein the second arithmetic unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • the second arithmetic unit includes: a multiplication unit and an inverse transform unit;
  • The multiplication unit obtains the forward-transformed weight transformation result, and performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result;
  • The inverse transform unit performs an inverse transformation on the multiplication result, wherein the inverse transform unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • the operation unit includes: an addition operation unit and a multiplication operation unit;
  • The addition unit obtains feature data from the storage unit and performs a forward transformation on the feature data, wherein the addition unit disassembles the forward transformation of the feature data into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The multiplication operation unit, in response to the second instruction, obtains the weight transformation result, and performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain the multiplication result;
  • The addition unit is further configured to perform an inverse transformation on the multiplication result in response to the second instruction, wherein the addition unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • Although the steps in the flowchart are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict order restricting the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their execution order is not necessarily sequential, and they may be performed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
  • the foregoing device embodiments are only illustrative, and the device of the present disclosure may also be implemented in other ways.
  • the division of units/modules in the above-mentioned embodiments is only a logical function division, and there may be other division methods in actual implementation.
  • multiple units, modules or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
  • the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
  • the hardware may be a digital circuit, an analog circuit, and so on.
  • the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
  • the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
  • The storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as resistive random access memory RRAM (Resistive Random Access Memory), dynamic random access memory DRAM (Dynamic Random Access Memory), static random access memory SRAM (Static Random-Access Memory), enhanced dynamic random access memory EDRAM (Enhanced Dynamic Random Access Memory), high-bandwidth memory HBM (High-Bandwidth Memory), hybrid memory cube HMC (Hybrid Memory Cube), and so on.
  • If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
  • The technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • an artificial intelligence chip is also disclosed, which includes the aforementioned computing device.
  • A board card is disclosed, which includes a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip, wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to realize data transmission between the artificial intelligence chip and an external device; and the control device is used to monitor the state of the artificial intelligence chip.
  • Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure.
  • the board card may include other supporting components in addition to the chip 389 described above.
  • the supporting components include, but are not limited to: a storage device 390, Interface device 391 and control device 392;
  • the storage device 390 is connected to the artificial intelligence chip through a bus for storing data.
  • The storage device may include multiple groups of storage units 393, each group being connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate SDRAM, double data rate synchronous dynamic random access memory).
  • The storage device may include 4 groups of storage units. Each group of storage units may include a plurality of DDR4 chips.
  • The artificial intelligence chip may include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical data transmission bandwidth can reach 25600 MB/s.
  • Each group of storage units includes a plurality of double data rate synchronous dynamic random access memories arranged in parallel.
  • DDR can transmit data twice in one clock cycle.
  • a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
  • the interface device is electrically connected with the artificial intelligence chip.
  • the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
  • the interface device may be a standard PCIE interface.
  • the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
  • The interface device may also be another type of interface. The present disclosure does not limit the specific form of the other interfaces mentioned above, as long as the interface unit can realize the transfer function.
  • the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
  • the control device is electrically connected with the artificial intelligence chip.
  • the control device is used to monitor the state of the artificial intelligence chip.
  • the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
  • the control device may include a single-chip microcomputer (Micro Controller Unit, MCU).
  • The artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as heavy-load and light-load.
  • The control device can regulate and control the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
  • an electronic device which includes the aforementioned artificial intelligence chip.
  • Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
  • the transportation means include airplanes, ships, and/or vehicles;
  • the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
  • the medical equipment includes nuclear magnetic resonance, B-ultrasound and/or electrocardiograph.
  • An arithmetic method applied to an arithmetic device comprising: a control unit, a storage unit, and an arithmetic unit; wherein,
  • the control unit sends a control instruction, and the control instruction is used to instruct the arithmetic unit to perform a winograd convolution operation,
  • the storage unit stores data used for winograd convolution operation
  • The arithmetic unit extracts data from the storage unit to perform a winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation, and completes the winograd transformation of the data according to the summation operation.
  • the arithmetic unit disassembles the data into multiple sub-tensors; performs transformation operations on the multiple sub-tensors and sums them, and obtains the data according to the result of the summation operation The result of winograd transformation.
  • The arithmetic unit parses the data to obtain multiple sub-tensors, where the data is the sum of the multiple sub-tensors, the number of the multiple sub-tensors is the same as the number of non-zero elements in the data, each sub-tensor has a single non-zero element, and the non-zero element in the sub-tensor is identical to the non-zero element at the corresponding position in the data.
  • The arithmetic unit obtains the winograd transformation result of the meta-sub-tensor corresponding to each sub-tensor, where the meta-sub-tensor is a tensor in which the non-zero element of the sub-tensor is set to 1; the non-zero element value of the sub-tensor is used as a coefficient and multiplied by the winograd transformation result of the corresponding meta-sub-tensor to obtain the winograd transformation result of the sub-tensor; and the winograd transformation results of the multiple sub-tensors are added to obtain the winograd transformation result of the data.
  • the operation unit multiplies the left side of the element sub-tensor corresponding to the sub-tensor by the left multiplication matrix, and the right side by the right multiplication matrix, Obtain the winograd transformation result of the element sub-tensor, where the left multiplication matrix and the right multiplication matrix are both determined by the size of the subtensor and the winograd transformation type, wherein the winograd transformation type includes positive The winograd transformation type of the transformation and the winograd transformation type of the inverse transformation.
  • the data includes at least one of characteristic data and weight data
  • the winograd transformation includes a forward transformation and/or an inverse transformation.
  • The control instruction includes a first instruction and a second instruction, wherein the first instruction includes a forward transformation instruction, and the second instruction includes an element-wise multiplication instruction and an inverse transformation instruction;
  • The arithmetic unit extracts the feature data from the storage unit and performs a winograd convolution operation on the feature data, wherein the arithmetic unit disassembles the transformation operation on the feature data in the winograd convolution into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The arithmetic unit also, in response to the second instruction, obtains the forward-transformed weight transformation result, performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result, and performs an inverse transformation on the multiplication result, wherein the arithmetic unit disassembles the inverse transformation of the multiplication result into a summation operation and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • the storage unit receives and stores the weight conversion result
  • The arithmetic unit, in response to the second instruction, extracts the weight transformation result from the storage unit.
  • the storage unit stores weight data
  • The arithmetic unit extracts the weight data from the storage unit and performs a forward transformation on the weight data, wherein the arithmetic unit disassembles the forward transformation of the weight data into a summation operation, and completes the forward transformation of the weight data according to the summation operation to obtain the weight transformation result.
  • The first arithmetic unit, in response to the first instruction, extracts feature data from the storage unit and performs a forward transformation on the feature data, wherein the first arithmetic unit disassembles the forward transformation of the feature data into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The second arithmetic unit obtains the weight transformation result in response to the second instruction, performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result, and performs an inverse transformation on the multiplication result, wherein the second arithmetic unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • The multiplication unit obtains the forward-transformed weight transformation result, and performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain a multiplication result;
  • The inverse transform unit performs an inverse transformation on the multiplication result, wherein the inverse transform unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.
  • The addition unit obtains feature data from the storage unit and performs a forward transformation on the feature data, wherein the addition unit disassembles the forward transformation of the feature data into a summation operation, and completes the forward transformation of the feature data according to the summation operation to obtain a feature transformation result;
  • The multiplication operation unit, in response to the second instruction, obtains the weight transformation result, and performs element-wise multiplication on the weight transformation result and the feature transformation result to obtain the multiplication result;
  • The addition unit is further configured to perform an inverse transformation on the multiplication result in response to the second instruction, wherein the addition unit disassembles the inverse transformation of the multiplication result into a summation operation, and completes the inverse transformation of the multiplication result according to the summation operation to obtain the operation result.

Abstract

An arithmetic device (10), comprising a storage unit (12), a control unit (11), and an arithmetic unit (13). The arithmetic device (10) can reduce the resource consumption of convolution operations, increase the speed of convolution operations, and reduce computing time.

Description

Arithmetic Device
This application claims priority to Chinese patent application No. 2019110619519, entitled "Arithmetic Device" and filed with the Chinese Patent Office on November 1, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to an arithmetic device.
Background
With the development of artificial intelligence technology, various arithmetic devices used to implement artificial intelligence technology, and networks containing such arithmetic devices, have been widely applied in fields such as computer vision and natural language processing.
To meet ever higher task requirements, networks containing arithmetic devices keep growing in scale and must perform ever larger amounts of computation, especially convolution operations. When performing convolution operations, existing arithmetic devices consume much power and take a long time to compute, which restricts their application in the field of artificial intelligence technology.
Summary
In view of this, it is necessary to address the above technical problems by providing an arithmetic device capable of increasing operation speed, reducing operation time, and lowering power consumption.
A first aspect of the embodiments of this application provides an arithmetic device for performing winograd convolution operations, the device comprising: a control unit, a storage unit, and an arithmetic unit;
the control unit is configured to send a control instruction, the control instruction being used to instruct the arithmetic unit to perform a winograd convolution operation;
the storage unit is configured to store data used for the winograd convolution operation;
the arithmetic unit is configured to, in response to the control instruction, extract data from the storage unit to perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation, and completes the winograd transformation of the data according to the summation operation.
A second aspect of the embodiments of this application provides an artificial intelligence chip, the chip comprising the arithmetic device of any one of the implementations of the first aspect of this application.
A third aspect of the embodiments of this application provides an electronic device, the electronic device comprising the artificial intelligence chip of the second aspect of this application.
A fourth aspect of the embodiments of this application provides a board card, the board card comprising: a storage device, an interface device, a control device, and the artificial intelligence chip of the second aspect of this application;
wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
the storage device is used to store data;
the interface device is used to implement data transmission between the artificial intelligence chip and an external device;
the control device is used to monitor the state of the artificial intelligence chip.
A fifth aspect of the embodiments of this application provides an arithmetic method applied to an arithmetic device, the arithmetic device comprising: a control unit, a storage unit, and an arithmetic unit; wherein
the control unit sends a control instruction, the control instruction being used to instruct the arithmetic unit to perform a winograd convolution operation;
the storage unit stores data used for the winograd convolution operation;
the arithmetic unit, in response to the control instruction, extracts data from the storage unit to perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution operation into a summation operation, and completes the winograd transformation of the data according to the summation operation.
In summary, in the solutions of the embodiments of this application, the control unit sends a control instruction instructing the arithmetic unit to perform a winograd convolution operation, and the storage unit stores the data used for the winograd convolution operation; the arithmetic unit, in response to the control instruction, extracts data from the storage unit to perform the winograd convolution operation, wherein the arithmetic unit disassembles the transformation operation on the data in the winograd convolution into a summation operation and completes the winograd transformation of the data according to the summation operation. By replacing the large number of multiplications in the transformation with additions, the winograd convolution is accelerated and computing resources are saved. The solution provided by this application can reduce the resource consumption of convolution operations, increase the speed of convolution operations, and reduce computing time.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below.
Fig. 1 is a structural block diagram of an arithmetic device in an embodiment;
Fig. 2 is a structural block diagram of an arithmetic device in another embodiment;
Fig. 3 is a structural block diagram of an arithmetic device in still another embodiment;
Fig. 4 is a structural block diagram of an arithmetic device in yet another embodiment;
Fig. 5 is a schematic flowchart of an arithmetic method in an embodiment;
Fig. 6 is a structural block diagram of a board card in an embodiment.
The reference signs in Figs. 1 to 6 are as follows:
10: arithmetic device;
11: control unit;
12: storage unit;
13: arithmetic unit;
131: first arithmetic unit;
132: second arithmetic unit;
1321: multiplication unit;
1322: inverse transform unit;
141: addition unit;
142: multiplication operation unit;
389: chip;
390: storage device;
391: interface device;
392: control device.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments in the present disclosure, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second", "third", "fourth", etc. in the claims, the specification, and the drawings of the present disclosure are used to distinguish different objects, not to describe a specific order. The terms "comprise" and "include" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification of the present disclosure are only for the purpose of describing specific embodiments and are not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, unless the context clearly indicates otherwise, the singular forms "a", "an", and "the" are intended to include the plural forms. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
To clearly understand the technical solution of this application, the technical terms involved in the prior art and in the embodiments of this application are explained below:
Convolution operation: a convolution operation starts from the upper-left corner of an image and opens a sliding window of the same size as the template; the sliding window corresponds to a window image, and the template serves as the convolution kernel. The window image and the corresponding pixels of the image are multiplied element by element and then summed, and the result is used as the first pixel value of the new image produced by the convolution. The sliding window then moves one column to the right, the window image corresponding to the sliding window is again multiplied element by element with the corresponding pixels of the image and summed, and the result is used as the second pixel value of the new image. Proceeding in this way, from left to right and from top to bottom, a new image is obtained.
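The sliding-window procedure described above can be sketched in a few lines. This is an illustrative sketch only, not code from the application; the function name `conv2d_valid`, the no-padding/stride-1 choice, and the sample inputs are assumptions:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a window of the kernel's size over the image; at each
    position, multiply corresponding elements and sum them to produce
    one pixel of the new image (no padding, stride 1)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):          # top to bottom
        for j in range(out.shape[1]):      # left to right
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16.0).reshape(4, 4)      # hypothetical 4x4 input image
kernel = np.ones((3, 3))                   # hypothetical 3x3 template
result = conv2d_valid(image, kernel)       # 2x2 new image
```

Each output pixel costs kh×kw multiplications here, which is the cost that the winograd scheme described later reduces.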
winograd卷积:Winograd卷积是一种基于多项式插值算法的卷积加速实现方式。它通过对卷积操作的两个输入:第一目标矩阵和第二目标矩阵分别进行winograd卷积正变换,再将正变换后的第一目标矩阵和第二目标矩阵进行对位乘法,最后对对位乘法结果再次进行winograd卷积逆变换,得到与原卷积操作等价的卷积结果。
现有的人工智能技术,通常是基于处理器的卷积运算来实现特征的提取,例如神经网络中对特征数据的运算处理。神经网络中的卷积层对输入的特征数据以预设的卷积核进行卷积处理,并输出运算结果。其中,卷积层具体可以包含有相邻设置的多层卷积层,且每一层卷积层得到的运算结果,为对其上一层卷积层输入的特征数据。
针对每一卷积层中的运算,现有的卷积运算中,以包含权值数据的卷积核为窗口,在输入的特征数据的矩阵上“滑动”,对每个滑动位置上的局部矩阵执行矩阵乘,并根据矩阵乘的结果得到最后的运算结果。可见,现有的卷积运算中,需要大量运用到矩阵乘,而矩阵乘需要对特征数据中每行矩阵元素与卷积核的每列矩阵元素进行对位相乘后累加的处理,其计算量庞大、处理效率低,运算装置能耗也高。
为了解决上述问题,本申请实施例提供了一种运算装置,用于进行winograd卷积运算。其中,还进一步将winograd卷积运算中对数据的变换运算拆解为求和运算,并根据该求和运算完成所述数据的winograd变换,实现对运算过程的加速。参见图1,为一个实施例中运算装置的结构框图。如图1所示的运算装置10,包括:控制单元11、存储单元12、以及运算单元13。其中,控制单元11通过发出指令而对存储单元12和运算单元13进行控制,最终得到运算结果。在本实施例中,在对特征数据进行正变换的过程中,特征数据就是该过程中的目标矩阵;在对权值数据进行正变换的过程中,权值数据就是该过程中的目标矩阵;在对乘法运算结果进行逆变换的过程中,乘法运算结果就是该过程中的目标矩阵。
继续参见图1,控制单元11,用于发送控制指令,所述控制指令用于指示所述运算单元进行winograd卷积运算。例如,控制指令可以包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令。控制单元11,用于发送第一指令和第二指令,以控制所述运算单元从存储单元中提取数据,进行相应的winograd变换。
存储单元12,用于存储用于winograd卷积运算的数据。该数据例如包括特征数据、权值数据中的至少一种。所述winograd变换包括正变换和/或逆变换。运算单元13,用于响应所述控制指令,从所述存储单元中提取数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述数据的变换运算拆解为求和运算,并根据所述求和运算完成所述数据的winograd变换。在本申请的各种实施例中,winograd卷积运算可以理解为是采用下式进行计算:
S=A^T((GgG^T)⊙(B^TdB))A
其中,S表示卷积矩阵,即使用特征数据与权值数据进行卷积运算得到的结果矩阵;d表示特征数据;g表示权值数据;B表示将特征数据实现正变换的特征变换矩阵;B^T表示B的转置;G表示将权值数据实现正变换的权值变换矩阵;G^T表示G的转置;A表示将对位乘后的乘法运算结果实现逆变换的变换矩阵;A^T表示A的转置。
由上式可知,在winograd卷积运算中需要进行多次winograd变换(正变换或逆变换),而这些winograd变换涉及大量乘法运算。本申请通过将winograd卷积运算中对数据(例如特征数据)的变换运算拆解为求和运算,并根据求和运算完成所述数据的winograd变换,以加法运算替代变换运算中的大量乘法运算,加速了winograd卷积运算的速度,也节约了运算资源。本申请提供的方案可降低卷积运算的资源消耗、提高卷积运算速度、减少运算时间。
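上式的计算流程可以用如下纯Python草图验证。这里假设采用F(2×2,3×3)时公开文献中常用的B、G、A取值,数据沿用前文的假设示例,仅作说明,并非本申请装置的实现:

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

# F(2×2,3×3)常用的变换矩阵取值(假设,来自公开文献)
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]   # B的转置
G   = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]     # 权值变换矩阵
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]                                # A的转置

d = [[1, 2, 3, 0], [0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1]]  # 4×4特征数据
g = [[2, 0, 1], [0, 1, 2], [1, 0, 2]]                          # 3×3权值数据

U = matmul(matmul(G, g), transpose(G))        # GgG^T:权值正变换
V = matmul(matmul(B_T, d), transpose(B_T))    # B^T d B:特征正变换
M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 对位乘
S = matmul(matmul(A_T, M), transpose(A_T))    # A^T M A:逆变换
print(S)  # [[15.0, 16.0], [6.0, 15.0]],与直接滑动窗口卷积的结果一致
```

对位乘只需4×4=16次乘法,少于直接卷积的2×2×9=36次,这正是winograd卷积的加速来源。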
在一些实施例中,所述运算单元13,具体用于将所述数据拆解为多个子张量;对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果。
将数据拆解为多个子张量的过程,可以理解为:运算单元,具体用于从所述数据解析得到多个子张量,其中,所述数据为所述多个子张量之和,所述多个子张量的个数与所述数据中非0元素的个数相同,每个所述子张量中有单个非0元素,且在所述子张量中的非0元素与在所述数据中对应位置的非0元素相同。
以下面X×Y规模的特征数据d为例,对运算单元解析数据得到多个子张量的过程进行说明。
d = [d_00 d_01 … d_0(Y-1); d_10 d_11 … d_1(Y-1); …; d_(X-1)0 d_(X-1)1 … d_(X-1)(Y-1)](其中分号分隔矩阵的行)
将上述特征数据d拆解为多个子张量:
d = [d_00 0 … 0; 0 0 … 0; …; 0 0 … 0] + [0 d_01 … 0; 0 0 … 0; …; 0 0 … 0] + … + [0 0 … 0; …; 0 0 … d_(X-1)(Y-1)]
假设特征数据d中所有元素都是非0元素,由此,得到X×Y个子张量。在其他实施例中,假如特征数据d中仅有3个元素是非0元素,则只能得到3个子张量,在此不做具体举例。
[d_00 0 … 0; 0 0 … 0; …; 0 0 … 0]、[0 d_01 … 0; 0 0 … 0; …; 0 0 … 0]、……、[0 0 … 0; …; 0 0 … d_(X-1)(Y-1)]
在上述X×Y个子张量中,以第一个子张量为例,其具有单个非0元素d_00,其余元素均为0,且该子张量的这个非0元素,与特征数据中对应位置的元素相同,均为d_00。
在拆解得到多个子张量后,运算单元对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果的过程,具体可以是:运算单元获取各子张量对应的元子张量的winograd变换结果,其中,所述元子张量是将所述子张量的非0元素置为1的张量;将所述子张量中非0的元素值作为系数乘以对应的元子张量的winograd变换结果,得到所述子张量的winograd变换结果;将多个子张量的winograd变换结果相加得到所述数据的winograd变换结果。
继续以上述特征数据d为例进行举例,以第一个子张量为例,将其中的非0元素d_00置为1后可以得到如下的元子张量:
[1 0 … 0; 0 0 … 0; …; 0 0 … 0](与特征数据d同规模,仅位置(0,0)处的元素为1)
对第一个子张量的元子张量可获取到如下的winograd变换结果:
B^T_(X×X)×[1 0 … 0; 0 0 … 0; …; 0 0 … 0]×B_(Y×Y)
其中,运算单元,具体用于对于每一个所述子张量,将所述子张量对应的元子张量左边乘以左乘矩阵(例如B^T_(X×X))、右边乘以右乘矩阵(例如B_(Y×Y)),得到所述元子张量的winograd变换结果。其中,所述左乘矩阵和所述右乘矩阵都是由所述子张量的规模以及winograd变换类型确定的,其中所述winograd变换类型包括正变换的winograd变换类型和逆变换的winograd变换类型。
在具体的实现方式中,用于将数据实现正变换或逆变换的变换矩阵(例如B、G、A)都是由数据的规模而确定的,不同规模的数据对应的变换矩阵都是预设的已知矩阵。因此,对上述第一个子张量的元子张量的winograd变换结果,可以理解为一常数矩阵。然后,将X×Y个子张量对应的X×Y个winograd变换结果相加,得到如下特征数据的winograd变换结果。
B^TdB = d_00×(B^T×[1 0 … 0; …; 0 0 … 0]×B) + d_01×(B^T×[0 1 … 0; …; 0 0 … 0]×B) + … + d_(X-1)(Y-1)×(B^T×[0 0 … 0; …; 0 0 … 1]×B)
所述winograd变换包括正变换和/或逆变换,上述实施例中是以特征数据的winograd变换,对变换运算拆解为求和运算进行举例,但上述拆解方式也可以用于权值数据的正变换运算(GgG^T),以及(GgG^T)与(B^TdB)对位乘的乘法运算结果的逆变换运算中,在此不做赘述。
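上述拆解方式可以用如下纯Python草图验证:对每个非0元素构造元子张量并取其变换结果(常数矩阵,可预先算好),再以元素值为系数乘加,结果与直接计算B^TdB一致。其中B取F(2×2,3×3)文献中常用的4×4变换矩阵,数据为假设值,仅作说明:

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# 假设的4×4正变换矩阵B的转置(取F(2×2,3×3)文献中的常用值)
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
B = [list(r) for r in zip(*B_T)]

d = [[1, 2, 3, 0], [0, 1, 2, 3], [3, 0, 1, 2], [2, 3, 0, 1]]
X, Y = 4, 4

# 方式一:直接计算 B^T d B
direct = matmul(matmul(B_T, d), B)

# 方式二:拆解为子张量,用元子张量的变换结果(常数矩阵)乘系数后求和
result = [[0] * Y for _ in range(X)]
for i in range(X):
    for j in range(Y):
        if d[i][j] == 0:
            continue  # 子张量个数与非0元素个数相同
        e = [[1 if (r, c) == (i, j) else 0 for c in range(Y)] for r in range(X)]
        t = matmul(matmul(B_T, e), B)  # 元子张量的winograd变换结果,可预先算好
        for r in range(X):
            for c in range(Y):
                result[r][c] += d[i][j] * t[r][c]

print(result == direct)  # True:两种方式结果一致
```

由于元子张量的变换结果是预先已知的常数矩阵,运行时只剩系数乘加,这就是以求和运算替代变换中乘法运算的含义。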
在一些实施例中,所述控制指令包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令。控制单元11,用于发出第一指令和第二指令。在一些实施例中,控制单元11发出的第一指令和第二指令可以是预先从存储单元12中提取的,也可以是预先由外部写入并存储在控制单元11内的。例如,第一指令和第二指令都包括操作码和操作地址。第一指令包括与正变换指令相对应的正变换操作码、操作地址。第二指令包括与对位乘指令相对应的对位乘操作码、操作地址,以及与逆变换指令相对应的逆变换操作码、操作地址。每条指令可以包括一个操作码以及一个或多个操作地址。其中,操作地址,具体可以是寄存器地址。
图1所示的存储单元12存储数据。存储单元12存储的数据例如可以是运算装置10在winograd卷积运算中需要用到的数据。在上述第一指令和第二指令是预先从存储单元12中提取的实施例中,存储单元12存储的数据中包含了第一指令和第二指令。存储单元12的结构例如可以包括寄存器、缓存和数据输入/输出单元。
继续参见图1,运算单元13,用于响应所述第一指令,从所述存储单元12中提取所述特征数据,对所述特征数据进行正变换,其中,所述运算单元13将对所述特征数据的正变换拆解为求和运算,并根据求和运算完成所述特征数据的正变换,得到特征变换结果。具体地,例如,运算单元13获取到控制单元11发来的第一指令时,运算单元13从第一指令解析得到对特征数据进行正变换的指令。运算单元13从存储单元12读取特征数据,对该特征数据进行正变换,得到特征变换结果。其中,运算单元13还可以根据特征数据的规模,从存储单元12获取到与该特征数据相对应的特征变换矩阵。
继续参见图1,运算单元13,还用于响应所述第二指令获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
具体地,例如,运算单元13获取到控制单元11发来的第二指令时,可以是获取到对位乘指令和逆变换指令。其中,运算单元13根据对位乘指令获取权值变换结果和特征变换结果,并对两者进行对位乘。其中,可以是在得到特征变换结果时,从存储单元12获取预先存储的权值变换结果,进行对位乘;也可以是同时计算得到权值变换结果和特征变换结果,然后对两者进行对位乘。对位乘是两个矩阵行列相同位置元素一一对应相乘得到的乘法运算结果,并不改变矩阵规模。在对位乘之后,运算单元13根据逆变换指令,获取与乘法运算结果相对应的逆变换矩阵(例如A),并以该逆变换矩阵对乘法运算结果进行逆变换,得到运算结果。示例性的,如果特征数据是待处理图像的特征数据,那么运算单元13得到的运算结果可以理解为对待处理图像的特征提取结果。上述对乘法运算结果进行逆变换的过程中,都可以将对所述乘法运算结果的逆变换拆解为求和运算,并根据求和运算完成所述乘法运算结果的逆变换,得到运算结果,拆解方式与前述实施例中对特征数据正变换的拆解方式相同,在此不做赘述。
以上运算装置,通过对特征数据的正变换得到特征变换结果,以及对权值变换结果和特征变换结果进行对位乘和逆变换,将对所述乘法运算结果的逆变换拆解为求和运算,以加法运算替代了现有卷积运算过程中的大量乘法运算,通过减少乘法运算加速运算速度、减少了资源消耗。
在上述实施例中,权值变换结果可以是与特征变换结果同时计算得到的,也可以是先于特征变换结果得到而预先存储的。
在一些实施例中,存储单元12,具体用于存储权值数据。图1所示的运算单元13,具体用于从存储单元12提取所述权值数据,对所述权值数据进行正变换,其中,所述运算单元将对所述权值数据的正变换拆解为求和运算,并根据求和运算完成所述权值数据的正变换,获得权值变换结果。
在另一些实施例中,存储单元12,具体用于接收权值变换结果并存储。运算单元13,具体用于响应所述第二指令,从所述存储单元12中获取所述权值变换结果。示例性的,本实施例在对权值数据进行预先存储时或者是接收到第一指令之前,运算单元13可以预先对所述权值数据进行正变换,获得权值变换结果,并将权值变换结果存入存储单元12。然后,运算单元13响应第一指令对特征数据进行正变换,得到特征变换结果。由此,可以直接提取权值变换结果,减少winograd卷积运算的运算时间。最后,运算单元13响应第二指令,从存储单元12中提取所述权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,运算单元13将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。可选地,运算单元13对特征数据进行正变换得到特征变换结果的过程,和运算单元13对权值变换结果和特征变换结果进行对位乘的过程,可以同步执行,以提高运算速度和效率。
又可选地,运算单元13响应第一指令,从存储单元12获取权值数据和特征数据,先分别对权值数据和特征数据进行正变换,得到权值变换结果和特征变换结果。然后,运算单元13响应第二指令对权值变换结果和特征变换结果进行对位乘和逆变换。
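上述预先计算并存储权值变换结果的流程可以示意如下。其中weight_cache模拟存储单元中预存的权值变换结果,G取F(2×2,3×3)文献常用的权值变换矩阵,函数名与数据均为说明用的假设,并非本申请装置的实现:

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# 假设的权值变换矩阵G(F(2×2,3×3)文献常用值)
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
G_T = [list(r) for r in zip(*G)]

weight_cache = {}  # 模拟存储单元中预先存储的权值变换结果

def weight_transform(name, g):
    """权值在推理中固定,GgG^T只需计算一次,之后直接从缓存提取。"""
    if name not in weight_cache:
        weight_cache[name] = matmul(matmul(G, g), G_T)
    return weight_cache[name]

g = [[2, 0, 1], [0, 1, 2], [1, 0, 2]]
u1 = weight_transform("conv1", g)   # 首次:计算并存入
u2 = weight_transform("conv1", g)   # 再次:直接命中,无需重复计算与传输
print(u1 is u2)  # True
```

这对应正文中“预先对权值数据进行正变换并存入存储单元,从而降低运算单元之间传输带宽要求”的设计取舍。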
对于上述实施例中的运算单元13,具体结构可以是多种,例如,一种运算单元13可以包括第一运算单元和第二运算单元;另一种运算单元13可以包括加法运算单元和乘法运算单元,下面结合附图对这两种可能的结构进行举例说明。
参见图2,为另一个实施例中运算装置的结构框图。如图2所示的运算装置10,运算单元13可以包括第一运算单元131和第二运算单元132。
其中,第一运算单元131,用于响应所述第一指令,从所述存储单元提取特征数据,对所述特征数据进行正变换,其中,第一运算单元131将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果。在一些实施例中,第一运算单元131在对所述特征数据进行正变换时,还可以同时从所述存储单元12获取权值数据,对权值数据进行正变换,得到权值变换结果。然后,将得到的特征变换结果和权值变换结果都发送至第二运算单元132。在另一些实施例中,由于第一运算单元131与第二运算单元132之间传输带宽有限,为了降低带宽占用量,第一运算单元131在响应第一指令对所述特征数据进行正变换之前,例如在接收到第一指令之前,可以预先对权值数据进行正变换,得到权值变换结果,并将权值变换结果存储在存储单元12中。在第一运算单元131将特征变换结果发给第二运算单元132时,第二运算单元132可以直接从存储单元12中获取到权值变换结果。由此,第一运算单元131与第二运算单元132之间不需要传输权值变换结果,降低了对第一运算单元131与第二运算单元132之间传输带宽的要求、提高了传输速度。
第二运算单元132,用于响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,第二运算单元132将所述逆变换中对所述乘法运算结果的逆变换运算拆解为求和运算,并根据求和运算完成所述乘法运算结果的逆变换,得到运算结果。
参见图3,为再一个实施例中运算装置的结构框图。在图3所示的运算装置10中,第二运算单元132具体可以包括乘法单元1321和逆变换单元1322。
其中,乘法单元1321,用于响应所述第二指令,获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果。图3所示乘法单元1321,可以是从存储单元12获取预先存储的权值变换结果,或者是从第一运算单元131获取其计算得到的权值变换结果,在此不做限定。
逆变换单元1322,用于对所述乘法运算结果进行逆变换,其中,逆变换单元1322将所述逆变换中对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到所述运算结果。逆变换单元1322具体可以从存储单元12获取用于对乘法运算结果进行逆变换的逆变换矩阵(例如A),并以逆变换矩阵对所述乘法运算结果进行逆变换,得到运算结果。
参见图4,为又一个实施例中运算装置的结构框图。在图4所示的运算装置10中,运算单元13包括加法运算单元141和乘法运算单元142。在本实施例中,将运算过程中能够以加法完成的过程用加法运算单元141执行,而将对位乘用专门的乘法运算单元142执行。
加法运算单元141,用于响应所述第一指令,从所述存储单元12获取特征数据,对所述特征数据进行正变换,其中,加法运算单元141将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果。其中,加法运算单元141可以在接收到第一指令之前,预先对权值数据进行正变换,得到权值变换结果,并将权值变换结果存入存储单元12。由此,加法运算单元141和乘法运算单元142之间不需要传输权值变换结果,降低了对传输带宽的要求、提高了传输速度。或者,加法运算单元141可以响应第一指令,对所述特征数据进行正变换的同时,对权值数据进行正变换,得到权值变换结果后,与特征变换结果一起传输给乘法运算单元142。权值数据可以是存储在存储单元12中的数据。
乘法运算单元142,用于响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,获得乘法运算结果。乘法运算单元142对权值变换结果和特征变换结果中行列相同的元素一一对应做乘法,获得乘法运算结果。例如对4×4的正变换后权值矩阵和特征数据变换结果,一共需要执行16次乘法,获得4×4的乘法运算结果。
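对位乘不改变矩阵规模,两个4×4矩阵的对位乘恰需16次乘法,可用如下草图示意(U、V为说明用的假设数据):

```python
# 假设的两个4×4矩阵,分别代表正变换后的权值与特征变换结果
U = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
V = [[1] * 4 for _ in range(4)]

count = 0
M = [[0] * 4 for _ in range(4)]
for r in range(4):          # 行列相同位置的元素一一对应相乘
    for c in range(4):
        M[r][c] = U[r][c] * V[r][c]
        count += 1

print(count)              # 16:共执行16次乘法
print(len(M), len(M[0]))  # 4 4:结果仍为4×4,对位乘不改变矩阵规模
```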
加法运算单元141,还用于响应所述第二指令,对所述乘法运算结果进行逆变换,其中,所述加法运算单元141将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。加法运算单元141从乘法运算单元142获取乘法运算结果,并且可以从存储单元12获取逆变换矩阵,以逆变换矩阵对乘法运算结果进行逆变换,得到运算结果。
参见图5,为一个实施例中运算方法的流程示意图。图5所示的运算方法应用于上述实施例中的运算装置,所述运算装置包括:控制单元、存储单元、以及运算单元。其中,图5所示的运算方法包括:
S101,控制单元发送控制指令,所述控制指令用于指示所述运算单元进行winograd卷积运算。
S102,存储单元存储用于winograd卷积运算的数据。
S103,运算单元响应所述控制指令,从所述存储单元中提取数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述数据的变换运算拆解为求和运算,并根据所述求和运算完成所述数据的winograd变换。
本申请提供的运算方法中,通过控制单元发送控制指令,控制指令用于指示运算单元进行winograd卷积运算,存储单元存储用于winograd卷积运算的数据;运算单元响应控制指令,从存储单元中提取数据进行winograd卷积运算,其中,运算单元将winograd卷积运算中对数据的变换运算拆解为求和运算,并根据求和运算完成数据的winograd变换,通过以加法运算替代变换运算中的大量乘法运算,加速了winograd卷积运算的速度,也节约了运算资源,本申请提供的方案可降低卷积运算的资源消耗、提高卷积运算速度、减少运算时间。
在一些实施例中,所述运算单元将所述数据拆解为多个子张量;对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果。
在一些实施例中,所述运算单元从所述数据解析得到多个子张量,其中,所述数据为所述多个子张量之和,所述多个子张量的个数与所述数据中非0元素的个数相同,每个所述子张量中有单个非0元素,且在所述子张量中的非0元素与在所述数据中对应位置的非0元素相同。
在一些实施例中,所述运算单元获取各子张量对应的元子张量的winograd变换结果,其中,所述元子张量是将所述子张量的非0元素置为1的张量;将所述子张量中非0的元素值作为系数乘以对应的元子张量的winograd变换结果,得到所述子张量的winograd变换结果;将多个子张量的winograd变换结果相加得到所述数据的winograd变换结果。
在一些实施例中,所述运算单元对于每一个所述子张量,将所述子张量对应的元子张量左边乘以左乘矩阵、右边乘以右乘矩阵,得到所述元子张量的winograd变换结果,其中,所述左乘矩阵和所述右乘矩阵都是由所述子张量的规模以及winograd变换类型确定的,其中所述winograd变换类型包括正变换的winograd变换类型和逆变换的winograd变换类型。
在一些实施例中,所述数据包括特征数据、权值数据中的至少一种;
所述winograd变换包括正变换和/或逆变换。
在一些实施例中,所述控制指令包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令;
所述运算单元响应所述第一指令,从所述存储单元中提取所述特征数据,对所述特征数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述特征数据的变换运算拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述运算单元还响应所述第二指令获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
在一些实施例中,所述存储单元接收权值变换结果并存储;
所述运算单元响应所述第二指令,从所述存储单元中提取所述权值变换结果。
在一些实施例中,所述存储单元存储权值数据;
所述运算单元从所述存储单元中提取所述权值数据,对所述权值数据进行正变换,其中,所述运算单元将对所述权值数据的正变换拆解为求和运算,并根据求和运算完成所述权值数据的正变换,获得权值变换结果。
在一些实施例中,所述运算单元包括:第一运算单元和第二运算单元;
所述第一运算单元响应所述第一指令,从所述存储单元提取特征数据,对所述特征数据进行正变换,其中,所述第一运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述第二运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述第二运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
在一些实施例中,所述第二运算单元包括:乘法单元和逆变换单元;
所述乘法单元响应所述第二指令,获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
所述逆变换单元对所述乘法运算结果进行逆变换,其中,所述逆变换单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
在一些实施例中,所述运算单元包括:加法运算单元和乘法运算单元;
所述加法运算单元响应所述第一指令,从所述存储单元获取特征数据,对所述特征数据进行正变换,其中,所述加法运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述乘法运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
所述加法运算单元,还用于响应所述第二指令,对所述乘法运算结果进行逆变换,其中,所述加法运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制,因为依据本公开,某些步骤可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于可选实施例,所涉及的动作和模块并不一定是本公开所必须的。
进一步需要说明的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
应该理解,上述的装置实施例仅是示意性的,本公开的装置还可通过其它的方式实现。例如,上述实施例中所述单元/模块的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。例如,多个单元、模块或组件可以结合,或者可以集成到另一个系统,或一些特征可以忽略或不执行。
另外,若无特别说明,在本公开各个实施例中的各功能单元/模块可以集成在一个单元/模块中,也可以是各个单元/模块单独物理存在,也可以两个或两个以上单元/模块集成在一起。上述集成的单元/模块既可以采用硬件的形式实现,也可以采用软件程序模块的形式实现。
所述集成的单元/模块如果以硬件的形式实现时,该硬件可以是数字电路,模拟电路等等。硬件结构的物理实现包括但不局限于晶体管,忆阻器等等。若无特别说明,所述人工智能处理器可以是任何适当的硬件处理器,比如CPU、GPU、FPGA、DSP和ASIC等等。若无特别说明,所述存储单元可以是任何适当的磁存储介质或者磁光存储介质,比如,阻变式存储器RRAM(Resistive Random Access Memory)、动态随机存取存储器DRAM(Dynamic Random Access Memory)、静态随机存取存储器SRAM(Static Random-Access Memory)、增强动态随机存取存储器EDRAM(Enhanced Dynamic Random Access Memory)、高带宽内存HBM(High-Bandwidth Memory)、混合存储立方HMC(Hybrid Memory Cube)等等。
所述集成的单元/模块如果以软件程序模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储器中。基于这样的理解,本公开的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储器中,包括若干指令用以使得一台计算机设备(可为个人计算机、服务器或者网络设备等)执行本公开各个实施例所述方法的全部或部分步骤。而前述的存储器包括:U盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、移动硬盘、磁碟或者光盘等各种可以存储程序代码的介质。
在一种可能的实现方式中,还公开了一种人工智能芯片,其包括了上述运算装置。
在一种可能的实现方式中,还公开了一种板卡,其包括存储器件、接口装置和控制器件以及上述人工智能芯片;其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;所述存储器件,用于存储数据;所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;所述控制器件,用于对所述人工智能芯片的状态进行监控。
图6示出根据本公开实施例的板卡的结构框图,参阅图6,上述板卡除了包括上述芯片389以外,还可以包括其他的配套部件,该配套部件包括但不限于:存储器件390、接口装置391和控制器件392;
所述存储器件390与所述人工智能芯片通过总线连接,用于存储数据。所述存储器件可以包括多组存储单元393。每一组所述存储单元与所述人工智能芯片通过总线连接。可以理解,每一组所述存储单元可以是DDR SDRAM(英文:Double Data Rate SDRAM,双倍速率同步动态随机存储器)。
DDR不需要提高时钟频率就能加倍提高SDRAM的速度。DDR允许在时钟脉冲的上升沿和下降沿读出数据。DDR的速度是标准SDRAM的两倍。在一个实施例中,所述存储器件可以包括4组所述存储单元。每一组所述存储单元可以包括多个DDR4颗粒(芯片)。在一个实施例中,所述人工智能芯片内部可以包括4个72位DDR4控制器,上述72位DDR4控制器中64bit用于传输数据,8bit用于ECC校验。可以理解,当每一组所述存储单元中采用DDR4-3200颗粒时,数据传输的理论带宽可达到25600MB/s。
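上述25600MB/s的理论带宽可按如下方式估算:DDR4-3200表示每引脚3200MT/s,64bit数据位宽即每次传输8字节(8bit ECC不计入有效带宽)。这是一种常见的近似估算方式,实际带宽还受时序与控制器效率影响:

```python
transfers_per_s = 3200   # DDR4-3200:每秒3200兆次传输(MT/s)
data_bytes = 64 // 8     # 64bit数据位宽,每次传输8字节(8bit ECC不计入)
bandwidth_mb_s = transfers_per_s * data_bytes
print(bandwidth_mb_s)    # 25600,即正文中的25600MB/s
```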
在一个实施例中,每一组所述存储单元包括多个并联设置的双倍速率同步动态随机存储器。DDR在一个时钟周期内可以传输两次数据。在所述芯片中设置控制DDR的控制器,用于对每个所述存储单元的数据传输与数据存储的控制。
所述接口装置与所述人工智能芯片电连接。所述接口装置用于实现所述人工智能芯片与外部设备(例如服务器或计算机)之间的数据传输。例如在一个实施例中,所述接口装置可以为标准PCIE接口。比如,待处理的数据由服务器通过标准PCIE接口传递至所述芯片,实现数据转移。优选的,当采用PCIE 3.0 X 16接口传输时,理论带宽可达到16000MB/s。在另一个实施例中,所述接口装置还可以是其他的接口,本公开并不限制上述其他的接口的具体表现形式,所述接口装置能够实现转接功能即可。另外,所述人工智能芯片的计算结果仍由所述接口装置传送回外部设备(例如服务器)。
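PCIE 3.0 X 16的理论带宽可按如下方式估算:每通道8GT/s,采用128b/130b编码,共16通道。实际可用带宽还受协议开销影响,16000MB/s为工程上常用的近似值:

```python
gt_per_lane = 8e9      # PCIE 3.0:每通道8 GT/s
lanes = 16
encoding = 128 / 130   # 128b/130b编码的有效比特占比
bytes_per_s = gt_per_lane * lanes * encoding / 8
print(round(bytes_per_s / 1e6))  # 15754,工程上常近似记为16000MB/s
```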
所述控制器件与所述人工智能芯片电连接。所述控制器件用于对所述人工智能芯片的状态进行监控。具体的,所述人工智能芯片与所述控制器件可以通过SPI接口电连接。所述控制器件可以包括单片机(Micro Controller Unit,MCU)。如所述人工智能芯片可以包括多个处理芯片、多个处理核或多个处理电路,可以带动多个负载。因此,所述人工智能芯片可以处于多负载和轻负载等不同的工作状态。通过所述控制器件可以实现对所述人工智能芯片中多个处理芯片、多个处理核或多个处理电路的工作状态的调控。
在一种可能的实现方式中,公开了一种电子设备,其包括了上述人工智能芯片。电子设备包括数据处理装置、机器人、电脑、打印机、扫描仪、平板电脑、智能终端、手机、行车记录仪、导航仪、传感器、摄像头、服务器、云端服务器、相机、摄像机、投影仪、手表、耳机、移动存储、可穿戴设备、交通工具、家用电器、和/或医疗设备。
所述交通工具包括飞机、轮船和/或车辆;所述家用电器包括电视、空调、微波炉、冰箱、电饭煲、加湿器、洗衣机、电灯、燃气灶、油烟机;所述医疗设备包括核磁共振仪、B超仪和/或心电图仪。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。上述实施例的各技术特征可以进行任意的组合,为使描述简洁,未对上述实施例中的各个技术特征所有可能的组合都进行描述,然而,只要这些技术特征的组合不存在矛盾,都应当认为是本说明书记载的范围。
依据以下条款可更好地理解前述内容:
A1、一种运算方法,应用于运算装置,所述运算装置包括:控制单元、存储单元、以及运算单元;其中,
所述控制单元发送控制指令,所述控制指令用于指示所述运算单元进行winograd卷积运算,
所述存储单元存储用于winograd卷积运算的数据;
所述运算单元响应所述控制指令,从所述存储单元中提取数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述数据的变换运算拆解为求和运算,并根据所述求和运算完成所述数据的winograd变换。
A2、根据条款A1所述的方法,所述运算单元将所述数据拆解为多个子张量;对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果。
A3、根据条款A2所述的运算方法,
所述运算单元从所述数据解析得到多个子张量,其中,所述数据为所述多个子张量之和,所述多个子张量的个数与所述数据中非0元素的个数相同,每个所述子张量中有单个非0元素,且在所述子张量中的非0元素与在所述数据中对应位置的非0元素相同。
A4、根据条款A2所述的运算方法,
所述运算单元获取各子张量对应的元子张量的winograd变换结果,其中,所述元子张量是将所述子张量的非0元素置为1的张量;将所述子张量中非0的元素值作为系数乘以对应的元子张量的winograd变换结果,得到所述子张量的winograd变换结果;将多个子张量的winograd变换结果相加得到所述数据的winograd变换结果。
A5、根据条款A4所述的运算方法,所述运算单元对于每一个所述子张量,将所述子张量对应的元子张量左边乘以左乘矩阵、右边乘以右乘矩阵,得到所述元子张量的winograd变换结果,其中,所述左乘矩阵和所述右乘矩阵都是由所述子张量的规模以及winograd变换类型确定的,其中所述winograd变换类型包括正变换的winograd变换类型和逆变换的winograd变换类型。
A6、根据条款A1至A5任一所述的运算方法,
所述数据包括特征数据、权值数据中的至少一种;
所述winograd变换包括正变换和/或逆变换。
A7、根据条款A6所述的运算方法,所述控制指令包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令;
所述运算单元响应所述第一指令,从所述存储单元中提取所述特征数据,对所述特征数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述特征数据的变换运算拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述运算单元还响应所述第二指令获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
A8、根据条款A7所述的运算方法,
所述存储单元接收权值变换结果并存储;
所述运算单元响应所述第二指令,从所述存储单元中提取所述权值变换结果。
A9、根据条款A7所述的运算方法,
所述存储单元存储权值数据;
所述运算单元从所述存储单元中提取所述权值数据,对所述权值数据进行正变换,其中,所述运算单元将对所述权值数据的正变换拆解为求和运算,并根据求和运算完成所述权值数据的正变换,获得权值变换结果。
A10、根据条款A7所述的运算方法,所述运算单元包括:第一运算单元和第二运算单元;
所述第一运算单元响应所述第一指令,从所述存储单元提取特征数据,对所述特征数据进行正变换,其中,所述第一运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述第二运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述第二运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
A11、根据条款A10所述的运算方法,所述第二运算单元包括:乘法单元和逆变换单元;
所述乘法单元响应所述第二指令,获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
所述逆变换单元对所述乘法运算结果进行逆变换,其中,所述逆变换单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
A12、根据条款A7所述的运算方法,所述运算单元包括:加法运算单元和乘法运算单元;
所述加法运算单元响应所述第一指令,从所述存储单元获取特征数据,对所述特征数据进行正变换,其中,所述加法运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
所述乘法运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
所述加法运算单元,还用于响应所述第二指令,对所述乘法运算结果进行逆变换,其中,所述加法运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
以上对本公开实施例进行了详细介绍,本文中应用了具体个例对本公开的原理及实施方式进行了阐述,以上实施例的说明仅用于帮助理解本公开的方法及其核心思想。同时,本领域技术人员依据本公开的思想,基于本公开的具体实施方式及应用范围上做出的改变或变形之处,都属于本公开保护的范围。综上所述,本说明书内容不应理解为对本公开的限制。

Claims (28)

  1. 一种运算装置,所述装置用于进行winograd卷积运算,其特征在于,所述装置包括:控制单元、存储单元、以及运算单元;
    所述控制单元,用于发送控制指令,所述控制指令用于指示所述运算单元进行winograd卷积运算;
    所述存储单元,用于存储用于winograd卷积运算的数据;
    所述运算单元,用于响应所述控制指令,从所述存储单元中提取数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述数据的变换运算拆解为求和运算,并根据所述求和运算完成所述数据的winograd变换。
  2. 根据权利要求1所述的运算装置,其特征在于,
    所述运算单元,具体用于将所述数据拆解为多个子张量;对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果。
  3. 根据权利要求2所述的运算装置,其特征在于,
    所述运算单元,具体用于从所述数据解析得到多个子张量,其中,所述数据为所述多个子张量之和,所述多个子张量的个数与所述数据中非0元素的个数相同,每个所述子张量中有单个非0元素,且在所述子张量中的非0元素与在所述数据中对应位置的非0元素相同。
  4. 根据权利要求2所述的运算装置,其特征在于,
    所述运算单元,具体用于获取各子张量对应的元子张量的winograd变换结果,其中,所述元子张量是将所述子张量的非0元素置为1的张量;将所述子张量中非0的元素值作为系数乘以对应的元子张量的winograd变换结果,得到所述子张量的winograd变换结果;将多个子张量的winograd变换结果相加得到所述数据的winograd变换结果。
  5. 根据权利要求4所述的运算装置,其特征在于,所述运算单元,具体用于对于每一个所述子张量,将所述子张量对应的元子张量左边乘以左乘矩阵、右边乘以右乘矩阵,得到所述元子张量的winograd变换结果,其中,所述左乘矩阵和所述右乘矩阵都是由所述子张量的规模以及winograd变换类型确定的,其中所述winograd变换类型包括正变换的winograd变换类型和逆变换的winograd变换类型。
  6. 根据权利要求1至5任一所述的运算装置,其特征在于,
    所述数据包括特征数据、权值数据中的至少一种;
    所述winograd变换包括正变换和/或逆变换。
  7. 根据权利要求6所述的运算装置,其特征在于,
    所述控制指令包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令;
    所述运算单元,用于响应所述第一指令,从所述存储单元中提取所述特征数据,对所述特征数据进行正变换,其中,所述运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    所述运算单元,还用于响应所述第二指令获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  8. 根据权利要求7所述的运算装置,其特征在于,
    所述存储单元,具体用于接收权值变换结果并存储;
    所述运算单元,具体用于响应所述第二指令,从所述存储单元中提取所述权值变换结果。
  9. 根据权利要求7所述的运算装置,其特征在于,
    所述存储单元,具体用于存储权值数据;
    所述运算单元,具体用于从所述存储单元中提取所述权值数据,对所述权值数据进行正变换,其中,所述运算单元将对所述权值数据的正变换拆解为求和运算,并根据求和运算完成所述权值数据的正变换,获得权值变换结果。
  10. 根据权利要求7所述的运算装置,其特征在于,所述运算单元包括:
    第一运算单元,用于响应所述第一指令,从所述存储单元提取特征数据,对所述特征数据进行正变换,其中,所述第一运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    第二运算单元,用于响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述第二运算单元将所述逆变换中对所述乘法运算结果的逆变换运算拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  11. 根据权利要求10所述的运算装置,其特征在于,所述第二运算单元包括:
    乘法单元,用于响应所述第二指令,获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
    逆变换单元,用于对所述乘法运算结果进行逆变换,其中,所述逆变换单元将所述逆变换中对所述乘法运算结果的变换运算拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到所述运算结果。
  12. 根据权利要求7所述的运算装置,其特征在于,所述运算单元包括:
    加法运算单元,用于响应所述第一指令,从所述存储单元获取特征数据,对所述特征数据进行正变换,其中,所述加法运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    乘法运算单元,用于响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
    所述加法运算单元,还用于响应所述第二指令,对所述乘法运算结果进行逆变换,其中,所述加法运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  13. 一种人工智能芯片,其特征在于,所述芯片包括如权利要求1-12中任意一项所述的运算装置。
  14. 一种电子设备,其特征在于,所述电子设备包括如权利要求13所述的人工智能芯片。
  15. 一种板卡,其特征在于,所述板卡包括:存储器件、接口装置和控制器件以及如权利要求13所述的人工智能芯片;
    其中,所述人工智能芯片与所述存储器件、所述控制器件以及所述接口装置分别连接;
    所述存储器件,用于存储数据;
    所述接口装置,用于实现所述人工智能芯片与外部设备之间的数据传输;
    所述控制器件,用于对所述人工智能芯片的状态进行监控。
  16. 根据权利要求15所述的板卡,其特征在于,
    所述存储器件包括:多组存储单元,每一组所述存储单元与所述人工智能芯片通过总线连接,所述存储单元为:DDR SDRAM;
    所述芯片包括:DDR控制器,用于对每个所述存储单元的数据传输与数据存储的控制;
    所述接口装置为:标准PCIE接口。
  17. 一种运算方法,其特征在于,应用于运算装置,所述运算装置包括:控制单元、存储单元、以及运算单元;其中,
    所述控制单元发送控制指令,所述控制指令用于指示所述运算单元进行winograd卷积运算,
    所述存储单元存储用于winograd卷积运算的数据;
    所述运算单元响应所述控制指令,从所述存储单元中提取数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述数据的变换运算拆解为求和运算,并根据所述求和运算完成所述数据的winograd变换。
  18. 根据权利要求17所述的运算方法,其特征在于,
    所述运算单元将所述数据拆解为多个子张量;对所述多个子张量进行变换运算并求和,根据求和运算的结果得到所述数据的winograd变换结果。
  19. 根据权利要求18所述的运算方法,其特征在于,
    所述运算单元从所述数据解析得到多个子张量,其中,所述数据为所述多个子张量之和,所述多个子张量的个数与所述数据中非0元素的个数相同,每个所述子张量中有单个非0元素,且在所述子张量中的非0元素与在所述数据中对应位置的非0元素相同。
  20. 根据权利要求18所述的运算方法,其特征在于,
    所述运算单元获取各子张量对应的元子张量的winograd变换结果,其中,所述元子张量是将所述子张量的非0元素置为1的张量;将所述子张量中非0的元素值作为系数乘以对应的元子张量的winograd变换结果,得到所述子张量的winograd变换结果;将多个子张量的winograd变换结果相加得到所述数据的winograd变换结果。
  21. 根据权利要求20所述的运算方法,其特征在于,所述运算单元对于每一个所述子张量,将所述子张量对应的元子张量左边乘以左乘矩阵、右边乘以右乘矩阵,得到所述元子张量的winograd变换结果,其中,所述左乘矩阵和所述右乘矩阵都是由所述子张量的规模以及winograd变换类型确定的,其中所述winograd变换类型包括正变换的winograd变换类型和逆变换的winograd变换类型。
  22. 根据权利要求17至21任一所述的运算方法,其特征在于,
    所述数据包括特征数据、权值数据中的至少一种;
    所述winograd变换包括正变换和/或逆变换。
  23. 根据权利要求22所述的运算方法,其特征在于,所述控制指令包括第一指令和第二指令,其中,所述第一指令包括正变换指令,所述第二指令包括对位乘指令和逆变换指令;
    所述运算单元响应所述第一指令,从所述存储单元中提取所述特征数据,对所述特征数据进行winograd卷积运算,其中,所述运算单元将所述winograd卷积运算中对所述特征数据的变换运算拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    所述运算单元还响应所述第二指令获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  24. 根据权利要求23所述的运算方法,其特征在于,
    所述存储单元接收权值变换结果并存储;
    所述运算单元响应所述第二指令,从所述存储单元中提取所述权值变换结果。
  25. 根据权利要求23所述的运算方法,其特征在于,
    所述存储单元存储权值数据;
    所述运算单元从所述存储单元中提取所述权值数据,对所述权值数据进行正变换,其中,所述运算单元将对所述权值数据的正变换拆解为求和运算,并根据求和运算完成所述权值数据的正变换,获得权值变换结果。
  26. 根据权利要求23所述的运算方法,其特征在于,所述运算单元包括:第一运算单元和第二运算单元;
    所述第一运算单元响应所述第一指令,从所述存储单元提取特征数据,对所述特征数据进行正变换,其中,所述第一运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    所述第二运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;对所述乘法运算结果进行逆变换,其中,所述第二运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  27. 根据权利要求26所述的运算方法,其特征在于,所述第二运算单元包括:乘法单元和逆变换单元;
    所述乘法单元响应所述第二指令,获取经过正变换的权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
    所述逆变换单元对所述乘法运算结果进行逆变换,其中,所述逆变换单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
  28. 根据权利要求23所述的运算方法,其特征在于,所述运算单元包括:加法运算单元和乘法运算单元;
    所述加法运算单元响应所述第一指令,从所述存储单元获取特征数据,对所述特征数据进行正变换,其中,所述加法运算单元将对所述特征数据的正变换拆解为求和运算,并根据所述求和运算完成所述特征数据的正变换,得到特征变换结果;
    所述乘法运算单元响应所述第二指令获取权值变换结果,对所述权值变换结果和特征变换结果进行对位乘,得到乘法运算结果;
    所述加法运算单元,还用于响应所述第二指令,对所述乘法运算结果进行逆变换,其中,所述加法运算单元将对所述乘法运算结果的逆变换拆解为求和运算,并根据所述求和运算完成所述乘法运算结果的逆变换,得到运算结果。
PCT/CN2020/113162 2019-11-01 2020-09-03 运算装置 WO2021082723A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/773,446 US20230039892A1 (en) 2019-11-01 2020-09-03 Operation apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911061951.9 2019-11-01
CN201911061951.9A CN112765542A (zh) 2019-11-01 2019-11-01 运算装置

Publications (1)

Publication Number Publication Date
WO2021082723A1 true WO2021082723A1 (zh) 2021-05-06

Family

ID=75692022

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113162 WO2021082723A1 (zh) 2019-11-01 2020-09-03 运算装置

Country Status (3)

Country Link
US (1) US20230039892A1 (zh)
CN (1) CN112765542A (zh)
WO (1) WO2021082723A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111831337B (zh) * 2019-04-19 2022-11-29 安徽寒武纪信息科技有限公司 数据同步方法及装置以及相关产品

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451652A (zh) * 2016-05-31 2017-12-08 三星电子株式会社 高效的稀疏并行的基于威诺格拉德的卷积方案
CN108875908A (zh) * 2017-05-16 2018-11-23 三星电子株式会社 优化的神经网络输入步长方法及设备
US20180349317A1 (en) * 2017-06-01 2018-12-06 Samsung Electronics Co., Ltd. Apparatus and method for generating efficient convolution
CN109117187A (zh) * 2018-08-27 2019-01-01 郑州云海信息技术有限公司 卷积神经网络加速方法及相关设备
CN109388777A (zh) * 2017-08-07 2019-02-26 英特尔公司 一种用于经优化的Winograd卷积加速器的系统和方法
CN110097172A (zh) * 2019-03-18 2019-08-06 中国科学院计算技术研究所 一种基于winograd卷积运算的卷积神经网络数据处理方法及装置

Also Published As

Publication number Publication date
CN112765542A (zh) 2021-05-07
US20230039892A1 (en) 2023-02-09

Similar Documents

Publication Publication Date Title
CN109522052B (zh) 一种计算装置及板卡
US20220091849A1 (en) Operation module and method thereof
CN109543832B (zh) 一种计算装置及板卡
EP3557484A1 (en) Neural network convolution operation device and method
US20230026006A1 (en) Convolution computation engine, artificial intelligence chip, and data processing method
CN109416755B (zh) 人工智能并行处理方法、装置、可读存储介质、及终端
CN112686379B (zh) 集成电路装置、电子设备、板卡和计算方法
CN111028136B (zh) 一种人工智能处理器处理二维复数矩阵的方法和设备
WO2021083101A1 (zh) 数据处理方法、装置及相关产品
WO2021082725A1 (zh) Winograd卷积运算方法及相关产品
WO2021082723A1 (zh) 运算装置
WO2021027972A1 (zh) 数据同步方法及装置以及相关产品
US11874898B2 (en) Streaming-based artificial intelligence convolution processing method and apparatus, readable storage medium and terminal
CN111143766A (zh) 人工智能处理器处理二维复数矩阵的方法和设备
WO2021082746A1 (zh) 运算装置及相关产品
WO2021223642A1 (zh) 数据处理方法及装置以及相关产品
WO2021018313A1 (zh) 数据同步方法及装置以及相关产品
WO2021027973A1 (zh) 数据同步方法及装置以及相关产品
CN111382852B (zh) 数据处理装置、方法、芯片及电子设备
WO2021082724A1 (zh) 运算方法及相关产品
CN111382856B (zh) 数据处理装置、方法、芯片及电子设备
WO2021082722A1 (zh) 运算装置、方法及相关产品
WO2021082747A1 (zh) 运算装置及相关产品
CN112784206A (zh) winograd卷积运算方法、装置、设备及存储介质
WO2021223644A1 (zh) 数据处理方法及装置以及相关产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20882581

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20882581

Country of ref document: EP

Kind code of ref document: A1