WO2021226782A1 - Convolution calculation device, method, and computer storage medium - Google Patents

Convolution calculation device, method, and computer storage medium

Info

Publication number
WO2021226782A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
data
convolution kernel
map data
expanded
Prior art date
Application number
PCT/CN2020/089570
Other languages
English (en)
French (fr)
Inventor
刘子男
仇晓颖
韩彬
Original Assignee
深圳市大疆创新科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to CN202080006263.7A (published as CN113168429A)
Priority to PCT/CN2020/089570
Publication of WO2021226782A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/15Correlation function computation including computation of convolution operations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443Sum of products
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present invention relates to the technical field of convolution calculation, and in particular to a convolution calculation device, method, and computer storage medium.
  • A convolutional neural network is a feedforward neural network composed of one or more convolutional layers, pooling layers, activation layers, and fully connected layers. Compared with other deep learning structures, convolutional neural networks give better results in image and speech recognition. Compared with other deep feedforward neural networks, convolutional neural networks feature weight sharing, data sharing, and sparsity, and require fewer parameters and less computation, making them an attractive deep learning structure.
  • The output feature map of each layer in a convolutional neural network is the sum of the products of a set of weights and their corresponding input feature maps, and the output of each layer is the input of the next hidden layer or the output layer.
  • A convolution calculation is the process in which the weight matrix slides over the feature map matrix with a fixed stride; at each step, the region of the feature map with the same size as the weight matrix is multiplied element-wise by the weight matrix, and the multiplication results of the different channels are added to obtain one point of the output feature map.
  • Weight sharing and feature map sharing are two important characteristics of convolutional neural networks. However, when current convolutional neural networks perform convolution calculations, loading the weights and feature maps takes up a great deal of bandwidth, and the feature-map-sharing and weight-sharing characteristics are not exploited: the weights and feature maps loaded each time can be used for only one calculation and must be reloaded for the next, even though some of the weights and feature maps required by different calculations are the same, causing redundant loading overhead.
  • As networks grow larger, the size of the memory consumed by a neural network becomes an issue that cannot be ignored.
  • Especially on mobile devices, commonly used neural networks are generally between tens and hundreds of MB in size.
  • The size of a neural network brings not only memory capacity problems, but also memory bandwidth and battery consumption problems, which limits the deployment of neural networks on mobile devices.
  • A first aspect of the embodiments of the present invention provides a convolution calculation device, the convolution calculation device including:
  • an expansion module, configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation, and M is a positive integer;
  • a systolic operation module, configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output the operation results;
  • a timing alignment module, configured to perform timing alignment on the operation results output by the systolic operation module, and to output the aligned operation results.
  • A second aspect of the embodiments of the present invention provides a convolution calculation method, the convolution calculation method including:
  • obtaining feature map data and convolution kernel data, and expanding the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation, and M is a positive integer;
  • performing a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain operation results;
  • performing timing alignment on the operation results, and outputting the aligned operation results.
  • A third aspect of the embodiments of the present invention provides a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the convolution calculation method of the embodiments of the present invention are implemented.
  • In the convolution calculation method, convolution calculation device, and computer storage medium of the embodiments of the present invention, the precision can be dynamically configured, and data loaded once can be used for multiple calculations.
  • Fig. 1 shows a structural block diagram of a convolution calculation device according to an embodiment of the present invention
  • Figure 2 shows a feature map and a convolution kernel according to an embodiment of the present invention
  • Figure 3 shows an expanded feature map and convolution kernel according to an embodiment of the present invention
  • Fig. 4 shows a structural diagram of a convolution calculation device according to an embodiment of the present invention
  • Figure 5 shows a structural diagram of a systolic unit according to an embodiment of the present invention;
  • Fig. 6 shows a systolic operation process of a systolic array with 4-bit precision according to an embodiment of the present invention
  • FIG. 7 shows the distribution, between adjacent systolic units, of the expanded feature map data and the expanded convolution kernel data at 8-bit precision according to an embodiment of the present invention;
  • Fig. 8 shows a systolic operation process of a systolic array with 8-bit precision according to an embodiment of the present invention
  • Fig. 9 shows a systolic operation process of a systolic array with 16-bit precision according to an embodiment of the present invention
  • Fig. 10 shows a structural diagram of a summation sub-module according to an embodiment of the present invention.
  • FIG. 11 shows a structural diagram of a timing alignment module according to an embodiment of the present invention.
  • Fig. 12 shows a flowchart of a convolution calculation method according to an embodiment of the present invention.
  • Fig. 1 shows a structural block diagram of a convolution calculation device 100 according to an embodiment of the present invention.
  • The convolution calculation device 100 includes an expansion module 110, a systolic operation module 120, and a timing alignment module 130.
  • The expansion module 110 is communicatively connected to the systolic operation module 120, and the systolic operation module 120 is communicatively connected to the timing alignment module 130, wherein:
  • the expansion module 110 is configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation, and M is a positive integer;
  • the systolic operation module 120 is configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output the operation results;
  • the timing alignment module 130 is configured to perform timing alignment on the operation results output by the systolic operation module 120, and to output the aligned operation results.
  • The embodiment of the present invention proposes that the convolution calculation device 100 adopt a systolic array structure and exploit the weight-sharing and feature-map-sharing characteristics of convolutional neural networks.
  • Data can be loaded once and used for multiple calculations, and the precision can be dynamically configured.
  • In one embodiment, the minimum bit width for the convolution multiplication of the feature map data and the convolution kernel data in the convolution calculation device 100 is 5 bits, and the convolution calculation device 100 can be configured for 4-bit, 8-bit, or 16-bit precision.
  • The operands are expanded to 5 bits, 10 bits, and 20 bits, respectively, for the operation.
  • The 4-bit, 8-bit, and 16-bit precision formats are called the HH, CC, and SS formats, respectively.
  • The computing power is 16K MAC@HH, 4K MAC@CC, and 1K MAC@SS, which makes full use of the computing power provided by the hardware.
  • To facilitate hardware calculation, the feature map and the convolution kernel are first expanded, thereby converting the convolution operation into a matrix multiplication.
  • The convolution operation slides the convolution kernel over the feature map as a sliding window, multiplies the corresponding elements in the current window, and then sums the products to obtain the result.
  • A vector inner product is likewise computed by multiplying and then summing, so the elements in each window can be expanded into a vector and the result computed as a vector inner product.
  • In this way, the data required during the operation can be stored in contiguous memory, thereby improving the access speed.
  • FIG. 2 shows a feature map and convolution kernels used for a convolution calculation.
  • The size of the feature map is 3×3 with 2 channels, and there are 2 convolution kernels, namely convolution kernel 0 and convolution kernel 1.
  • The size of each convolution kernel is 2×2.
  • The convolution operation slides the 2×2 convolution kernel over the feature map matrix; each step multiplies it with a 2×2 region of the feature map matrix, and the multiplication results of the two channels are added to obtain one point of the output feature map.
  • In the output feature map, each channel has size 2×2.
  • In Figure 3, the feature map and convolution kernels shown in Figure 2 are expanded using the matrix (img2col) method to obtain the feature map data and the convolution kernel data.
  • The convolution kernels are expanded into a 2×8 matrix in row, column, channel order, and the feature map is likewise expanded into an 8×4 matrix.
  • Each row of the convolution kernel matrix is multiplied by each column of the feature map matrix.
  • Each such product yields one point of the output, so that the original matrix convolution becomes vector dot products.
  • C0 and C1 are loaded into dot product core 0 and dot product core 1, respectively, and D3, D2, D1, and D0 are loaded in sequence into dot product core 0 and each multiplied by C0.
  • C0 is the result of expanding convolution kernel 0 in row, column, channel order,
  • and D3 is the result of expanding the 2×2 region in the upper left corner of the feature map matrix in row, column, channel order.
  • This multiplication is equivalent to the convolution operation above; both yield the first point in the first row of the output feature map. By the same principle, multiplying C0 by D3, D2, D1, and D0 yields the 4 points of channel 0 of the output feature map.
  • D3, D2, D1, and D0 are then transferred from dot product core 0 to dot product core 1 and multiplied by C1 to obtain the four points of channel 1 of the output feature map. Therefore, the convolution kernel data only needs to be loaded twice, and the feature map data only needs to be loaded 4 times.
  • Multiplying the convolution kernel data by the feature map data yields the 2×4 output feature map, as illustrated in the sketch below.
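The img2col expansion described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patent's implementation: it assumes stride 1 and unrolls each window in row, column, channel order with the channel index varying fastest; all names and the random test data are illustrative.

```python
import numpy as np

def img2col(fmap, kh, kw):
    """Unroll every kh x kw window of an (H, W, C) feature map into one
    column, in row, column, channel order, giving a (kh*kw*C, windows) matrix."""
    H, W, C = fmap.shape
    cols = [fmap[i:i + kh, j:j + kw, :].reshape(-1)
            for i in range(H - kh + 1)
            for j in range(W - kw + 1)]
    return np.stack(cols, axis=1)

# Figure 2 setup: a 3x3 feature map with 2 channels and two 2x2 kernels.
fmap = np.random.randint(-8, 8, size=(3, 3, 2))
kernels = np.random.randint(-8, 8, size=(2, 2, 2, 2))  # (kernel, rows, cols, C)

col = img2col(fmap, 2, 2)     # 8x4 feature map matrix; first column = upper-left window
ker = kernels.reshape(2, -1)  # 2x8 convolution kernel matrix (rows C0 and C1)
out = ker @ col               # 2x4 output feature map, one row per kernel
```

Each row-times-column product here corresponds to one dot product core operation in the text: row C0 multiplied by the four columns yields the four points of output channel 0.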
  • The convolution calculation device 100 of the embodiment of the present invention may include an expansion unit for performing the expansion process described above; alternatively, the convolution calculation device 100 may directly obtain the expanded feature map data and the expanded convolution kernel data.
  • The feature map data and the convolution kernel data can be stored in a storage unit.
  • A control unit determines the feature map data and convolution kernel data that need to be loaded into the convolution calculation device 100, and loads the feature map data and convolution kernel data from the storage unit into the convolution calculation device 100.
  • The convolution calculation device 100 may include the storage unit and the control unit; alternatively, the convolution calculation device 100 may not include a storage unit but instead include a communication interface for communicating with the storage unit.
  • The convolution calculation device 100 may further include a buffer unit for buffering data received from the communication interface.
  • The convolution calculation device 100 includes an expansion module 110, a systolic operation module 120, and a timing alignment module 130 connected in sequence.
  • The systolic operation module 120 includes a systolic array 121 and a summation sub-module 122 connected in sequence.
  • After the feature map data and the convolution kernel data are input to the convolution calculation device 100, they are first expanded by the expansion module 110, which expands the bit width of each feature map datum and each convolution kernel datum to M times N, where N is the minimum bit width for the convolution multiplication operation of the feature map data and the convolution kernel data.
  • The expansion method of the expansion module 110 is as follows: if the precision of the feature map data and the convolution kernel data is lower than N bits, each feature map datum and each convolution kernel datum is padded so that it is expanded to N bits; if the precision of the feature map data and the convolution kernel data is higher than N bits, each feature map datum and each convolution kernel datum is split into M pieces of N-bit data.
  • The above N-bit data adopts the two's complement format; each of the M N-bit pieces is referred to below as a two's complement segment.
  • For example, N is equal to 5.
  • If the feature map data and the convolution kernel data have 4-bit precision, they are expanded to 5 bits; if they have 8-bit precision, they are expanded to 10 bits; if they have 16-bit precision, they are expanded to 20 bits.
  • A 16-bit datum B can be split into four 4-bit segments, the most significant of which is a 4-bit two's complement; to facilitate calculation, all of them are uniformly expanded to 5 bits, so the expanded bit width becomes 20 bits.
  • Likewise, an 8-bit datum B can be split and expanded into two 5-bit two's complement segments, so the expanded bit width becomes 10 bits.
  • The two's complement segments are padded as follows: if the precision of the feature map data and the convolution kernel data is lower than N bits, the feature map data and the convolution kernel data are padded with their sign bit (sign extension); if the precision of the feature map data and the convolution kernel data is higher than N bits, the most significant segment of each datum is padded with its sign bit, and the remaining M-1 segments are padded with 0.
  • In this way, the feature map data and convolution kernel data of every precision are uniformly expanded to M times the minimum bit width N of the convolution operation, and the corresponding number of systolic units can then be invoked for the calculation according to the value of M. There is no need to uniformly expand data of all precisions to the highest precision; that is, compatibility with multi-precision data is achieved, as detailed below and illustrated in the following sketch.
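The expansion and splitting rule can be illustrated with a short Python sketch. It encodes one reading of the text, under the assumption that a datum of precision 4M bits is split into M 4-bit pieces, the most significant piece sign-extended and the lower pieces zero-extended to 5 bits; the function name and the least-significant-first segment order are illustrative choices.

```python
def expand(value, precision):
    """Split a `precision`-bit two's complement integer into M = precision // 4
    segments, each of which fits in N = 5 bits: the most significant segment
    keeps the sign (sign extension), the lower segments are zero-extended,
    so that value == sum(seg[i] << (4 * i))."""
    m = precision // 4
    raw = value & ((1 << precision) - 1)              # raw two's complement bits
    segs = [(raw >> (4 * i)) & 0xF for i in range(m)]
    if segs[-1] & 0x8:                                # sign-extend the top segment
        segs[-1] -= 0x10
    return segs                                       # least significant first

# 4-bit -> one 5-bit segment (M=1); 8-bit -> two (10 bits); 16-bit -> four (20 bits).
for v, p in ((-3, 4), (-57, 8), (100, 8), (-30000, 16)):
    segs = expand(v, p)
    assert sum(s << (4 * i) for i, s in enumerate(segs)) == v
```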
  • The systolic operation module 120 mainly includes the systolic array 121.
  • The systolic array 121 uses the weight-sharing and feature-map-sharing characteristics of the convolutional neural network to complete the convolution calculation: once loaded, the data can be used for multiple calculations, which saves the loading time of the feature map data and the convolution kernel data, increases the calculation density of the convolution, and accelerates neural network inference.
  • The systolic array 121 includes a plurality of systolic units, and the number of systolic units is generally set to a multiple of 4.
  • For example, the number of systolic units can be 128.
  • The elements that propagate systolically in the systolic array 121 (forwarded between the systolic units of the systolic array 121) are the expanded feature map data; the expanded convolution kernel data is not propagated.
  • The expanded convolution kernel data is loaded in sequence into each systolic unit, multiplied by the expanded feature map data, and is not propagated systolically.
  • Alternatively, the expanded convolution kernel data may propagate systolically in the systolic array 121, and the expanded feature map data may be loaded in turn into each systolic unit.
  • The following description takes the systolically propagated elements to be the expanded feature map data as an example.
  • At every moment, the expanded feature map data is written to the uppermost systolic unit in the systolic array and propagates systolically downward from the uppermost systolic unit; the expanded convolution kernel data is written to the systolic units in sequence over time.
  • Each systolic unit performs a dot product operation on the expanded feature map data and the expanded convolution kernel data received at the current moment, and outputs the dot product result.
  • The expanded convolution kernel data is sent to each systolic unit by broadcast and loaded at the appropriate time.
  • The expanded feature map data is always written to the top one or more systolic units.
  • The data propagates systolically down to the lowest systolic unit, and in each calculation cycle a dot product is performed on the expanded feature map data and the expanded convolution kernel data in the current systolic unit. If the systolic operation has 128 calculation cycles, each systolic unit outputs 128 results, and the 128 systolic units output a total of 128×128 results. If fewer than 128 systolic units are required, only some of the systolic units are activated, and the remaining systolic units do not participate in the calculation.
  • Each systolic unit includes a multi-stage pipeline, in which the first pipeline stage includes a plurality of multipliers and the second through last pipeline stages each include a plurality of adders; each adder receives and adds two output data from the previous stage.
  • Each multiplier in the first pipeline stage performs a one-to-one multiplication of the expanded convolution kernel data and the expanded feature map data.
  • For example, each multiplier receives one 5-bit feature map datum and one 5-bit convolution kernel datum and multiplies them.
  • When the precision is higher than N bits, each multiplier performs a one-to-one multiplication of an N-bit two's complement segment of the expanded convolution kernel data and an N-bit two's complement segment of the expanded feature map data.
  • For example, each multiplier receives a 5-bit two's complement segment split from the expanded feature map data and a 5-bit two's complement segment split from the expanded convolution kernel data and multiplies them.
  • The systolic unit shown in Figure 5 includes an 8-stage pipeline.
  • The first pipeline stage 502 includes a register for storing the expanded feature map data, a register for storing the expanded convolution kernel data, and 128 5-bit signed multipliers, each of which completes the multiplication of one 5-bit feature map datum and one 5-bit convolution kernel datum; the leftmost bit is the MSB (most significant bit) and the rightmost bit is the LSB (least significant bit), and each multiplier outputs a 9-bit result to the second pipeline stage.
  • The second pipeline stage 504 includes 64 adders, each of which completes one addition of two multiplication results output by the first stage. It should be noted that in the CC and SS formats, because the data have different weights, the input on the left side of each adder needs to be shifted left by 4 bits; data in the HH format does not need to be shifted.
  • The third pipeline stage 506 includes 32 adders.
  • In the SS format, because the data have different weights, the input on the left side of each adder needs to be shifted left by 8 bits; data in the CC and HH formats does not need to be shifted.
  • The fourth pipeline stage 508 through the eighth pipeline stage 516 include 16, 8, 4, 2, and 1 adders, respectively.
  • The adders in each pipeline stage add the output results of two adders of the previous stage.
  • The eighth pipeline stage outputs a 25-bit operation result; the shift pattern is modeled in the sketch below.
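The shift pattern of this adder tree can be captured in a small behavioral model. This is a sketch of the arithmetic only, not the hardware pipeline: stage 1 multiplies 5-bit segments pairwise, stage 2 shifts its higher-weight input left by 4 bits in the CC and SS formats, stage 3 shifts by 8 bits in the SS format, and the remaining stages add without shifting. It assumes that within each element the segments arrive most significant first, so the left adder input is the higher-weight one; all names are illustrative.

```python
def reduce_stage(vals, shift):
    """One adder stage: each adder adds two outputs of the previous stage,
    with the left (higher-weight) input shifted left by `shift` bits."""
    return [(vals[i] << shift) + vals[i + 1] for i in range(0, len(vals), 2)]

def systolic_unit_dot(c_segs, d_segs, fmt):
    """Behavioral model of the 8-stage pipeline for 128 segment pairs.
    HH: 128 independent products, summed with no shifts.
    CC: 64 elements of 2 segments each, so stage 2 shifts by 4.
    SS: 32 elements of 4 segments each, so stage 2 shifts by 4 and
        stage 3 by 8 (adjacent segments differ in weight by 2**4)."""
    vals = [c * d for c, d in zip(c_segs, d_segs)]   # stage 1: 128 multipliers
    stage = 2
    while len(vals) > 1:                             # stages 2..8: adder tree
        if stage == 2 and fmt in ("CC", "SS"):
            shift = 4
        elif stage == 3 and fmt == "SS":
            shift = 8
        else:
            shift = 0
        vals = reduce_stage(vals, shift)
        stage += 1
    return vals[0]
```

If the hardware orders the segments the other way within an element, the same shifts would apply to the other adder input; the total is unchanged.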
  • It should be understood that FIG. 5 is only an example and not a limitation.
  • The specific structure of the systolic unit can also be implemented in other ways, as long as the systolic unit can perform a dot product operation on the expanded feature map data and the expanded convolution kernel data.
  • The expanded feature map data and the expanded convolution kernel data transferred from the expansion module 110 to the systolic operation module 120 have been expanded to M times the minimum bit width N of the convolution operation, and each systolic unit performs convolution operations at the minimum bit width N.
  • When the precision of the feature map data and the convolution kernel data is lower than N bits, each individual systolic unit performs the dot product operation on the expanded feature map data and the expanded convolution kernel data, and the dot product result is a complete result; when the precision of the feature map data and the convolution kernel data is higher than N bits, each individual systolic unit completes part of the dot product operation of the expanded feature map data and the expanded convolution kernel data and outputs a partial result.
  • The partial results output by M adjacent systolic units together constitute the complete result of one dot product operation.
  • In the HH (4-bit) format, each systolic unit completes the dot product of one row of expanded convolution kernel data and one column of expanded feature map data, and outputs a complete dot product result;
  • in the CC (8-bit) format, two adjacent systolic units jointly complete the dot product of one row of expanded convolution kernel data and one column of expanded feature map data;
  • in the SS (16-bit) format, four adjacent systolic units jointly complete the dot product of one row of expanded convolution kernel data and one column of expanded feature map data.
  • The expanded feature map data is sent at every moment to the systolic unit at the top of the systolic array and propagates systolically from the upper systolic units to the lower systolic units.
  • The expanded convolution kernel data is written in sequence into systolic unit 0, systolic unit 1, systolic unit 2, and so on.
  • Here, D and C represent the expanded feature map data and the expanded convolution kernel data, respectively, and the numbers after D and C distinguish the expanded feature map data and the expanded convolution kernel data at different times.
  • The expanded feature map data at each time corresponds to one column of the expanded feature map matrix shown in FIG. 3,
  • and the expanded convolution kernel data at each time corresponds to one row of the expanded convolution kernel data shown in FIG. 3.
  • At time T0, D0 and C0 are both sent to systolic unit 0, which computes D0×C0.
  • At time T1, D1 is sent to systolic unit 0,
  • C1 is sent to systolic unit 1,
  • and D0 propagates from systolic unit 0 to systolic unit 1.
  • Systolic unit 0 computes D1×C0,
  • and systolic unit 1 computes D0×C1.
  • At time T2, D2 is sent to systolic unit 0,
  • C2 is sent to systolic unit 2,
  • D0 propagates from systolic unit 1 to systolic unit 2,
  • and D1 propagates from systolic unit 0 to systolic unit 1.
  • Systolic unit 0 computes D2×C0,
  • systolic unit 1 computes D1×C1,
  • and systolic unit 2 computes D0×C2; and so on.
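The schedule just walked through can be reproduced with a toy scheduler: systolic unit k is preloaded with kernel row C_k, and at time t it sees feature map column D_(t-k). This models only the data flow (no pipelining, bit widths, or partial activation); the names are made up for illustration.

```python
import numpy as np

def systolic_hh(D_cols, C_rows):
    """D_cols[t]: feature map column entering the array at time t.
    C_rows[k]:  kernel row preloaded into systolic unit k.
    At time t, unit k computes dot(C_rows[k], D_cols[t - k]) -- the T0/T1/T2
    pattern above: unit 0 computes D0*C0, then D1*C0 while unit 1 computes
    D0*C1, and so on."""
    T, K = len(D_cols), len(C_rows)
    out = {}
    for t in range(T + K - 1):          # calculation cycles
        for k in range(K):              # systolic units
            if 0 <= t - k < T:          # the column has reached unit k
                out[(t, k)] = int(np.dot(C_rows[k], D_cols[t - k]))
    return out
```

Fed with the four columns of the 8×4 feature map matrix and the two rows of the 2×8 kernel matrix from FIG. 3, `out` collects the same eight values as the 2×4 output feature map.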
  • When the precision of the feature map data and the convolution kernel data is lower than N bits, each systolic unit thus outputs a complete dot product result; when the precision of the feature map data and the convolution kernel data is higher than N bits, each systolic unit outputs a partial result, and two or more systolic units are combined to complete the dot product operation of the expanded feature map data and the expanded convolution kernel data.
  • Specifically, each expanded feature map datum is written synchronously to the top M systolic units and propagates downward in groups of M systolic units; each expanded convolution kernel datum is interleaved to obtain M interleaved convolution kernel data, and the M interleaved convolution kernel data are written to the systolic units over time, with the M interleaved convolution kernel data written to M systolic units at each moment.
  • Each of the M systolic units performs a dot product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and outputs a partial result.
  • The interleaving of the convolution kernel data includes: dividing the expanded convolution kernel datum into M two's complement segments of N bits each, and copying each N-bit segment M times to obtain the interleaved convolution kernel data.
  • Taking the CC format as an example, the 8-bit feature map data and the 8-bit convolution kernel data are expanded into 10-bit expanded feature map data D and 10-bit expanded convolution kernel data C, respectively.
  • The expanded feature map data D is split into two 5-bit segments D[9:5] and D[4:0],
  • and the expanded convolution kernel data C is split into two 5-bit segments C[9:5] and C[4:0].
  • One systolic unit computes C[9:5] × (2^4 × D[9:5] + D[4:0]) to obtain the first partial result,
  • and the other systolic unit computes C[4:0] × (2^4 × D[9:5] + D[4:0]) to obtain the second partial result.
  • The summation sub-module 122 adds the first partial result and the second partial result (with the appropriate shift) to obtain the complete result.
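The identity behind this split is C × D = 2^4 · (C[9:5] × D) + C[4:0] × D, which is what the two partial results and the later shift-and-add implement. A quick numeric check, reusing the illustrative expand() sketch above (operand values chosen arbitrarily):

```python
C, D = -57, 93                    # two 8-bit operands
c_lo, c_hi = expand(C, 8)         # 5-bit segments, least significant first
d_lo, d_hi = expand(D, 8)

d_full = (d_hi << 4) + d_lo       # = D, i.e. 2**4 * D[9:5] + D[4:0]
part1 = c_hi * d_full             # one systolic unit: C[9:5] * D
part2 = c_lo * d_full             # the other systolic unit: C[4:0] * D
assert (part1 << 4) + part2 == C * D   # summation: shift the high partial by 4, add

# the same identity with four segments (SS format), weights 2**0 .. 2**12
C16, D16 = -30000, 12345
d16 = sum(d << (4 * i) for i, d in enumerate(expand(D16, 16)))
parts = [c * d16 for c in expand(C16, 16)]          # four systolic units
assert sum(p << (4 * i) for i, p in enumerate(parts)) == C16 * D16
```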
  • In the CC format, the expanded feature map data D is written to systolic unit 0 and systolic unit 1 and propagates downward in groups of two adjacent systolic units.
  • The data in the two systolic units of one group propagates to the two systolic units of the next group.
  • The expanded convolution kernel data C is written in sequence to two adjacent systolic units in the different groups of systolic units.
  • The numbers after D and C distinguish the expanded feature map data and the expanded convolution kernel data at different times, and the subscripts denote the results of interleaving the expanded convolution kernel data; C0 0/2 and C0 1/2 denote the even-numbered and odd-numbered interleaved parts, respectively, of the convolution kernel data at time T0.
  • At time T0, D0 is sent to systolic unit 0 and systolic unit 1,
  • C0 0/2 is sent to systolic unit 0,
  • and C0 1/2 is sent to systolic unit 1.
  • Systolic unit 0 computes D0×C0 0/2
  • and systolic unit 1 computes D0×C0 1/2; the two together form D0×C0.
  • At time T1, D1 is sent to systolic unit 0 and systolic unit 1,
  • C1 0/2 is sent to systolic unit 2,
  • C1 1/2 is sent to systolic unit 3,
  • and at the same time D0 propagates from systolic units 0 and 1 to systolic units 2 and 3.
  • Systolic unit 0 computes D1×C0 0/2
  • and systolic unit 1 computes D1×C0 1/2;
  • the two together form D1×C0.
  • Systolic unit 2 computes D0×C1 0/2
  • and systolic unit 3 computes D0×C1 1/2;
  • the two together form D0×C1.
  • At time T2, D2 is sent to systolic unit 0 and systolic unit 1,
  • C2 0/2 is sent to systolic unit 4,
  • C2 1/2 is sent to systolic unit 5,
  • D1 propagates from systolic units 0 and 1 to systolic units 2 and 3,
  • and D0 propagates from systolic units 2 and 3 to systolic units 4 and 5.
  • Systolic unit 0 computes D2×C0 0/2,
  • systolic unit 1 computes D2×C0 1/2,
  • and the two together form D2×C0;
  • systolic unit 2 computes D1×C1 0/2,
  • systolic unit 3 computes D1×C1 1/2, and the two together form D1×C1;
  • systolic unit 4 computes D0×C2 0/2,
  • systolic unit 5 computes D0×C2 1/2, and the two together form D0×C2; and so on.
  • In the SS format, the expanded feature map data D is written to systolic unit 0, systolic unit 1, systolic unit 2, and systolic unit 3, and propagates downward in groups of four adjacent systolic units.
  • Systolic units 0, 1, 2, and 3 propagate to systolic units 4, 5, 6, and 7, respectively.
  • The expanded convolution kernel data C is written in sequence to the four systolic units of the different groups of systolic units.
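The CC walkthrough above and this SS layout are instances of one pattern: with M systolic units per group (M = 2 for CC, M = 4 for SS), group g receives feature map column D(t-g) at time t, and unit j of the group holds the j-th interleaved segment of kernel row Cg. A toy scheduler that reproduces the walkthrough; the function and its arguments are illustrative:

```python
def group_schedule(num_times, num_kernel_rows, m):
    """Yield (time, unit, work) tuples: at time t, unit m*g + j computes the
    partial product of feature map column D(t-g) with interleaved kernel
    segment Cg j/m. The m partial results of a group form one complete dot
    product; m = 1 gives the HH schedule, m = 2 CC, m = 4 SS."""
    for t in range(num_times + num_kernel_rows - 1):
        for g in range(num_kernel_rows):
            if 0 <= t - g < num_times:
                for j in range(m):
                    yield (t, m * g + j, f"D{t - g} x C{g} {j}/{m}")

# Reproduce the CC (m = 2) walkthrough: at T1, units 0-1 compute D1 x C0
# while units 2-3 compute D0 x C1, matching the text above.
for t, unit, work in group_schedule(num_times=3, num_kernel_rows=3, m=2):
    print(f"T{t}: systolic unit {unit}: {work}")
```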
  • In the HH format, the output of each systolic unit is the complete result of the dot product of the expanded feature map data and the expanded convolution kernel data; in the CC and SS formats, however, the output of each systolic unit is only a partial result of that dot product, and the summation sub-module 122 needs to compress and saturate the partial results once and combine the partial results into a complete result.
  • The summation sub-module 122 includes a plurality of summation units, and the number of summation units is one quarter of the number of systolic units.
  • If the systolic array 121 includes 128 systolic units, the number of summation units is 32; each summation unit accepts the partial results output by 4 systolic units and compresses and saturates the partial results according to the current data format, finally producing, in every format, 16 bits of complete results in total.
  • Each summation unit receives the partial results (in the CC and SS formats) or the complete results (in the HH format) output by systolic unit 4n through systolic unit 4n+3, and processes data of different formats differently.
  • In the SS format, since the partial results output by the four systolic units together constitute one complete result, two adders first add the partial results output by systolic units 4n and 4n+1 and by systolic units 4n+2 and 4n+3, respectively.
  • The result ss1 is the calculation result for the lower 8 bits,
  • and the result ss2 is the calculation result for the upper 8 bits.
  • In the CC format, the partial results output by two systolic units together constitute one complete result,
  • so two adders can be used to add the partial results output by systolic units 4n and 4n+1 and by systolic units 4n+2 and 4n+3, respectively, and apply saturation to obtain two complete results.
  • For example, after the data in systolic unit 4n+1 is shifted left by 4 bits, it is added to the data in systolic unit 4n to obtain the result cc1, which then undergoes saturation to give the final complete result.
  • This complete result is a complete result in the CC format,
  • and the complete result is 8 bits.
  • Likewise, after the data in systolic unit 4n+3 is shifted left by 4 bits, it is added to the data in systolic unit 4n+2 to obtain the result cc2,
  • which then undergoes saturation to give the final complete result.
  • This complete result is also a complete result in the CC format,
  • and it is likewise 8 bits.
  • In the HH format, the output of each systolic unit after saturation is a complete result, and the complete result is 4 bits.
  • Depending on the format, each summation unit therefore outputs one 16-bit complete result in the SS format, two 8-bit complete results in the CC format, or four 4-bit complete results in the HH format, as sketched below.
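The per-format combining can be summarized in a sketch. The shift-by-4 within each pair and the result widths follow the text; the final combine for the SS format (weighting ss2 by 2^8 over ss1 before saturating to 16 bits) is inferred from the segment weights rather than stated explicitly, so treat it as an assumption.

```python
def saturate(x, bits):
    """Clamp x to the range of a signed `bits`-bit integer."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def summation_unit(p, fmt):
    """Combine the outputs p[0..3] of systolic units 4n..4n+3.
    HH: four complete results, saturated to 4 bits each.
    CC: (p[1] << 4) + p[0] and (p[3] << 4) + p[2], each saturated to 8 bits.
    SS: the two pair sums ss1 (lower weight) and ss2 (upper weight) are
        combined into one 16-bit result (final combine inferred)."""
    if fmt == "HH":
        return [saturate(x, 4) for x in p]
    if fmt == "CC":
        return [saturate((p[1] << 4) + p[0], 8),
                saturate((p[3] << 4) + p[2], 8)]
    ss1 = (p[1] << 4) + p[0]      # result for the lower 8 bits of C
    ss2 = (p[3] << 4) + p[2]      # result for the upper 8 bits of C
    return [saturate((ss2 << 8) + ss1, 16)]
```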
  • The complete results output by the summation units are not aligned in timing.
  • An upper summation unit always produces its output earlier than a lower one, so the timing alignment module 130 is needed to apply a triangular delay to align the timing of the results output by the summation units.
  • The feature map data expanded under the HH, CC, and SS formats enter the systolic array 121 at different times, so their triangular delay schemes also differ.
  • In the HH format, the timing alignment module 130 delays the four HH-format complete results output by each summation unit by one calculation cycle in turn.
  • In the CC format, the timing alignment module 130 delays the latter two CC-format outputs by one calculation cycle relative to the former two CC-format outputs.
  • In the SS format, the timing alignment module 130 delays the SS-format complete result output by each summation unit
  • by one calculation cycle relative to the SS-format complete result output by the previous summation unit.
  • In summary, the convolution calculation device 100 uses a systolic array for the convolution calculation, which saves the loading time of the feature map data and the convolution kernel data, increases the calculation density of the convolution, and speeds up neural network inference.
  • In addition, the convolution calculation device 100 can be applied to a variety of precisions, and has high flexibility and compatibility.
  • FIG. 12 shows a flowchart of a convolution calculation method 1200 according to an embodiment of the present invention.
  • The convolution calculation method 1200 can be implemented by the convolution calculation device 100 described above. Only the main steps of the convolution calculation method 1200 are described below; for further details, refer to the description above.
  • the convolution calculation method 1200 includes the following steps:
  • Step S1210: obtain feature map data and convolution kernel data, and expand the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation, and M is a positive integer;
  • Step S1220: perform a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain the operation results;
  • Step S1230: perform timing alignment on the operation results, and output the aligned operation results.
  • The convolution calculation method 1200 provided by the embodiment of the present invention adopts a systolic operation scheme and exploits the weight-sharing and feature-map-sharing characteristics of convolutional neural networks: data can be loaded once and used for multiple calculations, and the precision can be dynamically configured.
  • In one embodiment, the minimum bit width for the convolution multiplication of the feature map data and the convolution kernel data is 5 bits, and the method can be applied to 4-bit, 8-bit, or 16-bit precision, the operands being expanded to 5, 10, and 20 bits, respectively, for the calculation.
  • In step S1210, after the feature map and the convolution kernel are obtained, they are first expanded to obtain the feature map data and the convolution kernel data, in order to facilitate hardware calculation.
  • The convolution operation is thereby transformed into a matrix multiplication.
  • In this way, the data required during the operation can be stored in contiguous memory, thereby improving the access speed.
  • The obtained feature map data and convolution kernel data are then expanded so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation.
  • The expansion method is as follows: if the precision of the feature map data and the convolution kernel data is lower than N bits, each feature map datum and each convolution kernel datum is padded so that it is expanded to N bits; if the precision of the feature map data and the convolution kernel data is higher than N bits, each feature map datum and each convolution kernel datum is split into M two's complement segments of equal length, and each segment is padded separately so that each segment is expanded to N bits.
  • For example, N is equal to 5.
  • If the feature map data and the convolution kernel data have 4-bit precision, they are expanded to 5 bits; if they have 8-bit precision, they are expanded to 10 bits; if they have 16-bit precision, they are expanded to 20 bits.
  • The two's complement segments are padded as follows: if the precision of the feature map data and the convolution kernel data is lower than N bits, the feature map data and the convolution kernel data are padded with their sign bit; if the precision of the feature map data and the convolution kernel data is higher than N bits, the most significant segment of each datum is padded with its sign bit, and the remaining M-1 segments are padded with 0.
  • In this way, the feature map data and convolution kernel data of every precision are uniformly expanded to M times the minimum bit width N of the convolution operation, and the corresponding number of systolic units can then be invoked for the calculation according to the value of M; there is no need to uniformly expand data of all precisions to the highest precision, that is, compatibility with multi-precision data is achieved, as detailed below.
  • In step S1220, the systolic operation is completed on the basis of a systolic array.
  • The systolic array uses the weight-sharing and feature-map-sharing characteristics of the convolutional neural network to complete the convolution calculation.
  • The data can be loaded once and used for multiple calculations, which saves the loading time of the feature map data and the convolution kernel data, increases the calculation density of the convolution, and accelerates neural network inference.
  • The systolic array includes a plurality of systolic units, and the number of systolic units is generally set to a multiple of 4.
  • For example, the number of systolic units can be 128.
  • The systolically propagated elements in the systolic array are the expanded feature map data; the expanded convolution kernel data is not propagated.
  • The expanded convolution kernel data is loaded in sequence into each systolic unit, multiplied by the expanded feature map data, and is not propagated systolically.
  • At each moment, the expanded feature map data is written to the uppermost systolic unit in the systolic array and propagates systolically downward from the uppermost systolic unit; the expanded convolution kernel data is written to the systolic units in sequence over time.
  • Each systolic unit performs a dot product operation on the expanded feature map data and the expanded convolution kernel data received at the current moment, and outputs the dot product result.
  • The expanded convolution kernel data is sent to each systolic unit by broadcast and loaded at the appropriate time.
  • The expanded feature map data is always written to the top one or more systolic units and propagates systolically down to the lowest systolic unit; in each calculation cycle, a dot product is performed on the expanded feature map data and the expanded convolution kernel data in the current systolic unit. If the systolic operation has 128 calculation cycles, each systolic unit outputs 128 results, and the 128 systolic units output a total of 128×128 results. If fewer than 128 systolic units are required, only some of the systolic units are activated, and the remaining systolic units do not participate in the calculation.
  • Each systolic unit includes a multi-stage pipeline, in which the first pipeline stage includes a plurality of multipliers, and the second through last pipeline stages each include a plurality of adders.
  • The systolic unit multiplies the expanded feature map data and the expanded convolution kernel data, and performs multi-level addition on the multiplication results in sequence.
  • Each multiplier in the first pipeline stage performs a one-to-one multiplication of the expanded convolution kernel data and the expanded feature map data.
  • For example, each multiplier receives one 5-bit feature map datum and one 5-bit convolution kernel datum and multiplies them.
  • When the precision is higher than N bits, each multiplier performs a one-to-one multiplication of an N-bit two's complement segment of the expanded convolution kernel data and an N-bit two's complement segment of the expanded feature map data.
  • For example, each multiplier receives a 5-bit two's complement segment split from the expanded feature map data and a 5-bit two's complement segment split from the expanded convolution kernel data and multiplies them.
  • The expanded feature map data and the expanded convolution kernel data are expanded to M times the minimum bit width N of the convolution operation, and each systolic unit performs convolution operations at the minimum bit width N. Thus, when the precision of the feature map data and the convolution kernel data is lower than N bits, each individual systolic unit performs the dot product operation on the expanded feature map data and the expanded convolution kernel data, and the dot product result is a complete result; when the precision of the feature map data and the convolution kernel data is higher than N bits, each individual systolic unit completes part of the dot product operation of the expanded feature map data and the expanded convolution kernel data and outputs a partial result.
  • The partial results output by M adjacent systolic units together constitute the complete result of one dot product operation.
  • In the HH (4-bit) format, each systolic unit completes the dot product of one row of expanded convolution kernel data and one column of expanded feature map data, and outputs a complete dot product result;
  • in the CC (8-bit) format, two adjacent systolic units jointly complete the dot product of one row of expanded convolution kernel data and one column of expanded feature map data;
  • in the SS (16-bit) format, four adjacent systolic units jointly complete the dot product of one row of expanded convolution kernel data and one column of expanded feature map data.
  • The expanded feature map data is sent at every moment to the systolic unit at the top of the systolic array and propagates systolically from the upper systolic units to the lower systolic units; the expanded convolution kernel data is written into the systolic units in sequence, and each systolic unit outputs a complete dot product result.
  • When the precision is higher than N bits, the output of each systolic unit is a partial result, and two or more systolic units are combined to complete the dot product operation of the expanded feature map data and the expanded convolution kernel data.
  • Specifically, each expanded feature map datum is written synchronously to the top M systolic units and propagates downward in groups of M systolic units; each expanded convolution kernel datum is interleaved to obtain M interleaved convolution kernel data, and the M interleaved convolution kernel data are written to the systolic units over time, with the M interleaved convolution kernel data written to M systolic units at each moment.
  • Each of the M systolic units performs a dot product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and outputs a partial result.
  • The interleaving of the convolution kernel data includes: dividing the expanded convolution kernel datum into M two's complement segments of N bits each, and copying each N-bit segment M times to obtain the interleaved convolution kernel data.
  • Taking the CC format as an example, the 8-bit feature map data and the 8-bit convolution kernel data are expanded into 10-bit expanded feature map data D and 10-bit expanded convolution kernel data C, respectively.
  • The expanded feature map data D is split into two 5-bit segments D[9:5] and D[4:0],
  • and the expanded convolution kernel data C is split into two 5-bit segments C[9:5] and C[4:0].
  • One systolic unit computes C[9:5] × (2^4 × D[9:5] + D[4:0]) to obtain the first partial result,
  • and the other systolic unit computes C[4:0] × (2^4 × D[9:5] + D[4:0]) to obtain the second partial result.
  • The first partial result and the second partial result are then added (with the appropriate shift) to obtain the complete result.
  • In the SS format, the expanded feature map data is written at each moment to 4 adjacent systolic units and propagates downward in groups of 4 systolic units; at the same time, the expanded convolution kernel data is interleaved and written in sequence to the four systolic units of the different groups of systolic units.
  • The partial results output by the four systolic units of each group together constitute the complete result.
  • In the HH format, the output of each systolic unit is the complete result of the dot product of the expanded feature map data and the expanded convolution kernel data; in the CC and SS formats, the output of each systolic unit is only a partial result of that dot product, and the partial results need to be compressed and saturated once and combined to form the complete result.
  • Regarding timing alignment: since the complete results obtained by the addition are not aligned in timing, a triangular delay needs to be applied to align the timing of the results output by the summation units.
  • For the specific manner of the timing alignment, refer to the description above; it is not repeated here.
  • In summary, the convolution calculation method 1200 uses a systolic array for the convolution calculation, which saves the loading time of the feature map data and the convolution kernel data, increases the calculation density of the convolution, and speeds up neural network inference.
  • In addition, the convolution calculation method 1200 is applicable to a variety of precisions, and has high flexibility and compatibility.
  • the embodiment of the present invention also provides a computer storage medium on which a computer program is stored.
  • the computer storage medium is a computer readable storage medium.
  • the computer storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media.
  • the computer-readable storage medium may be any combination of one or more computer-readable storage media.
  • When executed by a computer or a processor, the computer program instructions stored on the computer storage medium cause the computer or the processor to perform the following steps:
  • obtaining feature map data and convolution kernel data, and expanding the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each feature map datum and each convolution kernel datum is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data can perform the convolution multiplication operation, and M is a positive integer;
  • performing a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain operation results;
  • performing timing alignment on the operation results, and outputting the aligned operation results.
  • An embodiment of the present invention also provides a computer program product containing instructions which, when executed by a computer, cause the computer to execute the steps of the convolution calculation method 1200 shown in FIG. 12.
  • The convolution calculation method, convolution calculation device, and computer storage medium of the embodiments of the present invention use a systolic array for the convolution calculation, which saves the loading time of the feature map data and the convolution kernel data, increases the calculation density of the convolution, and accelerates neural network inference.
  • Moreover, the aforementioned convolution calculation device and method can be applied to a variety of precisions, and have high flexibility and compatibility.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center integrated with one or more available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a digital video disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present invention essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disk or optical disk and other media that can store program code .
  • the disclosed device and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a division by logical function, and there may be other ways of division in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
  • the various component embodiments of the present invention may be implemented by hardware, or by software modules running on one or more processors, or by a combination of them.
  • a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules according to the embodiments of the present invention.
  • the present invention can also be implemented as a device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein.
  • Such a program for realizing the present invention may be stored on a computer-readable medium, or may have the form of one or more signals.
  • Such a signal can be downloaded from an Internet website, or provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Complex Calculations (AREA)

Abstract

A convolution computation device (100), a convolution computation method, and a computer storage medium. The convolution computation device (100) comprises: an expansion module (110), configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer; a systolic operation module (120), configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output an operation result; and a timing alignment module (130), configured to align the timing of the operation result output by the systolic operation module, and to output the aligned operation result. The precision of the convolution computation method is dynamically configurable, and data loaded once can be used for multiple computations.

Description

Convolution computation device and method, and computer storage medium

Technical Field
The present invention relates to the technical field of convolution computation, and in particular to a convolution computation device, a convolution computation method, and a computer storage medium.
Background
A convolutional neural network is a feedforward neural network composed of one or more convolutional layers, pooling layers, activation layers, and fully connected layers. Compared with other deep learning architectures, convolutional neural networks give better results in image and speech recognition. Compared with other deep feedforward neural networks, convolutional neural networks feature weight sharing, data sharing, and sparsity, and require fewer parameters and less computation, making them an attractive deep learning architecture. In a convolutional neural network, the output feature map of each layer is the sum of products of a set of weights and the corresponding input feature maps, and the output of one layer is in turn the input of the next hidden layer or of the output layer.
Convolution computation is the process in which a weight matrix slides over a feature map matrix with a fixed stride; at each position, a region of the feature map equal in size to the weight matrix is multiplied with the weight matrix, and the products of the different channels are summed to produce one point of the output feature map. Weight sharing and feature map sharing are two important properties of convolutional neural networks. However, when current convolutional neural networks perform convolution computation, loading the weights and feature maps occupies a large amount of bandwidth, and the sharing properties are not exploited: the weights and feature maps loaded each time can be used for only one computation and must be reloaded for the next, even though part of the weights and feature maps required by different computations overlaps, causing redundant loading overhead.
As networks grow larger, the memory consumed by neural networks becomes a problem that cannot be ignored, especially on mobile devices, where commonly used neural networks (in single-precision format) generally range from tens to hundreds of MB. Network size brings not only memory capacity problems but also memory bandwidth and battery consumption problems, limiting the deployment of neural networks on mobile devices.
In addition, current quantization schemes usually train the network in single-precision format, convert the trained network into a low-precision fixed-point format of 16 bits, 8 bits, or 4 bits, and then deploy it to mobile devices for inference. This requires the mobile side to support neural network inference at multiple precisions. Usually, low precision is expanded to high precision, that is, 4-bit and 8-bit data are uniformly expanded to 16 bits, achieving compatibility with multiple precisions. However, this approach cannot fully utilize the compute capability and IO (input/output) bandwidth at low precision: each halving of precision wastes 75% of the compute capability and 50% of the IO bandwidth.
Summary of the Invention
This summary introduces a selection of concepts in simplified form that are described in further detail in the detailed description. This summary is neither intended to identify key or essential features of the claimed technical solution, nor intended to determine the scope of protection of the claimed technical solution.
In view of the deficiencies of the prior art, a first aspect of the embodiments of the present invention provides a convolution computation device, comprising:

an expansion module, configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer;

a systolic operation module, configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output an operation result;

a timing alignment module, configured to align the timing of the operation result output by the systolic operation module, and to output the aligned operation result.
A second aspect of the embodiments of the present invention provides a convolution computation method, comprising:

obtaining feature map data and convolution kernel data, and expanding the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer;

performing a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain an operation result;

aligning the timing of the operation result, and outputting the aligned operation result.
A third aspect of the embodiments of the present invention provides a computer storage medium having a computer program stored thereon; when the computer program is executed by a processor, the steps of the convolution computation method of the embodiments of the present invention are implemented.
In the convolution computation method, convolution computation device, and computer storage medium of the embodiments of the present invention, the precision is dynamically configurable, and data loaded once can be used for multiple computations.
Brief Description of the Drawings
The following drawings are included as a part of the present invention for understanding the present invention. The drawings illustrate embodiments of the present invention and their description, which serve to explain the principles of the present invention.
In the drawings:
Fig. 1 shows a structural block diagram of a convolution computation device according to an embodiment of the present invention;
Fig. 2 shows a feature map and convolution kernels according to an embodiment of the present invention;
Fig. 3 shows the unrolled feature map and convolution kernels according to an embodiment of the present invention;
Fig. 4 shows a structural diagram of a convolution computation device according to an embodiment of the present invention;
Fig. 5 shows a structural diagram of a systolic unit according to an embodiment of the present invention;
Fig. 6 shows the systolic operation process of a systolic array at 4-bit precision according to an embodiment of the present invention;
Fig. 7 shows the distribution of expanded feature map data and expanded convolution kernel data between adjacent systolic units at 8-bit precision according to an embodiment of the present invention;
Fig. 8 shows the systolic operation process of a systolic array at 8-bit precision according to an embodiment of the present invention;
Fig. 9 shows the systolic operation process of a systolic array at 16-bit precision according to an embodiment of the present invention;
Fig. 10 shows a structural diagram of a summation submodule according to an embodiment of the present invention;
Fig. 11 shows a structural diagram of a timing alignment module according to an embodiment of the present invention;
Fig. 12 shows a flowchart of a convolution computation method according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, example embodiments according to the present invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention, and it should be understood that the present invention is not limited by the example embodiments described here. Based on the embodiments of the present invention described herein, all other embodiments obtained by those skilled in the art without creative effort shall fall within the scope of protection of the present invention.

In the following description, numerous specific details are given to provide a more thorough understanding of the present invention. However, it is obvious to those skilled in the art that the present invention can be implemented without one or more of these details. In other examples, some technical features well known in the art are not described in order to avoid obscuring the present invention.

It should be understood that the present invention can be implemented in different forms and should not be construed as being limited to the embodiments presented here. On the contrary, these embodiments are provided so that the disclosure is thorough and complete and fully conveys the scope of the present invention to those skilled in the art.

The terms used here are only for describing specific embodiments and are not intended to limit the present invention. As used here, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "composed of" and/or "comprising", when used in this specification, specify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups. As used here, the term "and/or" includes any and all combinations of the listed related items.

For a thorough understanding of the present invention, detailed steps and structures are presented in the following description to explain the technical solution proposed by the present invention. Preferred embodiments of the present invention are described in detail below; however, in addition to these detailed descriptions, the present invention may also have other implementations.
The convolution computation device, convolution computation method, and computer storage medium of the embodiments of the present invention are described in detail below with reference to the drawings. Where no conflict arises, the features of the following embodiments and implementations can be combined with one another.
Fig. 1 shows a structural block diagram of a convolution computation device 100 according to an embodiment of the present invention. As shown in Fig. 1, the convolution computation device 100 comprises an expansion module 110, a systolic operation module 120, and a timing alignment module 130; the expansion module 110 is communicatively connected with the systolic operation module 120, and the systolic operation module is communicatively connected with the timing alignment module, wherein:
the expansion module 110 is configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer;
the systolic operation module 120 is configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output an operation result;
the timing alignment module 130 is configured to align the timing of the operation result output by the systolic operation module 120, and to output the aligned operation result.
The convolution computation device 100 proposed by the embodiments of the present invention adopts a systolic array structure and exploits the weight-sharing and feature-map-sharing properties of convolutional neural networks: data loaded once can be used for multiple computations, and the precision is dynamically configurable. Illustratively, in the convolution computation device 100 the minimum bit width at which the feature map data and convolution kernel data undergo convolution multiplication is 5 bits; the device can be configured to 4-bit, 8-bit, or 16-bit precision, and the data are expanded to 5, 10, or 20 bits, respectively, for the operation. Herein, the 4-bit, 8-bit, and 16-bit precision formats are called the HH, CC, and SS formats, respectively, with compute capabilities of 16K MAC@HH, 4K MAC@CC, and 1K MAC@SS, making full use of the compute capability provided by the hardware.
In one embodiment, after the feature map and the convolution kernels are obtained, to facilitate hardware computation, the feature map and the convolution kernels are first unrolled, converting the convolution operation into matrix multiplication. Convolution slides the kernel over the feature map as a sliding window, multiplies the corresponding elements within the current window, and sums them to obtain a result. A vector inner product is likewise a multiply-then-sum, so the elements within each window can be unrolled into a vector and computed as a vector inner product. After the convolution is converted into matrix multiplication, the data needed for the operation can be stored in contiguous memory, improving access speed.
For example, referring to Figs. 2 and 3, Fig. 2 shows a feature map and convolution kernels used for convolution computation. In the example of Fig. 2, the feature map is 3×3 with 2 channels, and there are 2 convolution kernels, kernel 0 and kernel 1, each of size 2×2 with 2 channels. The convolution slides the 2×2 kernel over the feature map matrix, multiplying it each time with a 2×2 region of the feature map matrix; the products of the two channels are added to obtain one point of the output feature map. For example (the concrete matrices appear as inline images in the original): the channel-0 part of kernel 0 is first multiplied with the 2×2 region in the top-left corner of channel 0 of the feature map matrix, the channel-1 part of kernel 0 is multiplied with the top-left 2×2 region of channel 1, and the two are added to obtain the first point of the first row of channel 0 of the output feature map; likewise, the channel-0 part of kernel 1 is multiplied with the top-left region of channel 0, the channel-1 part of kernel 1 with the top-left region of channel 1, and the two are added to obtain the first point of the first row of channel 1 of the output feature map. Proceeding in this manner finally yields a 2-channel output feature map, each channel of size 2×2.

Fig. 3 shows the feature map data and convolution kernel data obtained by unrolling the feature map and convolution kernels of Fig. 2 in a matrixed (img2col) manner. The kernels are unrolled in row, column, channel order into a 2×8 matrix, and the feature map is likewise unrolled into an 8×4 matrix; multiplying each row of the kernel matrix with each column of the feature map matrix yields one output point, turning the original matrix convolution into vector dot products. Specifically, as shown in Fig. 3, C0 and C1 are loaded into dot-product core 0 and dot-product core 1, respectively, and D3, D2, D1, D0 are loaded in turn into dot-product core 0 to be multiplied with C0. Here, C0 is kernel 0 unrolled in row, column, channel order, and D3 is the top-left 2×2 region of the feature map matrix unrolled in row, column, channel order; multiplying C0 with D3 is the same as the convolution described above, and both yield the first point of the first row of the output feature map. By the same principle, multiplying C0 with D3, D2, D1, and D0 yields the 4 points of channel 0 of the output feature map. Afterwards, D3, D2, D1, D0 are passed from dot-product core 0 to dot-product core 1 and multiplied with C1, yielding the four points of channel 1 of the output feature map. Thus the kernel data only need to be loaded 2 times and the feature map data only 4 times, and multiplying the kernel data and the feature map data yields the 2×4 output feature map.
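To make the img2col unrolling concrete, the following is a minimal NumPy sketch (not part of the patent disclosure; the names im2col and kernels2rows are hypothetical) that reproduces the 2×8 by 8×4 product of Fig. 3 under the row, column, channel unrolling order described above:

```python
import numpy as np

def im2col(feature_map, kh, kw, stride=1):
    """Unroll the sliding windows of a (C, H, W) feature map into columns,
    so that convolution becomes one matrix multiplication."""
    c, h, w = feature_map.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.zeros((c * kh * kw, out_h * out_w), dtype=feature_map.dtype)
    col = 0
    for i in range(0, h - kh + 1, stride):
        for j in range(0, w - kw + 1, stride):
            patch = feature_map[:, i:i + kh, j:j + kw]
            # flatten in row, column, channel order, matching the text
            cols[:, col] = patch.transpose(1, 2, 0).reshape(-1)
            col += 1
    return cols

def kernels2rows(kernels):
    """Unroll (num_kernels, C, kh, kw) kernels, one row per kernel."""
    n, c, kh, kw = kernels.shape
    return kernels.transpose(0, 2, 3, 1).reshape(n, -1)

# With the sizes of Fig. 2: feature map (2, 3, 3), two (2, 2, 2) kernels,
# kernels2rows(k) @ im2col(f, 2, 2) is a 2x8 by 8x4 product -> 2x4 output.
```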
Optionally, the convolution computation device 100 of the embodiments of the present invention may include an unrolling unit configured to perform the unrolling process described above; alternatively, the convolution computation device 100 may directly obtain the unrolled feature map data and convolution kernel data. The feature map data and convolution kernel data may be stored in a storage unit; a control unit determines the feature map data and convolution kernel data that need to be loaded into the convolution computation device 100 and loads them from the storage unit into the device. The convolution computation device 100 provided by the embodiments of the present invention may include the storage unit and the control unit; alternatively, the device may not include a storage unit but include a communication interface that communicates with the storage unit, and may further include a cache unit configured to cache the data received from the communication interface.
Refer to Fig. 4 for the specific structure of the convolution computation device 100 provided by the embodiments of the present invention. As shown in Fig. 4, the convolution computation device 100 comprises an expansion module 110, a systolic operation module 120, and a timing alignment module 130 connected in sequence; in some embodiments, the systolic operation module 120 comprises a systolic array 121 and a summation submodule 122 connected in sequence. After the feature map data and convolution kernel data are input into the convolution computation device 100, the expansion module 110 first expands them so that the bit width of each piece of feature map data and convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication.
In one embodiment, the expansion module 110 expands the data as follows: if the precision of the feature map data and convolution kernel data is lower than N bits, each piece of feature map data and each piece of convolution kernel data is padded to expand it to N bits; if the precision is higher than N bits, each piece of feature map data and each piece of convolution kernel data is divided into M pieces of N-bit data. All of the N-bit data mentioned above are in complement (two's-complement) format, and each group of bits in this format is referred to as a complement code of the format.
For example, if N equals 5: feature map data and convolution kernel data at 4-bit precision are expanded to 5 bits; at 8-bit precision, to 10 bits; and at 16-bit precision, to 20 bits.
The main purpose of expanding in this way is to facilitate computation. For example, a 16-bit complement-format number B can be decomposed by 4-bit groups (reconstructing the inline equations of the original) as

B = 2^12 × sext(B[15:12]) + 2^8 × zext(B[11:8]) + 2^4 × zext(B[7:4]) + zext(B[3:0]),

where sext denotes sign extension and zext denotes zero extension. B can therefore be split and expanded into one 4-bit complement code and three 5-bit complement codes; for ease of computation, all are uniformly expanded to 5 bits, so the expanded bit width becomes 20 bits. Similarly, an 8-bit complement-format number B can be split into 2 complement codes of 5 bits, and the expanded bit width becomes 10 bits.
Further, the padding of the bits in the complement format is as follows: if the precision of the feature map data and convolution kernel data is lower than N bits, the feature map data and the convolution kernel data are sign-bit padded; if the precision is higher than N bits, the most significant complement code of the feature map data and the convolution kernel data is sign-bit padded, and the remaining M-1 complement codes are padded with 0.
For example, for the HH (4-bit), CC (8-bit), and SS (16-bit) precision formats, the padding is as follows (a sketch of these rules in code follows the list):

1) In the HH format, each piece of feature map data (data) and each piece of convolution kernel data (coeff) is sign-bit padded, i.e., data[3:0] becomes {data[3], data[3:0]} and coeff[3:0] becomes {coeff[3], coeff[3:0]};

2) In the CC format, the low 4 bits of each piece of feature map data (data) and convolution kernel data (coeff) are 0-padded and the high 4 bits are sign-bit padded, i.e., data[7:0] becomes {data[7], data[7:4], 0, data[3:0]} and coeff[7:0] becomes {coeff[7], coeff[7:4], 0, coeff[3:0]};

3) In the SS format, the most significant group of 4 bits of each piece of feature map data (data) and convolution kernel data (coeff) is sign-bit padded and the remaining three groups are 0-padded, i.e., data[15:0] becomes {data[15], data[15:12], 0, data[11:8], 0, data[7:4], 0, data[3:0]} and coeff[15:0] becomes {coeff[15], coeff[15:12], 0, coeff[11:8], 0, coeff[7:4], 0, coeff[3:0]}.
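As a sketch of these padding rules (a hypothetical helper, not the patent's implementation), the following Python splits a signed value into 5-bit complement codes exactly as in 1)-3) above:

```python
def expand(value, precision):
    """Split a signed `precision`-bit value into 5-bit complement codes:
    the most significant 4-bit group is sign-bit padded, the remaining
    groups are 0-padded. Returns the codes from MSB group to LSB group."""
    mask = (1 << precision) - 1
    bits = value & mask                          # two's-complement pattern
    codes = []
    for shift in range(precision - 4, -1, -4):
        nibble = (bits >> shift) & 0xF
        if shift == precision - 4:               # most significant group
            sign = (nibble >> 3) & 1
            codes.append((sign << 4) | nibble)   # {sign, nibble}
        else:
            codes.append(nibble)                 # {0, nibble}
    return codes

# HH: expand(-3, 4)      -> one 5-bit code
# CC: expand(-100, 8)    -> two 5-bit codes
# SS: expand(-30000, 16) -> four 5-bit codes
```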
After the splitting and expansion by the expansion module 110, feature map data and convolution kernel data of every precision are uniformly expanded to M times the minimum convolution bit width N; subsequently, a corresponding number of systolic units can be invoked according to the value of M for the computation. There is no need to uniformly expand data of all precisions to the highest precision, and compatibility with multi-precision data is thus achieved, as detailed below.
Continuing with Fig. 4, the systolic operation module 120 mainly comprises the systolic array 121. The systolic array 121 exploits the weight-sharing and feature-map-sharing properties of convolutional neural networks to complete the convolution computation: data loaded once can be used for multiple computations, which saves loading time for feature map data and convolution kernel data, increases the computational density of the convolution, and speeds up neural network inference.
The systolic array 121 comprises multiple systolic units; the number of systolic units is generally set to a multiple of 4, for example, 128.
In one embodiment of the present application, the elements that propagate systolically in the systolic array 121 (i.e., that are forwarded between the systolic units of the systolic array 121) are the expanded feature map data, not the expanded convolution kernel data; the convolution kernel data are loaded in order into each systolic unit to be multiplied with the expanded feature map data, and are not propagated systolically. In other embodiments, the expanded convolution kernel data may instead be propagated systolically in the systolic array 121, with the expanded feature map data loaded in turn into each systolic unit. The following description takes the expanded feature map data as the systolically propagated element.
In one embodiment, the expanded feature map data are written at each moment into the topmost systolic unit of the systolic array and propagate systolically downward from the topmost systolic unit; the expanded convolution kernel data are written into the systolic units in sequence by moment. A systolic unit performs a dot-product operation on the expanded feature map data and expanded convolution kernel data received at the current moment, and outputs the dot-product result.
Specifically, the expanded convolution kernel data are broadcast into every systolic unit and loaded at the appropriate moment, while the expanded feature map data are always written into the topmost one or more systolic units and propagate systolically toward the bottommost systolic unit; in each computation cycle, one dot product is performed on the expanded feature map data and expanded convolution kernel data within the current systolic unit. If the systolic operation runs for 128 computation cycles, each systolic unit outputs 128 results, and the 128 systolic units output 128×128 results in total. If fewer than 128 systolic units are needed, only some systolic units are activated and the remaining systolic units do not participate in the operation.
In one embodiment, the systolic unit comprises a multi-stage pipeline, where the first pipeline stage comprises multiple multipliers, and the second through last pipeline stages each comprise a number of adders, each adder receiving two output data of the previous pipeline stage and adding them.
When the precision of the feature map data and convolution kernel data is lower than N bits, each multiplier in the first pipeline stage performs a one-to-one multiplication of the expanded convolution kernel data and the expanded feature map data. For example, in the HH format each multiplier receives one 5-bit feature map datum and one 5-bit convolution kernel datum for multiplication. When the precision of the feature map data and convolution kernel data is higher than N bits, each multiplier performs a one-to-one multiplication of one N-bit complement code from the expanded convolution kernel data and one N-bit complement code from the expanded feature map data. For example, in the CC or SS format each multiplier receives one 5-bit complement code split out of the expanded feature map data and one 5-bit complement code split out of the expanded convolution kernel data for multiplication. The specific details of the operation performed by the systolic unit under the different formats are given below.
Referring to Fig. 5, a systolic unit according to an embodiment of the present invention is described in detail below, continuing with N equal to 5 as an example.
The systolic unit shown in Fig. 5 comprises an 8-stage pipeline. The first pipeline stage 502 comprises a register for storing the expanded feature map data, a register for storing the expanded convolution kernel data, and 128 5-bit signed multipliers, each completing the multiplication of one 5-bit feature map datum and one 5-bit convolution kernel datum, with the MSB (most significant bit) at the far left and the LSB (least significant bit) at the far right; each multiplier outputs a 9-bit result into the second pipeline stage.
The second pipeline stage 504 comprises 64 adders, each completing one addition of two multiplication results output by the first pipeline stage. Note that in the CC and SS formats, because the data carry different weights, the left input of each adder must be shifted left by 4 bits; data in the HH format need no shift.
The third pipeline stage 506 comprises 32 adders. In the SS format, likewise because the data carry different weights, the left input of each adder must be shifted left by 8 bits; data in the CC and HH formats need no shift.
The fourth pipeline stage 508 through the eighth pipeline stage 516 comprise 16, 8, 4, 2, and 1 adders, respectively; the adders in each stage add the results output by two adders of the previous stage. The eighth pipeline stage finally outputs a 25-bit operation result.
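A behavioral sketch of this 8-stage dot-product pipeline (hypothetical code, not the patent's implementation; it assumes the MSB group of each operand sits in the left lane, as stated above, and applies the stage-3 shift in the SS format):

```python
def to_signed5(code):
    """Interpret a 5-bit complement code as a signed value."""
    return code - 32 if code & 0x10 else code

def systolic_unit_dot(d_codes, c_codes, fmt):
    """128 5-bit signed products (pipeline stage 1), then 7 levels of
    pairwise addition (stages 2-8); the left, higher-weight input is
    shifted by 4 at the first adder level for CC/SS and by 8 at the
    second adder level for SS."""
    assert len(d_codes) == len(c_codes) == 128
    vals = [to_signed5(d) * to_signed5(c) for d, c in zip(d_codes, c_codes)]
    for level in range(7):
        if level == 0 and fmt in ("CC", "SS"):
            shift = 4
        elif level == 1 and fmt == "SS":
            shift = 8
        else:
            shift = 0
        vals = [(vals[2 * i] << shift) + vals[2 * i + 1]
                for i in range(len(vals) // 2)]
    return vals[0]   # the result output by the eighth stage
```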
It should be understood that the structure shown in Fig. 5 is only an example and not a limitation. The systolic unit may adopt other implementations, as long as it performs a dot-product operation on the expanded feature map data and the expanded convolution kernel data.
As described above, the expanded feature map data and expanded convolution kernel data passed from the expansion module 110 to the systolic operation module 120 are expanded to M times the minimum convolution bit width N, and each systolic unit performs convolution at the minimum bit width N. When the precision of the feature map data and convolution kernel data is lower than N bits, each individual systolic unit performs the dot-product operation on the expanded feature map data and expanded convolution kernel data, and the dot-product result is a complete result; when the precision is higher than N bits, each individual systolic unit completes part of the dot-product operation on the expanded feature map data and expanded convolution kernel data and outputs a partial result, and the partial results output by M adjacent systolic units together constitute the complete result of one dot-product operation.
For example, in the HH (4-bit) format, each systolic unit completes the dot product of one row of expanded convolution kernel data and one column of expanded feature map data and outputs the complete dot-product result; in the CC (8-bit) format, two adjacent systolic units jointly complete the dot product of one row of expanded convolution kernel data and one column of expanded feature map data; and in the SS (16-bit) format, four adjacent systolic units jointly complete that dot product.
The systolic operation process implemented by the systolic array 121 in the HH format according to an embodiment of the present invention is described below with reference to Fig. 6.
As shown in Fig. 6, in the HH format the expanded feature map data are always sent at each moment to the topmost systolic unit of the systolic array and propagate systolically from the upper units to the lower units, while the expanded convolution kernel data are written in order into systolic unit 0, systolic unit 1, systolic unit 2, and so on. In Fig. 6, D and C denote the expanded feature map data and expanded convolution kernel data, respectively, and the digits after D and C denote the data of different moments. As an example, the expanded feature map data of each moment correspond to one column of the unrolled feature map matrix shown in Fig. 3, and the convolution kernel data of each moment correspond to one row of the unrolled convolution kernel data shown in Fig. 3.
At moment T0, D0 and C0 are both sent into systolic unit 0 to compute D0·C0. At moment T1, D1 is sent to systolic unit 0 and C1 to systolic unit 1, while D0 propagates from systolic unit 0 to systolic unit 1; now systolic unit 0 computes D1·C0 and systolic unit 1 computes D0·C1. At moment T2, D2 is sent to systolic unit 0 and C2 to systolic unit 2, while D0 propagates from systolic unit 1 to systolic unit 2 and D1 from systolic unit 0 to systolic unit 1; now systolic unit 0 computes D2·C0, systolic unit 1 computes D1·C1, and systolic unit 2 computes D0·C2; and so on.
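The T0/T1/T2 schedule above generalizes as: at cycle t, unit u holds feature column D[t-u] and multiplies it by its resident kernel row C[u]. A toy model of this HH-format schedule (hypothetical code, for illustration only):

```python
def systolic_hh(C_rows, D_cols):
    """Model of the Fig. 6 schedule: column D[k] enters the top unit at
    cycle k and moves down one unit per cycle, so at cycle t unit u holds
    D[t-u] and multiplies it by its resident kernel row C[u]."""
    n_units, n_cols = len(C_rows), len(D_cols)
    out = [[None] * n_cols for _ in range(n_units)]
    for t in range(n_cols + n_units - 1):
        for u in range(n_units):
            k = t - u                    # index of the column unit u holds
            if 0 <= k < n_cols:
                out[u][k] = sum(c * d for c, d in zip(C_rows[u], D_cols[k]))
    return out                           # out[u][k] = C[u] . D[k]
```

Note that out[u][k] is produced at cycle u + k; this diagonal skew is exactly what the timing alignment module removes later.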
As described above, in the HH format each systolic unit outputs a complete dot-product result, whereas when the precision of the feature map data and convolution kernel data is higher than N bits, each systolic unit outputs a partial result, and only by combining systolic units (two in the CC format) can the dot-product operation on the expanded feature map data and expanded convolution kernel data be completed.
Specifically, when the precision of the feature map data and convolution kernel data is higher than N bits, each piece of expanded feature map data is written synchronously into the topmost M systolic units and propagates systolically downward in groups of M systolic units; each piece of expanded convolution kernel data is interleaved to obtain M pieces of interleaved convolution kernel data, which are written into the systolic units by moment, the M pieces of interleaved convolution kernel data being written at each moment into M systolic units, respectively. Each of the M systolic units performs a dot-product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and outputs the partial result.
As an example, the interleaving performed on the convolution kernel data comprises: splitting the expanded convolution kernel data into M N-bit complement codes, and copying each N-bit complement code M times to obtain the interleaved convolution kernel data. Operating in this manner, the partial results output by the M systolic units can constitute a complete result.
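A sketch of this interleaving (hypothetical; kernel_codes is the MSB-first code list produced by the expand() helper sketched earlier):

```python
def interleave(kernel_codes):
    """Replicate each of the M 5-bit codes of an expanded kernel value M
    times; the i-th interleaved word goes to the i-th of the M adjacent
    systolic units."""
    m = len(kernel_codes)
    # kernel_codes is MSB-first, so reverse it so that word 0 carries the
    # lowest-weight code, matching C0_0/2 = {C0[4:0], C0[4:0]} below
    return [[code] * m for code in reversed(kernel_codes)]
```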
For example, in the CC format, as shown in Fig. 7, 8-bit feature map data and 8-bit convolution kernel data are expanded into 10-bit expanded feature map data D and 10-bit expanded convolution kernel data C, respectively; D is split into two 5-bit data D[9:5] and D[4:0], and C is split into two 5-bit data C[9:5] and C[4:0]. Then D×C = (2^4×D[9:5]+D[4:0]) × (2^4×C[9:5]+C[4:0]) = 2^4×C[9:5]×(2^4×D[9:5]+D[4:0]) + C[4:0]×(2^4×D[9:5]+D[4:0]);
where C[9:5]×(2^4×D[9:5]+D[4:0]) is computed by one of the two adjacent systolic units to give a first partial result, and C[4:0]×(2^4×D[9:5]+D[4:0]) is computed by the other to give a second partial result; finally the summation submodule 122 adds the first partial result and the second partial result to obtain the complete result.
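The decomposition can be checked numerically with the earlier hypothetical helpers expand() and to_signed5(); the left shift by 4 that the summation submodule applies when combining the two partial results corresponds to the 2^4 factor:

```python
def cc_product(d, c):
    """Verify the CC split: two partial products recombine to d * c."""
    d_hi, d_lo = (to_signed5(x) for x in expand(d, 8))
    c_hi, c_lo = (to_signed5(x) for x in expand(c, 8))
    d_ext = (d_hi << 4) + d_lo      # equals d
    upper = c_hi * d_ext            # partial result of one systolic unit
    lower = c_lo * d_ext            # partial result of the adjacent unit
    return (upper << 4) + lower     # equals d * c

assert all(cc_product(d, c) == d * c
           for d in range(-128, 128) for c in range(-128, 128))
```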
As shown in Fig. 8, the expanded feature map data D in the CC format is written into systolic unit 0 and systolic unit 1 and propagates downward between pairs of adjacent systolic units, from the two systolic units of one group to the two systolic units of the next group. The expanded convolution kernel data C, after interleaving, is written in order into two adjacent systolic units of the different groups.
In Fig. 8, the digits after D and C denote the expanded feature map data and expanded convolution kernel data of different moments, and the subscripts denote the result of interleaving the expanded convolution kernel data: C0_0/2 and C0_1/2 denote the results of even and odd interleaving of the convolution kernel data at moment T0, respectively. When the expanded convolution kernel data C0 is interleaved, C0[9:5] and C0[4:0] are each copied twice, giving C0_0/2 = {C0[4:0], C0[4:0]} and C0_1/2 = {C0[9:5], C0[9:5]}; afterwards, C0_0/2 and C0_1/2 are written into systolic unit 0 and systolic unit 1, respectively.
Thus at moment T0, D0 is sent to systolic units 0 and 1, C0_0/2 is sent to systolic unit 0, and C0_1/2 to systolic unit 1; systolic unit 0 computes D0·C0_0/2 and systolic unit 1 computes D0·C0_1/2, and the two added together constitute D0·C0.
At moment T1, D1 is sent to systolic units 0 and 1, C1_0/2 to systolic unit 2, and C1_1/2 to systolic unit 3, while D0 propagates from systolic units 0 and 1 to systolic units 2 and 3; now systolic unit 0 computes D1·C0_0/2 and systolic unit 1 computes D1·C0_1/2, which together constitute D1·C0; systolic unit 2 computes D0·C1_0/2 and systolic unit 3 computes D0·C1_1/2, which together constitute D0·C1.
At moment T2, D2 is sent to systolic units 0 and 1, C2_0/2 to systolic unit 4, and C2_1/2 to systolic unit 5, while D1 propagates from systolic units 0 and 1 to systolic units 2 and 3, and D0 from systolic units 2 and 3 to systolic units 4 and 5; now systolic unit 0 computes D2·C0_0/2 and systolic unit 1 computes D2·C0_1/2, together constituting D2·C0; systolic unit 2 computes D1·C1_0/2 and systolic unit 3 computes D1·C1_1/2, together constituting D1·C1; systolic unit 4 computes D0·C2_0/2 and systolic unit 5 computes D0·C2_1/2, together constituting D0·C2; and so on.
As for the SS format, the partial results output by 4 systolic units are needed to constitute the complete dot-product result of the expanded feature map data and expanded convolution kernel data.
Referring to Fig. 9, at moment T0 the expanded feature map data D is written into systolic units 0, 1, 2, and 3 and propagates downward among groups of 4 adjacent systolic units, from systolic units 0, 1, 2, and 3 to systolic units 4, 5, 6, and 7, respectively. The expanded convolution kernel data C, after interleaving, is written in order into the four systolic units of the different groups.
When the expanded convolution kernel data C is interleaved, the codes C[19:15], C[14:10], C[9:5], and C[4:0] split out of C are each copied four times, giving: C_0/4 = {C[4:0], C[4:0], C[4:0], C[4:0]}; C_1/4 = {C[9:5], C[9:5], C[9:5], C[9:5]}; C_2/4 = {C[14:10], C[14:10], C[14:10], C[14:10]}; C_3/4 = {C[19:15], C[19:15], C[19:15], C[19:15]}, which are written into the four adjacent systolic units, respectively.
In summary, in the HH format each systolic unit outputs the complete result of the dot product of the expanded feature map data and the expanded convolution kernel data, but in the CC and SS formats each systolic unit outputs only a partial result of that dot product; the summation submodule 122 must further compress and saturate the partial results once, combining the partial results into complete results.
In one embodiment, the summation submodule 122 comprises multiple add units, the number of add units being one quarter of the number of systolic units. When the systolic array 121 comprises 128 systolic units, there are 32 add units; each add unit receives the partial results output by 4 systolic units and compresses and saturates the partial results according to the current data format, finally producing complete results unified to 16 bits.
An exemplary structure of a single add unit is shown in Fig. 10. The add unit receives, from systolic units 4n through 4n+3, the partial results in the CC and SS formats and the complete results in the HH format, and processes the data differently for the different formats. In the SS format, since the partial results output by the four systolic units together constitute a complete result, two adders first add the partial results of systolic units 4n and 4n+1 and of systolic units 4n+2 and 4n+3, respectively: the data of systolic unit 4n+1 is shifted left by 4 bits and added to the data of systolic unit 4n, giving the computation result ss1, and the data of systolic unit 4n+3 is shifted left by 4 bits and added to the data of systolic unit 4n+2, giving the computation result ss2, where ss1 is the low-order half of the computation and ss2 the high-order half. The results ss1 and ss2 pass through a selector at different moments and, after pipelining, ss2 is shifted left by 8 bits and added to ss1; saturation then yields the final complete result (i.e., the final computation result), which in the SS format is 16 bits. In the CC format, since the partial results output by each pair of systolic units together constitute a complete result, two adders add the partial results of systolic units 4n and 4n+1 and of systolic units 4n+2 and 4n+3, respectively, and saturation yields two complete results: the data of systolic unit 4n+1 is shifted left by 4 bits and added to the data of systolic unit 4n, giving cc1, which after saturation is a complete 8-bit result in the CC format; the data of systolic unit 4n+3 is shifted left by 4 bits and added to the data of systolic unit 4n+2, giving cc2, which after saturation is likewise a complete 8-bit result in the CC format. In the HH format, the result output by each systolic unit, after saturation, is a complete 4-bit result. Finally, the add unit outputs one 16-bit complete result in the SS format, two 8-bit complete results in the CC format, or four 4-bit complete results in the HH format.
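A sketch of the add unit's combination logic under the stated assumptions (hypothetical code; p0 through p3 are the raw outputs of systolic units 4n through 4n+3):

```python
def saturate(x, bits):
    """Clamp x to the signed range of the given bit width."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, x))

def add_unit(p0, p1, p2, p3, fmt):
    """Combine partial results as described above: shift-add within each
    pair, a second shift-add for SS, then saturate to 16/8/4 bits."""
    if fmt == "SS":
        ss1 = (p1 << 4) + p0                 # low-order half
        ss2 = (p3 << 4) + p2                 # high-order half
        return saturate((ss2 << 8) + ss1, 16)
    if fmt == "CC":
        cc1 = saturate((p1 << 4) + p0, 8)
        cc2 = saturate((p3 << 4) + p2, 8)
        return cc1, cc2
    # HH: each systolic unit's output is already a complete result
    return tuple(saturate(p, 4) for p in (p0, p1, p2, p3))
```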
The complete results output by the add units are not aligned in timing: the upper add units always output earlier than the lower ones, so the timing alignment module 130 must triangularize the results output by the add units to align their timing. As shown in Fig. 11, the expanded feature map data enter the systolic array 121 at different times under HH, CC, and SS, and the triangularization differs accordingly.
Specifically, for the HH format, since the expanded feature map data propagate through the systolic array 121 in units of one systolic unit and each add unit outputs four complete results, the timing alignment module 130 delays the four HH-format complete results output by each add unit successively by one computation cycle each. For the CC format, since the expanded feature map data propagate in units of two systolic units and each add unit outputs two complete results, the timing alignment module 130 delays the lower two CC-format outputs of each add unit by one computation cycle relative to the upper two. For the SS format, since the expanded feature map data propagate in units of four systolic units and each add unit outputs one complete result, the timing alignment module 130 delays the SS-format complete result output by each add unit by one computation cycle relative to that output by the add unit above it.
Based on the above description, the convolution computation device 100 according to the embodiments of the present invention uses a systolic array for convolution computation, saving loading time for feature map data and convolution kernel data, increasing the computational density of the convolution, and speeding up neural network inference. Moreover, the convolution computation device 100 is applicable to multiple precisions and has high flexibility and compatibility.
Fig. 12 shows a flowchart of a convolution computation method 1200 according to an embodiment of the present invention. The convolution computation method 1200 may be implemented by the convolution computation device 100 described above. Only the main steps of the convolution computation method 1200 are described below; for further details, refer to the description above.
As shown in Fig. 12, the convolution computation method 1200 comprises the following steps:
Step S1210: obtain feature map data and convolution kernel data, and expand the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer;

Step S1220: perform a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain an operation result;

Step S1230: align the timing of the operation result, and output the aligned operation result.
The convolution computation method 1200 provided by the embodiments of the present invention adopts a systolic mode of operation and exploits the weight-sharing and feature-map-sharing properties of convolutional neural networks: data loaded once can be used for multiple computations, and the precision is dynamically configurable. Illustratively, in the convolution computation method 1200 the minimum bit width at which the feature map data and convolution kernel data undergo convolution multiplication is 5 bits; the method is applicable to 4-bit, 8-bit, or 16-bit precision, with the data expanded to 5, 10, or 20 bits, respectively, for the operation.
In one embodiment, in step S1210, after the feature map and convolution kernels are obtained, to facilitate hardware computation the feature map and convolution kernels are first unrolled to obtain the feature map data and convolution kernel data, converting the convolution operation into matrix multiplication. After the conversion, the data needed for the operation can be stored in contiguous memory, improving access speed.
Afterwards, the obtained feature map data and convolution kernel data are expanded so that the bit width of each piece of feature map data and convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and convolution kernel data undergo convolution multiplication.
In one embodiment, the expansion is as follows: if the precision of the feature map data and convolution kernel data is lower than N bits, each piece of feature map data and each piece of convolution kernel data is padded so as to be expanded to N bits; if the precision is higher than N bits, each piece of feature map data and each piece of convolution kernel data is divided into M complement codes of equal length, and each complement code is padded separately so as to be expanded to N bits.
For example, if N equals 5: feature map data and convolution kernel data at 4-bit precision are expanded to 5 bits; at 8-bit precision, to 10 bits; and at 16-bit precision, to 20 bits.
Further, the padding of the complement codes is as follows: if the precision of the feature map data and convolution kernel data is lower than N bits, the feature map data and convolution kernel data are sign-bit padded; if the precision is higher than N bits, the most significant complement code of the feature map data and convolution kernel data is sign-bit padded, and the remaining M-1 complement codes are padded with 0.
For example, for the HH (4-bit), CC (8-bit), and SS (16-bit) precision formats, the padding is as follows:

1) In the HH format, each piece of feature map data (data) and each piece of convolution kernel data (coeff) is sign-bit padded, i.e., data[3:0] becomes {data[3], data[3:0]} and coeff[3:0] becomes {coeff[3], coeff[3:0]};

2) In the CC format, the low 4 bits of each piece of feature map data (data) and convolution kernel data (coeff) are 0-padded and the high 4 bits are sign-bit padded, i.e., data[7:0] becomes {data[7], data[7:4], 0, data[3:0]} and coeff[7:0] becomes {coeff[7], coeff[7:4], 0, coeff[3:0]};

3) In the SS format, the most significant group of 4 bits of each piece of feature map data (data) and convolution kernel data (coeff) is sign-bit padded and the remaining three groups are 0-padded, i.e., data[15:0] becomes {data[15], data[15:12], 0, data[11:8], 0, data[7:4], 0, data[3:0]} and coeff[15:0] becomes {coeff[15], coeff[15:12], 0, coeff[11:8], 0, coeff[7:4], 0, coeff[3:0]}.
After the splitting and expansion, feature map data and convolution kernel data of every precision are uniformly expanded to M times the minimum convolution bit width N, and a corresponding number of systolic units can subsequently be invoked according to the value of M for the computation; there is no need to uniformly expand data of all precisions to the highest precision, achieving compatibility with multi-precision data, as detailed below.
Next, in step S1220, the systolic operation is completed on the basis of a systolic array. The systolic array exploits the weight-sharing and feature-map-sharing properties of convolutional neural networks to complete the convolution computation: data loaded once can be used for multiple computations, which saves loading time for feature map data and convolution kernel data, increases the computational density of the convolution, and speeds up neural network inference.
Further, the systolic array comprises multiple systolic units; the number of systolic units is generally set to a multiple of 4, for example, 128.
In one embodiment of the present application, the elements that propagate systolically in the systolic array are the expanded feature map data, not the expanded convolution kernel data; the convolution kernel data are loaded in order into each systolic unit to be multiplied with the expanded feature map data, and are not propagated systolically.
Specifically, the expanded feature map data are written at each moment into the topmost systolic unit of the systolic array and propagate systolically downward from the topmost systolic unit; the expanded convolution kernel data are written into the systolic units in sequence by moment. The systolic unit performs a dot-product operation on the expanded feature map data and expanded convolution kernel data received at the current moment, and outputs the dot-product result.
The expanded convolution kernel data are broadcast into every systolic unit and loaded at the appropriate moment, while the expanded feature map data are always written into the topmost one or more systolic units and propagate systolically toward the bottommost systolic unit; in each computation cycle, one dot product is performed on the expanded feature map data and expanded convolution kernel data within the current systolic unit. If the systolic operation runs for 128 computation cycles, each systolic unit outputs 128 results, and the 128 systolic units output 128×128 results in total. If fewer than 128 systolic units are needed, only some systolic units are activated and the rest do not participate in the operation.
In one embodiment, the systolic unit comprises a multi-stage pipeline, where the first pipeline stage comprises multiple multipliers and the second through last pipeline stages each comprise a number of adders. The systolic unit multiplies the expanded feature map data and the expanded convolution kernel data, and performs multi-stage addition on the multiplication results in turn.
When the precision of the feature map data and convolution kernel data is lower than N bits, each multiplier in the first pipeline stage performs a one-to-one multiplication of the expanded convolution kernel data and the expanded feature map data; for example, in the HH format each multiplier receives one 5-bit feature map datum and one 5-bit convolution kernel datum for multiplication. When the precision is higher than N bits, each multiplier performs a one-to-one multiplication of one N-bit complement code from the expanded convolution kernel data and one N-bit complement code from the expanded feature map data; for example, in the CC or SS format each multiplier receives one 5-bit complement code split out of the expanded feature map data and one 5-bit complement code split out of the expanded convolution kernel data for multiplication.
As described above, the expanded feature map data and expanded convolution kernel data are expanded to M times the minimum convolution bit width N, and each systolic unit performs convolution at the minimum bit width N. When the precision of the feature map data and convolution kernel data is lower than N bits, each individual systolic unit performs the dot-product operation on the expanded feature map data and expanded convolution kernel data, and the dot-product result is a complete result; when the precision is higher than N bits, each individual systolic unit completes part of the dot-product operation and outputs a partial result, and the partial results output by M adjacent systolic units together constitute the complete result of one dot-product operation.
For example, in the HH (4-bit) format, each systolic unit completes the dot product of one row of expanded convolution kernel data and one column of expanded feature map data and outputs the complete dot-product result; in the CC (8-bit) format, two adjacent systolic units jointly complete that dot product; and in the SS (16-bit) format, four adjacent systolic units jointly complete that dot product.
Specifically, in the HH format the expanded feature map data are always sent at each moment to the topmost systolic unit of the systolic array and propagate systolically from the upper units to the lower units; the expanded convolution kernel data are written into the systolic units in order, and each systolic unit outputs a complete dot-product result.
When the precision of the feature map data and convolution kernel data is higher than N bits, each systolic unit outputs a partial result, and only by combining systolic units (two in the CC format) can the dot-product operation on the expanded feature map data and expanded convolution kernel data be completed.
Specifically, when the precision of the feature map data and convolution kernel data is higher than N bits, each piece of expanded feature map data is written synchronously into the topmost M systolic units and propagates systolically downward in groups of M systolic units; each piece of expanded convolution kernel data is interleaved to obtain M pieces of interleaved convolution kernel data, which are written into the systolic units by moment, the M pieces of interleaved convolution kernel data being written at each moment into M systolic units, respectively. Each of the M systolic units performs a dot-product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and outputs the partial result.
As an example, the interleaving performed on the convolution kernel data comprises: splitting the expanded convolution kernel data into M N-bit complement codes, and copying each N-bit complement code M times to obtain the interleaved convolution kernel data. Operating in this manner, the partial results output by the M systolic units can constitute a complete result.
For example, in the CC format, 8-bit feature map data and 8-bit convolution kernel data are expanded into 10-bit expanded feature map data D and 10-bit expanded convolution kernel data C, respectively; D is split into two 5-bit data D[9:5] and D[4:0], and C into two 5-bit data C[9:5] and C[4:0]. Then D×C = (2^4×D[9:5]+D[4:0]) × (2^4×C[9:5]+C[4:0]) = 2^4×C[9:5]×(2^4×D[9:5]+D[4:0]) + C[4:0]×(2^4×D[9:5]+D[4:0]);

where C[9:5]×(2^4×D[9:5]+D[4:0]) is computed by one of the two systolic units to give a first partial result, and C[4:0]×(2^4×D[9:5]+D[4:0]) is computed by the other to give a second partial result; the first and second partial results must then be added to obtain the complete result.
As for the SS format, the partial results output by 4 systolic units must be combined to obtain the complete dot-product result of the expanded feature map data and expanded convolution kernel data. Specifically, at each moment the expanded feature map data are written into 4 adjacent systolic units and propagate downward in groups of 4 systolic units, while the expanded convolution kernel data are interleaved and written in order into the four systolic units of the different groups. The partial results output by the four systolic units of each group together constitute a complete result.
In summary, in the HH format each systolic unit outputs the complete result of the dot product of the expanded feature map data and the expanded convolution kernel data, but in the CC and SS formats each systolic unit outputs only a partial result of that dot product; the partial results must be further compressed and saturated once, combining the partial results into complete results.
Since the complete results obtained by summation are not aligned in timing, the results output by the add units must be triangularized to align their timing. For the specific manner of timing alignment, refer to the description above, which is not repeated here.
Based on the above description, the convolution computation method 1200 according to the embodiments of the present invention uses a systolic array for convolution computation, saving loading time for feature map data and convolution kernel data, increasing the computational density of the convolution, and speeding up neural network inference. Moreover, the convolution computation method 1200 is applicable to multiple precisions and has high flexibility and compatibility.
In addition, an embodiment of the present invention further provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the convolution computation method 1200 shown in Fig. 12 can be implemented. For example, the computer storage medium is a computer-readable storage medium. The computer storage medium may include, for example, a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions stored on the computer storage medium, when run by a computer or processor, cause the computer or processor to perform the following steps:

obtaining feature map data and convolution kernel data, and expanding the feature map data and the convolution kernel data according to their precision, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication and M is a positive integer;

performing a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain an operation result;

aligning the timing of the operation result, and outputting the aligned operation result.
In addition, an embodiment of the present invention further provides a computer program product containing instructions that, when executed by a computer, cause the computer to perform the steps of the convolution computation method 1200 shown in Fig. 12.
In summary, the convolution computation method, convolution computation device, and computer storage medium of the embodiments of the present invention use a systolic array for convolution computation, saving loading time for feature map data and convolution kernel data, increasing the computational density of the convolution, and speeding up neural network inference. Moreover, the above convolution computation device and method are applicable to multiple precisions and have high flexibility and compatibility.
In the above embodiments, the implementation may be wholly or partly by software, hardware, firmware, or any other combination. When implemented by software, it may be wholly or partly implemented in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementation should not be considered beyond the scope of the present invention.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific implementations of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, which shall all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.
Although example embodiments have been described here with reference to the drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art can make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.
In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not implemented.
The specification provided here describes numerous specific details. However, it can be understood that embodiments of the present invention can be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the present invention and aid in understanding one or more of the various inventive aspects, in the description of exemplary embodiments of the present invention the various features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of the invention should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point lies in that the corresponding technical problem can be solved with fewer than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that, except where features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described here include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the present invention and form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules according to the embodiments of the present invention. The present invention may also be implemented as a device program (for example, a computer program and a computer program product) for executing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The present invention can be implemented by means of hardware comprising several different elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third, etc., does not indicate any order; these words may be interpreted as names.
The above are only specific implementations of the present invention or descriptions of specific implementations, and the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed by the present invention, which shall all be covered within the scope of protection of the present invention. The scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (35)

  1. A convolution computation device, wherein the convolution computation device comprises:
    an expansion module, configured to obtain feature map data and convolution kernel data, and to expand the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication, and M is a positive integer;
    a systolic operation module, configured to perform a systolic operation on the expanded convolution kernel data and the expanded feature map data, and to output an operation result;
    a timing alignment module, configured to align the timing of the operation result output by the systolic operation module, and to output the aligned operation result.
  2. The convolution computation device of claim 1, wherein the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is lower than N bits, padding each piece of the feature map data and each piece of the convolution kernel data so as to expand each piece of the feature map data and each piece of the convolution kernel data to N bits;
    if the precision of the feature map data and the convolution kernel data is higher than N bits, dividing each piece of the feature map data and each piece of the convolution kernel data into M complement codes of equal length, and padding each complement code separately so as to expand each complement code to N bits.
  3. The convolution computation device of claim 2, wherein the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is lower than N bits, sign-bit padding the feature map data and the convolution kernel data;
    if the precision of the feature map data and the convolution kernel data is higher than N bits, sign-bit padding the most significant complement code of the feature map data and the convolution kernel data, and padding the remaining M-1 complement codes with 0.
  4. The convolution computation device of any one of claims 1-3, wherein the precision of the feature map data and the convolution kernel data comprises one of 4 bits, 8 bits, or 16 bits, the value of N is 5, and the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is 4 bits, expanding the feature map data and the convolution kernel data to 5 bits;
    if the precision of the feature map data and the convolution kernel data is 8 bits, expanding the feature map data and the convolution kernel data into 2 complement codes of 5 bits;
    if the precision of the feature map data and the convolution kernel data is 16 bits, expanding the feature map data and the convolution kernel data into 4 complement codes of 5 bits.
  5. The convolution computation device of claim 1, wherein the systolic operation module comprises a systolic array, the systolic array comprises multiple systolic units, and the systolic operation comprises:
    writing the expanded feature map data at each moment into the topmost systolic unit, and propagating it systolically downward from the topmost systolic unit;
    writing the expanded convolution kernel data into the systolic units in sequence by moment;
    the systolic unit being configured to perform a dot-product operation on the expanded feature map data and the expanded convolution kernel data received at the current moment, and to output a dot-product result.
  6. The convolution computation device of claim 5, wherein the expanded convolution kernel is sent into the systolic units by broadcast.
  7. The convolution computation device of claim 5, wherein when the precision of the feature map data and the convolution kernel data is lower than N bits, each individual systolic unit is configured to perform the dot-product operation on the expanded feature map data and the expanded convolution kernel data, the dot-product result being a complete result.
  8. The convolution computation device of claim 5, wherein when the precision of the feature map data and the convolution kernel data is higher than N bits, each individual systolic unit completes part of the dot-product operation on the expanded feature map data and the expanded convolution kernel data and outputs a partial result, the partial results output by M adjacent systolic units together constituting the complete result of one dot-product operation.
  9. The convolution computation device of claim 8, wherein each individual systolic unit completing part of the dot-product operation on the expanded feature map data and the expanded convolution kernel data comprises:
    writing each piece of expanded feature map data synchronously into the topmost M systolic units, and propagating it systolically downward in units of M systolic units;
    interleaving each piece of expanded convolution kernel data to obtain M pieces of interleaved convolution kernel data;
    writing the M pieces of interleaved convolution kernel data into the systolic units by moment, wherein at each moment the M pieces of interleaved convolution kernel data are written into M systolic units, respectively;
    each of the M systolic units being configured to perform a dot-product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and to output the partial result.
  10. The convolution computation device of claim 9, wherein the interleaving comprises:
    splitting the expanded convolution kernel data into M N-bit complement codes;
    copying each of the N-bit complement codes M times to obtain the interleaved convolution kernel data.
  11. The convolution computation device of any one of claims 5-10, wherein the systolic array comprises 128 of the systolic units.
  12. The convolution computation device of any one of claims 5-11, wherein the systolic unit comprises a multi-stage pipeline, the first pipeline stage of the multi-stage pipeline comprising multiple multipliers, and the second through last pipeline stages of the multi-stage pipeline each comprising a number of adders, each adder receiving two output data of the previous pipeline stage and adding them.
  13. The convolution computation device of claim 12, wherein when the precision of the feature map data and the convolution kernel data is lower than N bits, each multiplier is configured to perform a one-to-one multiplication of the expanded convolution kernel data and the expanded feature map data.
  14. The convolution computation device of claim 12, wherein when the precision of the feature map data and the convolution kernel data is higher than N bits, each multiplier is configured to perform a one-to-one multiplication of one N-bit complement code of the expanded convolution kernel data and one N-bit complement code of the expanded feature map data.
  15. The convolution computation device of any one of claims 12-14, wherein the systolic unit comprises an 8-stage pipeline, the first pipeline stage comprising 128 5-bit signed multipliers, and the second through eighth pipeline stages comprising 64, 32, 16, 8, 4, 2, and 1 adders, respectively.
  16. The convolution computation device of claim 14, wherein when the precision of the convolution kernel data and the feature map data is 8 bits, the left input of each adder in the second pipeline stage is shifted left by 4 bits; when the precision of the convolution kernel data and the feature map data is 16 bits, the left input of each adder in the second pipeline stage is shifted left by 4 bits, and the left input of each adder in the third pipeline stage is shifted left by 8 bits.
  17. The convolution computation device of any one of claims 8-10, wherein the systolic operation module further comprises a summation submodule, configured to compress the partial results output by the systolic units to obtain complete results.
  18. The convolution computation device of claim 17, wherein the systolic array comprises 128 systolic units, the summation submodule comprises 32 add units, and each add unit is configured to receive the partial results respectively output by 4 of the systolic units.
  19. A convolution computation method, wherein the convolution computation method comprises:
    obtaining feature map data and convolution kernel data, and expanding the feature map data and the convolution kernel data according to the precision of the feature map data and the convolution kernel data, so that the bit width of each piece of the feature map data and the convolution kernel data is expanded to M times N, where N is the minimum bit width at which the feature map data and the convolution kernel data undergo convolution multiplication, and M is a positive integer;
    performing a systolic operation on the expanded convolution kernel data and the expanded feature map data to obtain an operation result;
    aligning the timing of the operation result, and outputting the aligned operation result.
  20. The convolution computation method of claim 19, wherein the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is lower than N bits, padding the feature map data and the convolution kernel data so as to expand the feature map data and the convolution kernel data to N bits;
    if the precision of the feature map data and the convolution kernel data is higher than N bits, dividing the feature map data and the convolution kernel data into M complement codes of equal length, and padding each complement code separately so as to expand each complement code to N bits.
  21. The convolution computation method of claim 20, wherein the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is lower than N bits, sign-bit padding the feature map data and the convolution kernel data;
    if the precision of the feature map data and the convolution kernel data is higher than N bits, sign-bit padding the most significant complement code of the feature map data and the convolution kernel data, and padding the remaining M-1 complement codes with 0.
  22. The convolution computation method of any one of claims 19-21, wherein the precision of the feature map data and the convolution kernel data comprises 4 bits, 8 bits, or 16 bits, the value of N is 5, and the expanding of the feature map data and the convolution kernel data according to their precision comprises:
    if the precision of the feature map data and the convolution kernel data is 4 bits, expanding the feature map data and the convolution kernel data to 5 bits;
    if the precision of the feature map data and the convolution kernel data is 8 bits, expanding the feature map data and the convolution kernel data to 10 bits;
    if the precision of the feature map data and the convolution kernel data is 16 bits, expanding the feature map data and the convolution kernel data to 20 bits.
  23. The convolution computation method of claim 19, wherein the systolic operation comprises:
    writing the expanded feature map data at each moment into the topmost systolic unit of a systolic array, and propagating it systolically downward from the topmost systolic unit;
    writing the expanded convolution kernel data into the systolic units in sequence by moment;
    performing, by the systolic unit, a dot-product operation on the expanded feature map data and the expanded convolution kernel data received at the current moment, and outputting a dot-product result.
  24. The convolution computation method of claim 23, wherein when the precision of the feature map data and the convolution kernel data is lower than N bits, each individual systolic unit is configured to perform the dot-product operation on the expanded feature map data and the expanded convolution kernel data, the dot-product result being a complete result.
  25. The convolution computation method of claim 23, wherein when the precision of the feature map data and the convolution kernel data is higher than N bits, each individual systolic unit completes part of the dot-product operation on the expanded feature map data and the expanded convolution kernel data and outputs a partial result, M adjacent systolic units jointly completing one dot-product operation on the expanded feature map data and the expanded convolution kernel data.
  26. The convolution computation method of claim 25, wherein each individual systolic unit completing part of the dot-product operation on the expanded feature map data and the expanded convolution kernel data comprises:
    writing each piece of expanded feature map data synchronously into the topmost M systolic units, and propagating it systolically downward in units of M systolic units;
    interleaving each piece of expanded convolution kernel data to obtain M pieces of interleaved convolution kernel data;
    writing the M pieces of interleaved convolution kernel data into the systolic units by moment, wherein at each moment the M pieces of interleaved convolution kernel data are written into M systolic units, respectively;
    each of the M systolic units performing a dot-product operation on the expanded feature map data and the interleaved convolution kernel data written at the current moment, and outputting the partial result.
  27. The convolution computation method of claim 26, wherein the expanded convolution kernel is sent into the systolic units by broadcast.
  28. The convolution computation method of claim 26, wherein the interleaving comprises:
    splitting the expanded convolution kernel data into M N-bit complement codes;
    copying each of the N-bit complement codes M times.
  29. The convolution computation method of any one of claims 23-28, wherein the dot-product operation comprises:
    multiplying the expanded feature map data and the expanded convolution kernel data;
    performing multi-stage addition on the multiplication results in turn.
  30. The convolution computation method of claim 29, wherein when the precision of the feature map data and the convolution kernel data is lower than N bits, the multiplication is a one-to-one multiplication performed on the expanded convolution kernel data and the expanded feature map data.
  31. The convolution computation method of claim 29, wherein when the precision of the feature map data and the convolution kernel data is higher than N bits, the multiplication is a one-to-one multiplication performed on one N-bit complement code of the expanded convolution kernel data and one N-bit complement code of the expanded feature map data.
  32. The convolution computation method of any one of claims 29-30, wherein the multi-stage addition is a seven-stage addition.
  33. The convolution computation method of claim 32, wherein when the precision of the convolution kernel data and the feature map data is 8 bits, the left input at the first addition stage is shifted left by 4 bits; when the precision of the convolution kernel data and the feature map data is 16 bits, the left input at the first addition stage is shifted left by 4 bits, and the left input at the second addition stage is shifted left by 8 bits.
  34. The convolution computation method of any one of claims 25-28, wherein the method further comprises: compressing the partial results output by the systolic units to obtain complete results.
  35. A computer storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the steps of the method of any one of claims 19 to 34 are implemented.
PCT/CN2020/089570 2020-05-11 2020-05-11 Convolution computation device and method, and computer storage medium WO2021226782A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080006263.7A 2020-05-11 2020-05-11 Convolution computation device and method, and computer storage medium
PCT/CN2020/089570 2020-05-11 2020-05-11 Convolution computation device and method, and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/089570 WO2021226782A1 (zh) 2020-05-11 2020-05-11 Convolution computation device and method, and computer storage medium

Publications (1)

Publication Number Publication Date
WO2021226782A1 true WO2021226782A1 (zh) 2021-11-18

Family

ID=76879254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089570 WO2021226782A1 (zh) 2020-05-11 2020-05-11 卷积计算装置、方法和计算机存储介质

Country Status (2)

Country Link
CN (1) CN113168429A (zh)
WO (1) WO2021226782A1 (zh)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897147B (zh) * 2022-05-18 2023-06-06 北京百度网讯科技有限公司 Backbone network generation method, apparatus, device, and storage medium


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647776A (zh) * 2018-05-08 2018-10-12 济南浪潮高新科技投资发展有限公司 Convolutional neural network convolution dilation processing circuit and method
US20200026497A1 (en) * 2018-07-23 2020-01-23 SK Hynix Inc. Computation circuit, computation device and system including the same
CN110163338A (zh) * 2019-01-31 2019-08-23 腾讯科技(深圳)有限公司 Chip operation method and apparatus with an operation array, terminal, and chip
CN109934339A (zh) * 2019-03-06 2019-06-25 东南大学 General-purpose convolutional neural network accelerator based on a one-dimensional systolic array
CN110543934A (zh) * 2019-08-14 2019-12-06 北京航空航天大学 Systolic array computing structure and method for convolutional neural networks

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114237551A (zh) * 2021-11-26 2022-03-25 南方科技大学 Systolic-array-based multi-precision accelerator and data processing method therefor
CN116781484A (zh) * 2023-08-25 2023-09-19 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium
CN116781484B (zh) * 2023-08-25 2023-11-07 腾讯科技(深圳)有限公司 Data processing method and apparatus, computer device, and storage medium

Also Published As

Publication number Publication date
CN113168429A (zh) 2021-07-23

Similar Documents

Publication Publication Date Title
WO2021226782A1 (zh) Convolution computation device and method, and computer storage medium
US20210124795A1 (en) Performing matrix multiplication in hardware
CN107340993B (zh) Operation device and method
US6609140B1 (en) Methods and apparatus for fast fourier transforms
CN110413254B (zh) Data processor, method, chip, and electronic device
CN110515589B (zh) Multiplier, data processing method, chip, and electronic device
US7543008B1 (en) Apparatus and method for providing higher radix redundant digit lookup tables for recoding and compressing function values
CN110362293B (zh) Multiplier, data processing method, chip, and electronic device
CN111045728B (zh) Computing device and related product
CN113076083B (zh) Data multiply-add operation circuit
CN111353598A (zh) Neural network compression method, electronic device, and computer-readable medium
CN115238863A (zh) Hardware acceleration method, system, and application for convolutional layers of a convolutional neural network
CN110515587B (zh) Multiplier, data processing method, chip, and electronic device
CN112256236A (zh) FFT circuit based on approximately constant-coefficient complex multipliers and implementation method
CN111930681A (zh) Computing device and related product
CN116205244B (zh) Digital signal processing structure
US11604973B1 (en) Replication of neural network layers
WO2021168644A1 (zh) Data processing apparatus, electronic device, and data processing method
JP2677969B2 (ja) Orthogonal transform device
WO2023116400A1 (zh) Vector operation method, vector operator, electronic device, and storage medium
CN111258544B (zh) Multiplier, data processing method, chip, and electronic device
US20220035890A1 (en) Time Domain Unrolling Sparse Matrix Multiplication System and Method
CN110647307B (zh) Data processor, method, chip, and electronic device
CN113031916A (zh) Multiplier, data processing method, apparatus, and chip
CN111382835A (zh) Neural network compression method, electronic device, and computer-readable medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20935027

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20935027

Country of ref document: EP

Kind code of ref document: A1