WO2023030061A1 - Convolution operation circuit and method, neural network accelerator and electronic device - Google Patents

Convolution operation circuit and method, neural network accelerator and electronic device

Info

Publication number
WO2023030061A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
sub
multiply
zero
register
Prior art date
Application number
PCT/CN2022/113849
Other languages
English (en)
French (fr)
Inventor
孙亚锋
Original Assignee
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2023030061A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
    • G06N 3/08 - Learning methods
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to the technical field of electronic devices, and in particular to a convolution operation circuit and method, a convolutional neural network accelerator, an electronic device, a computer storage medium, and a computer program product.
  • A convolutional neural network (CNN) is a class of feedforward neural network that includes convolution calculations and has a deep structure; it is one of the representative algorithms of deep learning.
  • In a convolutional neural network, the convolution operation is generally completed by laying out a Multiply Accumulate (MAC) array to perform the multiply-accumulate operations.
  • However, in the related art, the power consumption of MAC-array-based convolution operations in convolutional neural networks is relatively high.
  • According to various embodiments of the present application, a convolution operation circuit and method, a convolutional neural network accelerator, an electronic device, a computer storage medium, and a computer program product are provided.
  • In a first aspect, the embodiments of the present application provide a convolution operation circuit, including:
  • a storage module including at least one register;
  • a processing unit configured to segment an original convolution kernel to obtain at least one sub-convolution kernel, obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, and configure the corresponding register according to the weight type, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane; and
  • a multiply-accumulate array, connected to the processing unit and the register respectively, configured to respond to the configuration information of the configured register and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernel and the data to be convolved.
  • The above convolution operation circuit includes a processing unit, a storage module, and a multiply-accumulate array. The processing unit can segment the original convolution kernel to obtain at least one sub-convolution kernel, obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, and output corresponding configuration information according to the weight type to configure each register in the storage module, so that the multiply-accumulate array can, based on the configuration information of the registers, perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernel and the data to be convolved.
  • During the convolution operation, the multiply-accumulate array can respond to the configuration information of the registers and read only the non-zero weights in each weight plane, rather than the zero-value weights, and multiply-accumulate the read non-zero weights with the input feature map.
  • In this way, the data movement of zero-value weight elements in the weight plane is reduced, zero-skip processing is realized, and power consumption can be reduced. In addition, the circuit avoids adding an extra zero-skip circuit in the multiply-accumulate array, as in the related art, to determine whether each received weight element is a zero-value weight, which further removes the power overhead of the zero-skip circuit itself and, while saving power, further simplifies the structural design of the convolution operation circuit.
  • In a second aspect, the embodiments of the present application provide a convolution operation method, including: segmenting an original convolution kernel to obtain multiple sub-convolution kernels and obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane; configuring the configuration information of the corresponding registers in the storage module according to the weight type, so that the registers store the weight type of each weight plane in the sub-convolution kernels; and controlling the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
  • The convolution operation method provided in the present application is applied to the convolution operation circuit provided in the first aspect above and can likewise achieve the beneficial effects of that circuit.
  • In a third aspect, the embodiments of the present application provide a neural network accelerator, including:
  • a data storage module configured to store the original convolution kernel and the input element blocks of the data to be convolved; and
  • the aforementioned convolution operation circuit, which obtains the original convolution kernel and the input element blocks through the data storage module.
  • In a fourth aspect, the embodiments of the present application provide an electronic device, including:
  • a system bus; and the aforementioned neural network accelerator, which is connected to the system bus.
  • In a fifth aspect, the embodiments of the present application provide an electronic device including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the following steps: segmenting the original convolution kernel to obtain multiple sub-convolution kernels and obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel; configuring the configuration information of the corresponding registers in the storage module according to the weight type; and controlling the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
  • In a sixth aspect, the embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the same steps as above are implemented.
  • In a seventh aspect, the embodiments of the present application provide a computer program product on which a computer program is stored; when the computer program is executed by a processor, the same steps as above are implemented.
  • For the beneficial effects achievable by the neural network accelerator of the third aspect and the electronic device of the fourth aspect, reference may be made to the beneficial effects of the convolution operation circuit of the first aspect and any of its embodiments, which are not repeated here.
  • For the beneficial effects achievable by the electronic device of the fifth aspect, the computer-readable storage medium of the sixth aspect, and the computer program product of the seventh aspect, reference may be made to the beneficial effects of the convolution operation method of the second aspect, which are not repeated here.
  • FIG. 1 is a first schematic structural diagram of a convolution operation circuit in an embodiment;
  • FIG. 2 is a second schematic structural diagram of a convolution operation circuit in an embodiment;
  • FIG. 3 is a schematic diagram of segmenting an original convolution kernel into sub-convolution kernels in an embodiment;
  • FIG. 4 is a schematic diagram of the convolution operation between an input feature map and multiple sub-convolution kernels in an embodiment;
  • FIG. 5 is a third schematic structural diagram of a convolution operation circuit in an embodiment;
  • FIG. 6 is a schematic diagram of the convolution operation of a convolution operation circuit in an embodiment;
  • FIG. 7 is a fourth schematic structural diagram of a convolution operation circuit in an embodiment;
  • FIG. 8 is a fifth schematic structural diagram of a convolution operation circuit in an embodiment;
  • FIG. 9 is a schematic flowchart of a convolution operation method in an embodiment;
  • FIG. 10 is a schematic flowchart of controlling a multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved, in an embodiment;
  • FIG. 11 is a schematic flowchart of a convolution operation method in another embodiment;
  • FIG. 12 is a structural block diagram of a neural network accelerator in an embodiment;
  • FIG. 13 is a structural block diagram of an electronic device in an embodiment.
  • The terms "first" and "second" used in this application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, features defined as "first" and "second" may explicitly or implicitly include at least one such feature.
  • The terms "first", "second", and the like may be used herein to describe various elements, but these elements are not limited by these terms; the terms are only used to distinguish one element from another.
  • "Plurality" means at least two, such as two, three, and so on, unless otherwise specifically defined.
  • "Several" means at least one, such as one, two, and so on, unless otherwise specifically defined.
  • In a neural network processor or neural network accelerator, a MAC array is selected to perform convolution processing, for example multiply-accumulate processing, on the convolution kernel and the input feature map.
  • In the related art, each MAC unit in the MAC array is provided with an additional hardware zero-skip circuit that performs zero-skip processing on the weights in the sub-convolution kernel and the data in the input feature map, skipping the multiply-accumulate operation of any MAC unit whose weight or input data is zero so as to avoid toggling that MAC unit.
  • However, the hardware zero-skip circuit of each MAC unit also consumes power during operation, which increases the overall power consumption of the multiply-accumulate operation.
  • For these reasons, the present application provides a convolution operation circuit.
  • During the convolution operation, its multiply-accumulate array reads only the non-zero weights in each weight plane of each sub-convolution kernel, does not read the zero-value weights in the weight plane, and multiply-accumulates the read non-zero weights with the input feature map.
  • As shown in FIG. 1 and FIG. 2, a convolution operation circuit includes a processing unit 110, a storage module 120, and a Multiply Accumulate (MAC) array 130, referred to as the MAC array for short.
  • The storage module 120 includes at least one register 121.
  • The processing unit 110 is configured to segment the original convolution kernel 20 to obtain multiple sub-convolution kernels 210.
  • The processing unit 110 may obtain the original convolution kernel 20 from a pruned convolutional neural network, or directly receive convolution kernel data sent by another device and perform pruning to obtain the original convolution kernel 20.
  • The original convolution kernel 20 can be understood as the convolution kernel used for the convolution operation with the data to be convolved, and
  • the sub-convolution kernel 210 can be understood as a convolution kernel obtained by segmenting the original convolution kernel 20.
  • The original convolution kernel 20 and the sub-convolution kernels 210 may be two-dimensional or three-dimensional convolution kernels.
  • A sub-convolution kernel 210 may include multiple channels, and each channel corresponds to a weight plane 211.
  • Each sub-convolution kernel 210 may include multiple weight planes 211, and each weight plane 211 may include multiple weight elements.
  • Each weight element in the weight plane 211 may be stored in the data storage module in planar form.
  • For ease of description, the size of the original convolution kernel is taken as R × S × CH
  • and the size of each sub-convolution kernel as R × S × C.
  • Here, CH is the total number of channels of the original convolution kernel,
  • C is the number of channels of a sub-convolution kernel,
  • and R and S are the numbers of weight elements in the width direction and the height direction of each weight plane, respectively. Further, R and S can be equal.
  • For example, the first weight plane 211 includes four weight elements (w00, w01, w02, w03). If the value of a weight element is 0, the weight element is a zero-value weight; if its value is non-zero, the weight element is a non-zero weight.
  • The processing unit 110 can identify whether each weight element in the weight plane 211 is a zero-value weight or a non-zero weight, and obtain the weight type of the weight plane 211 from the identification result. The weight type characterizes the distribution pattern of zero-value weights in each weight plane 211.
  • The processing unit 110 may then configure the corresponding register 121 according to the weight type of each weight plane 211. After the registers 121 are configured, each register 121 holds corresponding configuration information.
  • The multiply-accumulate array 130 is connected to the processing unit 110 and the storage module 120 respectively, and is configured to respond to the configuration information of the configured registers 121 and perform multiply-accumulate
  • processing on the non-zero weights in each weight plane 211 of the sub-convolution kernel 210 and the data to be convolved 30, where the number of channels of the data to be convolved 30 is the same as the number of channels of the sub-convolution kernel 210.
  • The data to be convolved 30 can be understood as an input feature map.
  • The data to be convolved 30 may also include multiple channels, and each channel may correspond to an input plane 310, that is, a two-dimensional image.
  • The input feature map can be understood as a three-dimensional feature map in which the two-dimensional images of multiple channels are stacked together; its depth equals the number of channels.
  • The number of channels C of the data to be convolved 30 is equal to the number of channels C of the sub-convolution kernel 210.
  • For description, the size of the data to be convolved 30 is taken as H × W × C. That is, each input plane 310 may have a size of W × H, where W and H are the numbers of input elements in the width direction and the height direction of the input plane 310, respectively.
  • The output feature map can be generated by performing the convolution operation between the sub-convolution kernels and the input feature map.
  • The data of the sub-convolution kernels, the input feature map, and the output feature map can be stored in memory in planar form.
  • Taking four sub-convolution kernels as an example, the convolution operation is explained below; convolving the input feature map with one sub-convolution kernel outputs one plane of the output feature map, that is, a two-dimensional image.
  • The number of sub-convolution kernels is equal to the number of channels of the output feature map; that is, the output feature map may include four output planes, each with the same size as the input plane. In other words, after the input feature map is convolved with one sub-convolution kernel, one two-dimensional image is obtained.
  • The multiply-accumulate array 130 can respond to the configuration information of the registers 121 and, according to the configuration information of each register 121, determine whether the weight elements in each weight plane 211 are zero. Based on this determination, it reads only the non-zero weights in each weight plane 211 of each sub-convolution kernel 210, does not read the zero-value weights in the weight plane 211, and multiply-accumulates the read non-zero weights with the corresponding input elements of the input feature map.
  • For example, if the weight elements of a weight plane 211 are (w00, w01, 0, 0), the multiply-accumulate array may, in response to the configuration information of the register 121 storing the weight type of that plane, read only the weight elements w00 and w01, skip the zero-value weights at the positions of w02 and w03, and then perform the multiply-add operation on the read weight elements w00 and w01 and the input elements of the corresponding input plane 310 of the input feature map.
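As a minimal sketch of this zero-skip read, the function below multiply-accumulates only at the positions that the register's configuration marks as non-zero; representing the configuration as a tuple of positions is an assumption made for illustration.

```python
def zero_skip_mac(weights, inputs, nonzero_positions):
    """Multiply-accumulate only at the positions the register marks as non-zero.

    weights, inputs: flat sequences, e.g. (w00, w01, w02, w03) and (a00, a01, a02, a03)
    nonzero_positions: positions of non-zero weights, e.g. (0, 1) for (w00, w01, 0, 0)
    """
    acc = 0
    for p in nonzero_positions:        # zero-value weights are never read
        acc += weights[p] * inputs[p]  # only non-zero weights reach the multiplier
    return acc

# (w00, w01, 0, 0) against (a00, a01, a02, a03): reads only w00 and w01.
result = zero_skip_mac((3, 5, 0, 0), (1, 2, 4, 8), nonzero_positions=(0, 1))
print(result)  # 3*1 + 5*2 = 13
```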
  • The above convolution operation circuit thus includes a processing unit 110, a storage module 120, and a multiply-accumulate array 130: the processing unit 110 segments the original convolution kernel 20 into at least one sub-convolution kernel 210, obtains the weight type of the weight plane 211 corresponding to each channel in each sub-convolution kernel 210, and configures the registers 121 in the storage module 120 according to the weight type, so that the multiply-accumulate array 130 can perform multiply-accumulate processing on the non-zero weights in each weight plane 211 of the sub-convolution kernel 210 and the data to be convolved 30 based on the configuration information of the registers 121.
  • The multiply-accumulate array 130 responds to the configuration information of the registers 121 and, according to the configuration information of each register 121, reads only the non-zero weights in each weight plane 211 of each sub-convolution kernel 210 instead of the zero-value weights, and multiply-accumulates the read non-zero weights with the input feature map.
  • In one embodiment, the processing unit 110 may obtain the position information of the zero-value weights in each weight plane 211 of each sub-convolution kernel 210 and determine the weight type of the weight plane 211 according to that position information.
  • The processing unit 110 may preset a mapping relationship between the position information of the zero-value weights in a weight plane 211 and the weight type.
  • For description, a weight plane 211 including four weight elements w00, w01, w02, and w03 is taken as an example.
  • If the weight elements of the weight plane 211 are (1, 3, 0, 0),
  • it can be determined that the zero-value weights are at the third and fourth positions, and the weight type of the weight plane 211 is the first weight type.
  • If the weight elements are (1, 0, 5, 0), the zero-value weights are at the second and fourth positions, and the weight type of the weight plane 211 is the second weight type.
  • If the weight elements are (0, 0, 5, 2), the zero-value weights are at the first and second positions, and the weight type is the third weight type. If the weight elements are (0, 3, 0, 8), the zero-value weights are at the first and third positions, and the weight type is the fourth weight type, and so on.
  • The determination of the weight type of a weight plane 211 is associated with both the positions and the number of zero-value weights, and is not limited to the above examples.
  • The weight type of a weight plane 211 may also be determined from the number of weight elements in the plane and the distribution pattern of zero-value weights among them. For example, when the weight plane 211 includes nine weight elements w00, w01, w02, ..., w08, the weight type may likewise be determined from the distribution pattern of zero-value weights among these nine elements.
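The position-to-type mapping just described can be sketched as a lookup from the tuple of zero-value weight positions to a weight-type identifier. The table entries follow the four examples in the text; the function names and the 1-based position convention are illustrative.

```python
def zero_positions(plane):
    """Return the (1-based) positions of zero-value weights in a weight plane."""
    return tuple(i + 1 for i, w in enumerate(plane) if w == 0)

# Mapping preset by the processing unit, following the examples above.
WEIGHT_TYPE_BY_ZERO_POSITIONS = {
    (3, 4): "first weight type",   # e.g. (1, 3, 0, 0)
    (2, 4): "second weight type",  # e.g. (1, 0, 5, 0)
    (1, 2): "third weight type",   # e.g. (0, 0, 5, 2)
    (1, 3): "fourth weight type",  # e.g. (0, 3, 0, 8)
}

def weight_type(plane):
    return WEIGHT_TYPE_BY_ZERO_POSITIONS.get(zero_positions(plane))

print(weight_type((1, 3, 0, 0)))  # first weight type
```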
  • In one embodiment, the multiply-accumulate array 130 includes sub-multiply-accumulate arrays 131 arranged in m rows and n columns, where m and n are both positive integers greater than or equal to 1.
  • For example, each sub-multiply-accumulate array 131 may include four 8-bit × 8-bit multiply-accumulate units and can compute vector inner products with a maximum of 256 elements.
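To make that stated capacity concrete, the sketch below models one sub-multiply-accumulate array as four 8-bit × 8-bit units cooperating on a vector inner product of up to 256 elements, accumulating in 32 bits. How the elements are interleaved across the four units is an assumption for illustration.

```python
import numpy as np

def sub_mac_inner_product(w: np.ndarray, a: np.ndarray, units: int = 4) -> int:
    """Inner product of two int8 vectors (up to 256 elements), modelled as
    `units` parallel 8-bit x 8-bit multiply-accumulate units with a 32-bit
    accumulator; each unit handles every `units`-th element."""
    assert w.dtype == np.int8 and a.dtype == np.int8 and len(w) == len(a) <= 256
    partials = [int(np.dot(w[u::units].astype(np.int32), a[u::units].astype(np.int32)))
                for u in range(units)]  # one partial sum per 8-bit x 8-bit unit
    return sum(partials)                # combine the partial sums into one result

w = np.random.randint(-128, 128, size=256, dtype=np.int8)
a = np.random.randint(-128, 128, size=256, dtype=np.int8)
print(sub_mac_inner_product(w, a) == int(np.dot(w.astype(np.int32), a.astype(np.int32))))  # True
```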
  • The number of rows m and the number of columns n of sub-multiply-accumulate arrays 131 can be set according to the sizes of the data to be convolved 30 and the original convolution kernel 20.
  • For example, the value of m is less than or equal to the number of channels C of the sub-convolution kernel 210,
  • and the value of n is less than or equal to the number K of sub-convolution kernels 210.
  • For example, the multiply-accumulate array 130 includes 4 rows and 4 columns of sub-multiply-accumulate arrays 131, and each sub-multiply-accumulate array 131 can be denoted MAC(i, j), where 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4.
  • The sub-multiply-accumulate array 131 in the i-th row and j-th column can receive the weight elements of the i-th weight plane 211 in the j-th sub-convolution kernel 210 and the input elements of the i-th input plane 310 in the data to be convolved 30.
  • The sub-multiply-accumulate array 131 in the i-th row and j-th column can determine the weight type of the i-th weight plane 211 from the configuration information of the register 121 connected to it, thereby determining the zero-value weights and non-zero weights in the i-th weight plane 211, and then reads only the non-zero weights in the i-th weight plane 211 so as to multiply-accumulate those non-zero weights with the input elements of the i-th input plane 310.
  • For example, MAC(1, 1) can obtain the weight elements (w00, w01, w02, w03) of the first weight plane 211 in the first sub-convolution kernel 210 and the input elements (a00, a01, a02, a03) of the first input plane 310.
  • MAC(1, 1) can then be assigned the multiply-accumulate operation on (a00, a01, a02, a03) and (w00, w01, w02, w03).
  • In one embodiment, the storage module 120 includes multiple registers 121, and the number of registers 121 is less than or equal to the number of sub-multiply-accumulate arrays 131.
  • Each sub-multiply-accumulate array 131 can be correspondingly connected to one register 121.
  • The processing unit 110 is further configured to configure each register 121 according to the weight type of the weight plane 211 that the corresponding sub-multiply-accumulate array 131 receives.
  • The configuration information of the register 121 connected to a sub-multiply-accumulate array 131 can thus characterize the weight type of the weight plane 211 of the sub-convolution kernel 210 that this sub-multiply-accumulate array 131 obtains.
  • The processing unit 110 may configure the registers 121 according to the weight types and set the configuration information of each register 121 correspondingly.
  • The configuration information may be represented by the value of the register 121, which may be expressed in binary, octal, or hexadecimal. For ease of description, the value of the register 121 is expressed in binary below.
  • For example, if the weight type is pattern1 (w00, w01, 0, 0), the value of the corresponding register 121 can be configured as 01; if the weight type is pattern2 (0, 0, 0, 0), the value can be configured as 10; if the weight type is pattern3 (0, 0, w02, w03), the value can be configured as 11; and if the weight type is pattern4 (w00, 0, w02, 0), the value can be configured as 100. It should be noted that the values of the register 121 are not limited to the above examples and may be other values.
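A sketch of this configuration step: the processing unit maps each weight type to the small binary code listed above and writes it into the corresponding register, where the MAC array can read it later. The Python register model is illustrative, not taken from the patent.

```python
# Binary codes per weight type, following the pattern1-pattern4 examples above.
REGISTER_CODE = {
    "pattern1": 0b01,   # (w00, w01, 0, 0)
    "pattern2": 0b10,   # (0, 0, 0, 0)
    "pattern3": 0b11,   # (0, 0, w02, w03)
    "pattern4": 0b100,  # (w00, 0, w02, 0)
}

class Register:
    """Toy register: the processing unit writes a code, the MAC array reads it."""
    def __init__(self):
        self.value = 0

    def configure(self, weight_type: str):
        self.value = REGISTER_CODE[weight_type]

reg = Register()
reg.configure("pattern1")
print(bin(reg.value))  # 0b1
```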
  • In one embodiment, the processing unit 110 is connected to each sub-multiply-accumulate array 131 and is also configured to pre-configure the zero-skip working mode of the sub-multiply-accumulate array 131, so that the sub-multiply-accumulate array 131 can respond to the configuration information of the register 121, perform zero-skip processing on each weight element in the weight plane 211, read only the non-zero weights in the weight plane 211, and multiply-accumulate those non-zero weights with the input element block. The processing unit 110 may pre-configure the zero-skip working mode of the sub-multiply-accumulate array 131 according to the weight type of the weight plane 211.
  • In the zero-skip working mode, the sub-multiply-accumulate array 131 does not read the zero-value weights in the weight plane 211 during the multiply-accumulate calculation and directly skips the product operations involving zero-value weights.
  • The zero-skip working mode corresponds to the zero-value weights of the weight plane 211.
  • For example, when the weight type is the first weight type, the sub-multiply-accumulate array 131 reads only the non-zero weights at the first and second positions and multiply-accumulates only those non-zero weights with the corresponding data to be convolved 30.
  • The zero-skip working mode configured for a sub-multiply-accumulate array 131 has a mapping relationship with the configuration information of the register 121 connected to it, so that the sub-multiply-accumulate array 131 can respond to the configuration information of the register 121, work in the corresponding zero-skip mode, perform zero-skip processing on each weight element of the received weight plane 211 to read the non-zero weights in the weight plane 211, and then multiply-accumulate the non-zero weights with the input element block.
  • In one embodiment, the storage module 120 includes C rows and K columns of registers 121, where the value of m is less than or equal to the number of channels C of the sub-convolution kernel 210 and the value of n is less than or equal to the number K of sub-convolution kernels 210.
  • For example, the storage module 120 may include 4 × 4 registers 121.
  • Each register 121 may be denoted Reg(i, j), where 1 ≤ i ≤ 4 and 1 ≤ j ≤ 4. That is, each sub-multiply-accumulate array 131 is configured with an independent register 121, and the sub-multiply-accumulate array 131 in the i-th row and j-th column is connected to the register Reg(i, j).
  • The value of the register Reg(i, j) can represent the weight type of the i-th weight plane 211 in the j-th sub-convolution kernel 210.
  • The sub-multiply-accumulate array MAC(1, 1) is taken as an example. If the weight elements of the first weight plane 211 in the first sub-convolution kernel 210 are (w00, w01, 0, 0), the processing unit 110 may determine that the weight type of the weight plane 211 is the first weight type and configure the value of the register Reg(1, 1) as 1. The sub-multiply-accumulate array MAC(1, 1) can obtain the weight type of the weight plane 211 from the value of the register Reg(1, 1) and determine that the weight elements w02 and w03 are zero-value weights.
  • Accordingly, the sub-multiply-accumulate array MAC(1, 1) reads only the weight elements w00 and w01, not the weight elements w02 and w03.
  • The sub-multiply-accumulate array MAC(1, 1) also reads the input elements (a00, a01, a02, a03) of the first input plane 310 and multiply-accumulates the input elements (a00, a01) corresponding to the non-zero weights (w00, w01) with those weights.
  • The first output plane Y1 is the sum of the calculation results of the sub-multiply-accumulate arrays MAC(1, 1), MAC(2, 1), MAC(3, 1), and MAC(4, 1).
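Putting the pieces together, the sketch below models the m × n grid: MAC(i, j) combines the i-th weight plane of the j-th sub-convolution kernel with the i-th input plane, reading only the positions its register marks as non-zero, and each output plane Yj is the column sum, mirroring the Y1 example above. All data structures are illustrative, and only one output position is computed per plane.

```python
def convolve_grid(sub_kernels, input_planes, nonzero_positions):
    """Model of an m x n grid of sub-multiply-accumulate arrays.

    sub_kernels[j][i]:        flat i-th weight plane of the j-th sub-convolution kernel
    input_planes[i]:          flat input elements of the i-th input plane
    nonzero_positions[i][j]:  positions register Reg(i, j) marks as non-zero
    Returns one partial sum per output plane Yj (a single output position;
    a full convolution would slide this computation over the input).
    """
    n = len(sub_kernels)      # number of sub-kernels = number of output planes
    m = len(input_planes)     # number of channels = rows of the grid
    Y = [0] * n
    for j in range(n):        # column j produces output plane Yj
        for i in range(m):    # Yj is the sum over all m rows (channels)
            for p in nonzero_positions[i][j]:  # zero-value weights are skipped
                Y[j] += sub_kernels[j][i][p] * input_planes[i][p]
    return Y

sub_kernels = [[(3, 5, 0, 0), (0, 2, 0, 4)],   # sub-kernel 1: planes for channels 1, 2
               [(1, 0, 0, 7), (6, 0, 0, 0)]]   # sub-kernel 2
input_planes = [(1, 2, 3, 4), (5, 6, 7, 8)]
nonzero_positions = [[(0, 1), (0, 3)],          # row 1: Reg(1,1), Reg(1,2)
                     [(1, 3), (0,)]]            # row 2: Reg(2,1), Reg(2,2)
print(convolve_grid(sub_kernels, input_planes, nonzero_positions))  # [57, 59]
```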
  • In one embodiment, the storage module 120 includes multiple registers 121, and the number of registers 121 is less than the number of sub-multiply-accumulate arrays 131.
  • The processing unit 110 is also configured to configure the same register 121 for the same group of sub-multiply-accumulate arrays 131, where the weight planes 211 received by
  • the same group have the same weight type and a group includes at least two sub-multiply-accumulate arrays 131. That is, at least two sub-multiply-accumulate arrays 131 can be connected to the same register 121 to realize sharing of a register 121.
  • For example, when the first weight plane 211 of each sub-convolution kernel 210 has the same weight type, the processing unit 110 can configure the value of a first register 121 according to that weight type, and the first register 121 can be connected to each sub-multiply-accumulate array MAC(1, j) in the first row. That is, the sub-multiply-accumulate arrays MAC(1, j) in the first row form one sub-multiply-accumulate array group.
  • Similarly, the processing unit 110 can configure the value of a second register 121, and the second register 121 can be connected to each sub-multiply-accumulate array MAC(i, 1) in the first column. That is, the sub-multiply-accumulate arrays MAC(i, 1) in the first column form one sub-multiply-accumulate array group.
  • In one embodiment, the processor can also control the convolutional neural network to train the convolution kernel data so as to generate an original convolution kernel 20 and sub-convolution kernels 210 with preset weight types.
  • For example, the convolution kernel data can be trained into multiple sub-convolution kernels 210 in which the weight types of all weight planes 211 in each sub-convolution kernel 210 are the same, and the processing unit 110 can configure one register 121 for
  • each group of sub-multiply-accumulate arrays 131.
  • In this case, a group of sub-multiply-accumulate arrays 131 includes the sub-multiply-accumulate arrays 131 located in the same column.
  • That is, the sub-multiply-accumulate arrays 131 in the same column are connected to the same register 121.
  • The processing unit 110 can configure the register 121 according to the weight type of each sub-convolution kernel 210, so that all sub-multiply-accumulate arrays 131 in the same column respond to the value of that register 121 and perform zero-skip processing on the weight elements of the received weight planes 211.
  • Alternatively, the convolution kernel data can be trained into multiple sub-convolution kernels 210 in which the weight types of the k-th weight plane 211 in each sub-convolution kernel 210 are the same, and the processing unit 110 can configure one register 121 for
  • each group of sub-multiply-accumulate arrays 131.
  • In this case, a group of sub-multiply-accumulate arrays 131 includes the sub-multiply-accumulate arrays 131 located in the same row; that is, all sub-multiply-accumulate arrays 131 in the same row are connected to the same register 121.
  • The processing unit 110 can configure the register 121 according to the weight type of the k-th weight plane 211, so that all sub-multiply-accumulate arrays 131 in the same row respond to the value of that register 121 and perform zero-skip processing on the weight elements of the received weight planes 211.
  • In this way, the weight type of each weight plane 211 in each sub-convolution kernel 210 can be constrained during training, the registers 121 can be configured by weight type, and at least two sub-multiply-accumulate arrays 131 that receive the same weight type can be connected to the same register 121 to realize register sharing.
  • The structure of the convolution operation circuit can thereby be further simplified to save cost, as the sketch after this list illustrates.
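The register sharing just described amounts to a many-to-one mapping from grid positions to registers; a toy sketch follows, where the function and its encoding are illustrative and not taken from the patent.

```python
def register_for(i, j, share):
    """Map grid position (i, j) to a shared register index.

    share="column": all MAC(*, j) in column j read register j
                    (every weight plane of sub-kernel j trained to one weight type)
    share="row":    all MAC(i, *) in row i read register i
                    (the i-th weight plane trained to one type in every sub-kernel)
    """
    return j if share == "column" else i

# A 4 x 4 grid then needs only 4 registers instead of 16.
print({(i, j): register_for(i, j, share="row") for i in range(4) for j in range(4)})
```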
  • In one embodiment, the processing unit 110 is further configured to update the weight type of each weight plane 211 in each sub-convolution kernel 210 and reconfigure each register 121 accordingly.
  • The data to be convolved 30 may be part of the data to be processed.
  • For example, the data to be convolved 30 may be at least some of the image blocks of an image to be processed: the current image block to be processed may be a foreground image block, and the next may be a background image block.
  • When the original convolution kernel 20 changes, the processing unit 110 can obtain the weight type of each weight plane 211 in the new sub-convolution kernels 210 and reconfigure the values of the registers 121 according to the new weight types, and each sub-multiply-accumulate array 131 can respond to the value of the register 121 connected to it, perform zero-skip processing on the weight elements of each weight plane 211 in the new sub-convolution kernels 210, read the non-zero weights, and multiply-accumulate them with the new image blocks.
  • When the data to be convolved changes, the processing unit 110 can also adaptively adjust the original convolution kernel 20 data, obtain the weight type of each weight plane 211 in the new sub-convolution kernels 210, and reconfigure the value of each register 121 according to the new weight types; each sub-multiply-accumulate array 131 can then respond to the value of the register 121 connected to it, perform zero-skip processing on the weight elements of each weight plane 211 in the new sub-convolution kernels 210, read the non-zero weights, and multiply-accumulate them with the new image blocks. This can be applied to feature extraction on multiple groups of different image blocks, and during the convolution operation it also reduces the data movement of zero-value weight elements in the weight planes 211, reducing power consumption.
  • In one embodiment, a convolution operation method is provided, which can be applied to the convolution operation circuit in any of the above embodiments.
  • The convolution operation method includes step 902 to step 906.
  • Step 902: segment the original convolution kernel to obtain multiple sub-convolution kernels, and obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel; the weight type characterizes the distribution pattern of zero-value weights in each weight plane.
  • The original convolution kernel can be understood as the convolution kernel used for the convolution operation with the data to be convolved, and
  • the sub-convolution kernel can be understood as a convolution kernel obtained by segmenting the original convolution kernel.
  • A sub-convolution kernel may include multiple channels, and each channel corresponds to a weight plane.
  • Each sub-convolution kernel may include multiple weight planes, and each weight plane may include multiple weight elements.
  • Each weight element in the weight plane can be stored in the data storage module in planar form. If the value of a weight element is 0, it is a zero-value weight; if its value is non-zero, it is a non-zero weight.
  • The processing unit can identify whether each weight element in the weight plane is a zero-value weight or a non-zero weight and obtain the weight type of the weight plane from the identification result.
  • The weight type characterizes the distribution pattern of zero-value weights in each weight plane.
  • Step 904: configure the configuration information of the corresponding registers in the storage module according to the weight type, so that the registers store the weight type of each weight plane in the sub-convolution kernels.
  • The convolution operation circuit can configure the corresponding registers according to the weight type of each weight plane. After the registers are configured, each register holds corresponding configuration information.
  • Step 906: control the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
  • The data to be convolved can be understood as an input feature map.
  • The data to be convolved may also include multiple channels, and each channel may correspond to an input plane, that is, a two-dimensional image.
  • The input feature map can be understood as a three-dimensional feature map in which the two-dimensional images of multiple channels are stacked together; its depth equals the number of channels.
  • The number of channels of the data to be convolved is equal to the number of channels of the sub-convolution kernel.
  • The multiply-accumulate array can be controlled to respond to the configuration information of the registers and, according to the configuration information of each register, determine whether the weight elements in each weight plane are zero. Based on this determination, it reads only
  • the non-zero weights in each weight plane of each sub-convolution kernel, does not read the zero-value weights in the weight plane, and multiply-accumulates the read non-zero weights with the corresponding input elements of the input feature map.
  • The above convolution operation method thus includes: segmenting the original convolution kernel to obtain multiple sub-convolution kernels and obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel; configuring the configuration information of the corresponding registers so that the registers store the weight type of each weight plane in the sub-convolution kernels; and controlling the multiply-accumulate array to respond to the configuration information of the registers and multiply-accumulate the non-zero weights in each weight plane of the sub-convolution kernels with the
  • data to be convolved.
  • During the operation, the multiply-accumulate array responds to the configuration information of the registers and, according to the configuration information of each register, reads only the non-zero weights in each weight plane instead of the zero-value weights, and multiply-accumulates the read non-zero weights with the input feature map. This reduces the data movement of zero-value weight elements in the weight plane, realizes zero-skip processing, reduces power consumption, and improves the efficiency of the convolution operation.
  • The convolution operation method of the embodiments of the present application can be applied to various scenarios, for example, image recognition fields such as face recognition and license plate recognition, feature extraction fields such as image feature extraction and speech feature extraction, the speech recognition field, the natural language processing field, and so on: an image, or an image converted from data in another form, is input into a pre-trained convolutional neural network, and the convolutional neural network then performs the operations to achieve classification, recognition, or feature extraction.
  • In one embodiment, the multiply-accumulate array includes m rows and n columns of sub-multiply-accumulate arrays, the storage module includes multiple registers, and each sub-multiply-accumulate array is connected to a register.
  • Controlling the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved includes:
  • Step 1002: control the sub-multiply-accumulate array to perform zero-skip processing on each zero-value weight in the weight plane according to the configuration information of the register, so as to read the non-zero weights in the weight plane.
  • Step 1004: perform multiply-accumulate processing on the read non-zero weights and the data to be convolved.
  • The multiply-accumulate array can be controlled to respond to the configuration information of the registers and, according to the configuration information of each register, determine whether the weight elements in each weight plane are zero. Based on this determination, it reads only
  • the non-zero weights in each weight plane of each sub-convolution kernel, does not read the zero-value weights, and multiply-accumulates the read non-zero weights with the corresponding input elements of the input feature map.
  • For example, if the weight elements of a weight plane are (w00, w01, 0, 0), the multiply-accumulate array may, in response to the configuration information of the register storing the weight type of that plane, read only the weight elements w00 and w01, skip the zero-value weights at the positions of w02 and w03, and then perform the multiply-add operation on the read weight elements w00 and w01 and the input elements of the corresponding input plane of the input feature map.
  • In one embodiment, step 1000 is further included: configure the zero-skip working mode of the sub-multiply-accumulate array according to the configuration information of the register.
  • In the zero-skip working mode, the sub-multiply-accumulate array does not read the zero-value weights in the weight plane during the multiply-accumulate calculation and directly skips the product operations involving zero-value weights.
  • The zero-skip working mode corresponds to the zero-value weights of the weight plane.
  • For example, when the weight type is the first weight type,
  • the sub-multiply-accumulate array reads only the non-zero weights at the first and second positions and multiply-accumulates only those non-zero weights with the corresponding data to be convolved.
  • The zero-skip working mode configured for a sub-multiply-accumulate array has a mapping relationship with the configuration information of the register connected to it, so that the sub-multiply-accumulate array can respond to the configuration information of the register and work in the corresponding zero-skip mode,
  • performing zero-skip processing on each weight element of the received weight plane to read the non-zero weights in the weight plane, which can then be multiply-accumulated with the input element block.
  • In one embodiment, obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel includes: obtaining the position information of the zero-value weights in each weight plane of each sub-convolution kernel, and determining the weight type of the weight plane according to that position information.
  • The embodiments of the present application take a weight plane including four weight elements w00, w01, w02, and w03 as an example.
  • If the weight elements of the weight plane are (1, 3, 0, 0),
  • it can be determined that the zero-value weights are at the third and fourth positions,
  • and the weight type of the weight plane is the first weight type.
  • If the weight elements are (1, 0, 5, 0),
  • the zero-value weights are at the second and fourth positions, and the weight type of the weight plane is the second weight type.
  • If the weight elements are (0, 0, 5, 2), the zero-value weights are at the first and second positions, and the weight type is the third weight type. If the weight elements are (0, 3, 0, 8), the zero-value weights are at the first and third positions, and the weight type is the fourth weight type, and so on. It should be noted that the determination of the weight type of a weight plane is associated with both the positions and the number of zero-value weights, and is not limited to the above examples.
  • The weight type of a weight plane may also be determined from the number of weight elements in the plane and the distribution pattern of zero-value weights among them. For example, when the weight plane includes nine weight elements such as w00, w01, w02, ..., w08, the weight type may likewise be determined from the distribution pattern of zero-value weights among these nine elements.
  • In another embodiment, the convolution operation method includes step 1102 to step 1110.
  • Step 1102: segment the original convolution kernel to obtain multiple sub-convolution kernels, and obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel; the weight type characterizes the distribution pattern of zero-value weights in each weight plane.
  • Step 1104: configure the configuration information of the corresponding registers in the storage module according to the weight type, so that the registers store the weight type of each weight plane in the sub-convolution kernels.
  • Step 1106: control the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
  • Steps 1102 to 1106 correspond one-to-one to steps 902 to 906 in the foregoing embodiment, and their details are not repeated here.
  • Step 1108: when the data to be convolved changes, determine whether the weight type of each weight plane in each sub-convolution kernel needs to be updated.
  • Step 1110: if an update is needed, reconfigure the registers according to the updated weight type.
  • The data to be convolved may be part of the data to be processed.
  • For example, the data to be convolved may be at least some of the image blocks of an image to be processed:
  • the current image block to be processed may be a foreground image block,
  • and the next may be a background image block.
  • In that case, the weight type of each weight plane in the new sub-convolution kernels can be obtained and the value of each register reconfigured according to the new weight types, and step 1106 can then be repeated to control each sub-multiply-accumulate array to respond to
  • the value of the register connected to it and perform zero-skip processing on the weight elements of each weight plane in the new sub-convolution kernels, so as to read the non-zero weights and multiply-accumulate them with the new image blocks.
  • With the convolution operation method of this embodiment, if the data to be convolved changes, the original convolution kernel data can be adjusted adaptively: the weight type of each weight plane in the new sub-convolution kernels is obtained, the value of each register is reconfigured according to the new weight types, and each sub-multiply-accumulate array is controlled to respond to the value of the register connected to it and perform zero-skip processing on the weight elements of each weight plane in the new sub-convolution kernels, so as to read the non-zero weights and multiply-accumulate them with the new image blocks. This can be applied to feature extraction on multiple sets of different image blocks.
  • During the convolution operation, it also reduces the data movement of zero-value weight elements in the weight plane, reducing power consumption.
  • It should be understood that the steps in FIGS. 9 to 11 may include multiple steps or stages. These steps or stages are not necessarily performed at the same moment but may be performed at different moments, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least part of the steps or stages of other steps.
  • In one embodiment, a neural network processor is provided, including a data storage module and the convolution operation circuit of any of the foregoing embodiments.
  • The data storage module stores the original convolution kernel and the data to be convolved.
  • The convolution operation circuit obtains the original convolution kernel and the data to be convolved through the data storage module.
  • In one embodiment, a neural network accelerator is provided, including a data storage module 40 and the convolution operation circuit 10 of any of the foregoing embodiments.
  • The data storage module 40 stores the original convolution kernel and the data to be convolved.
  • The convolution operation circuit 10 obtains the original convolution kernel and the data to be convolved through the data storage module 40.
  • The data stored in the data storage module 40 may also be a processing result; in other words, the data stored in the data storage module 40 may be data obtained after the processing unit has processed the data to be convolved. It should be noted that the data actually stored in the data storage module 40 is not limited thereto, and the data storage module 40 may also store other data.
  • During the convolution operation, the multiply-accumulate array can respond to the configuration information of the registers and, according to the configuration information of each register, read only the non-zero weights in each weight plane of each sub-convolution kernel instead of the zero-value weights, and multiply-accumulate the read non-zero weights with the input feature map. Reducing the data movement of zero-value weight elements in the weight plane not only saves
  • storage space in the data storage module 40 but also reduces accesses to the data storage module 40, reduces power consumption, and improves the computing efficiency of the neural network processor or neural network accelerator.
  • The convolution operation circuit of any of the above embodiments can also be applied to a neural network processor with a MAC array.
  • For example, the neural network processor can work in always-on mode while still meeting
  • the design requirement that the overall current of the device be less than 5 mA.
  • The convolution operation circuit of any of the above embodiments can also be applied to any neural network accelerator that uses a matrix operation unit as the basic unit, for example for convolution and matrix multiplication.
  • The convolution operation circuit of any of the above embodiments can also be applied to a neural network accelerator with a systolic array.
  • In one embodiment, an electronic device 100 is provided, including a system bus and the neural network accelerator or neural network processor of any of the foregoing embodiments.
  • The data storage module 40 and the convolution operation circuit 10 in the neural network accelerator or neural network processor are respectively connected to the system bus.
  • The neural network processor or neural network accelerator in the embodiments of the present application may also be integrated with other processors, memories, and the like into one chip.
  • The electronic device also includes a central processing unit 50 and an external memory 60 connected through the system bus.
  • The central processing unit 50 provides computing and control capabilities and supports the operation of the entire electronic device.
  • The external memory 60 may include a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program.
  • The computer program can be executed by a processor to implement the convolution operation method provided in the embodiments of the present application.
  • The electronic device can be any terminal device, such as a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sale) terminal, a vehicle-mounted computer, or a wearable device.
  • In one embodiment, an electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor, when executing the computer program, implements the convolution operation method of any of the above embodiments.
  • In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed by a processor, the convolution operation method of any of the above embodiments is implemented.
  • In one embodiment, a computer program product containing instructions is provided; when the instructions are run on a computer, they cause the computer to execute the convolution operation method of any of the above embodiments.
  • Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, and the like.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A convolution operation circuit, including: a storage module (120) including at least one register (121); a processing unit (110) configured to segment an original convolution kernel (20) to obtain at least one sub-convolution kernel (210), obtain the weight type of the weight plane (211) corresponding to each channel in each sub-convolution kernel (210), and configure the corresponding register (121) according to the weight type, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane (211); and a multiply-accumulate array (130), connected to the processing unit (110) and the register (121) respectively, configured to respond to the configuration information of the configured register (121) and perform multiply-accumulate processing on the non-zero weights in each weight plane (211) of the sub-convolution kernel (210) and the data to be convolved.

Description

Convolution operation circuit and method, neural network accelerator, and electronic device
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese Patent Application No. 2021110307957, entitled "Convolution operation circuit and method, neural network accelerator, and electronic device", filed with the Chinese Patent Office on September 3, 2021, the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
The present application relates to the technical field of electronic devices, and in particular to a convolution operation circuit and method, a convolutional neural network accelerator, an electronic device, a computer storage medium, and a computer program product.
BACKGROUND
The statements herein merely provide background information related to the present application and do not necessarily constitute exemplary prior art.
A convolutional neural network (CNN) is a class of feedforward neural network that includes convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning. In a convolutional neural network, the convolution operation is generally completed by laying out a Multiply Accumulate (MAC) array to perform the multiply-accumulate operations. However, in the related art, the power consumption of MAC-array-based convolution operations in convolutional neural networks is relatively high.
SUMMARY
According to various embodiments of the present application, a convolution operation circuit and method, a convolutional neural network accelerator, an electronic device, a computer storage medium, and a computer program product are provided.
In a first aspect, the embodiments of the present application provide a convolution operation circuit, including:
a storage module including at least one register;
a processing unit configured to segment an original convolution kernel to obtain at least one sub-convolution kernel, obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, and configure the corresponding register according to the weight type, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane; and
a multiply-accumulate array, connected to the processing unit and the register respectively, configured to respond to the configuration information of the configured register and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernel and the data to be convolved.
The above convolution operation circuit includes a processing unit, a storage module, and a multiply-accumulate array. The processing unit can segment the original convolution kernel to obtain at least one sub-convolution kernel, obtain the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, and output corresponding configuration information according to the weight type to configure each register in the storage module, so that the multiply-accumulate array can perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernel and the data to be convolved based on the configuration information of the registers. In this way, during the convolution operation, the multiply-accumulate array can respond to the configuration information of the registers and, according to the configuration information of each register, read only the non-zero weights in each weight plane of each sub-convolution kernel instead of the zero-value weights, and multiply-accumulate the read non-zero weights with the input feature map. This reduces the data movement of zero-value weight elements in the weight plane, realizes zero-skip processing, and can reduce power consumption. In addition, it avoids adding an extra zero-skip circuit in the multiply-accumulate array, as in the related art, to determine whether a received weight element is a zero-value weight, which further removes the power overhead of the zero-skip circuit itself and, while saving power, further simplifies the structural design of the convolution operation circuit.
In a second aspect, the embodiments of the present application provide a convolution operation method, including:
segmenting an original convolution kernel to obtain multiple sub-convolution kernels, and obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane;
configuring the configuration information of the corresponding registers in the storage module according to the weight type, so that the registers store the weight type of each weight plane in the sub-convolution kernels; and
controlling the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
The convolution operation method provided in the present application is applied to the convolution operation circuit provided in the first aspect above and can likewise achieve the beneficial effects of that circuit.
In a third aspect, the embodiments of the present application provide a neural network accelerator, including:
a data storage module configured to store the original convolution kernel and the input element blocks of the data to be convolved; and
the aforementioned convolution operation circuit, which obtains the original convolution kernel and the input element blocks through the data storage module.
In a fourth aspect, the embodiments of the present application provide an electronic device, including:
a system bus; and
the aforementioned neural network accelerator, which is connected to the system bus.
In a fifth aspect, the embodiments of the present application provide an electronic device including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the following steps:
segmenting an original convolution kernel to obtain multiple sub-convolution kernels, and obtaining the weight type of the weight plane corresponding to each channel in each sub-convolution kernel, where the weight type characterizes the distribution pattern of zero-value weights in each weight plane;
configuring the configuration information of the corresponding registers in the storage module according to the weight type, so that the registers store the weight type of each weight plane in the sub-convolution kernels; and
controlling the multiply-accumulate array to respond to the configuration information of the registers and perform multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
In a sixth aspect, the embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the same steps as above are implemented.
In a seventh aspect, the embodiments of the present application provide a computer program product on which a computer program is stored; when the computer program is executed by a processor, the same steps as above are implemented.
It can be understood that, for the beneficial effects achievable by the neural network accelerator of the third aspect and the electronic device of the fourth aspect, reference may be made to the beneficial effects of the convolution operation circuit of the first aspect and any of its embodiments, which are not repeated here. For the beneficial effects achievable by the electronic device of the fifth aspect, the computer-readable storage medium of the sixth aspect, and the computer program product of the seventh aspect, reference may be made to the beneficial effects of the convolution operation method of the second aspect, which are not repeated here.
The details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the specification, the drawings, and the claims.
Brief description of drawings
To describe the technical solutions in the embodiments of the present application or the conventional art more clearly, the drawings required in the description of the embodiments or the conventional art are briefly introduced below. Obviously, the drawings in the following description are merely some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is the first schematic structural diagram of a convolution operation circuit in an embodiment;
FIG. 2 is the second schematic structural diagram of a convolution operation circuit in an embodiment;
FIG. 3 is a schematic diagram of splitting an original convolution kernel into sub-convolution kernels in an embodiment;
FIG. 4 is a schematic diagram of the convolution operation between an input feature map and a plurality of sub-convolution kernels in an embodiment;
FIG. 5 is the third schematic structural diagram of a convolution operation circuit in an embodiment;
FIG. 6 is a schematic diagram of the convolution operation of a convolution operation circuit in an embodiment;
FIG. 7 is the fourth schematic structural diagram of a convolution operation circuit in an embodiment;
FIG. 8 is the fifth schematic structural diagram of a convolution operation circuit in an embodiment;
FIG. 9 is a schematic flowchart of a convolution operation method in an embodiment;
FIG. 10 is a schematic flowchart, in an embodiment, of controlling the multiply-accumulate array to perform, in response to the configuration information of the registers, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved;
FIG. 11 is a schematic flowchart of a convolution operation method in another embodiment;
FIG. 12 is a structural block diagram of a neural network accelerator in an embodiment;
FIG. 13 is a structural block diagram of an electronic device in an embodiment.
Detailed description
To facilitate understanding of the present application, the present application is described more fully below with reference to the accompanying drawings, in which embodiments of the present application are shown. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of the present application will be thorough and complete.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application.
It can be understood that the terms "first" and "second" used in the present application are for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature defined by "first" or "second" may explicitly or implicitly include at least one such feature. The terms "first", "second", and the like may be used herein to describe various elements, but these elements are not limited by these terms; the terms are only used to distinguish one element from another. In addition, in the description of the present application, "a plurality of" means at least two, for example two or three, unless otherwise explicitly and specifically defined; "several" means at least one, for example one or two, unless otherwise explicitly and specifically defined.
It should also be understood that the terms "include/comprise" and "have" specify the presence of the stated features, wholes, steps, operations, components, parts, or combinations thereof, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, components, parts, or combinations thereof. Meanwhile, the term "and/or" used in this specification includes any and all combinations of the associated listed items.
In a neural network processor or neural network accelerator, a MAC array is selected to perform convolution processing, e.g., multiply-accumulate processing, on convolution kernels and input feature maps. In the related art, each MAC unit in the MAC array is provided with an additional hardware zero-skipping circuit that performs zero-skipping on the weights of the sub-convolution kernels and the data of the input feature map, so as to skip the multiply-add operations of MAC units receiving zero-valued weights or zero-valued input data and thereby avoid toggling of those MAC units. In the related art, however, the hardware zero-skipping circuit of each MAC unit itself consumes power during operation, which increases the overall power consumption of the multiply-accumulate operation.
For the above reasons, the present application provides a convolution operation circuit. During the convolution operation, its multiply-accumulate array reads only the non-zero weights in each weight plane of each sub-convolution kernel, does not read the zero-valued weights in the weight planes, and performs multiply-accumulate processing on the read non-zero weights and the input feature map. This reduces the data movement of zero-valued weight elements in the weight planes, realizes zero-skipping, and reduces power consumption. It also avoids the extra zero-skipping circuits added to the multiply-accumulate array in the related art to determine whether a received weight element is a zero-valued weight, which further reduces the power overhead of the zero-skipping circuits themselves and, while saving power, further simplifies the structural design of the convolution operation circuit.
In one embodiment, as shown in FIG. 1 and FIG. 2, a convolution operation circuit is provided. The convolution operation circuit includes a processing unit 110, a storage module 120, and a Multiply Accumulate (MAC) array 130, referred to as the MAC array for short. The storage module 120 includes at least one register 121.
The processing unit 110 is configured to split the original convolution kernel 20 into a plurality of sub-convolution kernels 210. The processing unit 110 may obtain the original convolution kernel 20 from a pruned convolutional neural network, or directly receive convolution kernel data sent by another device and obtain the original convolution kernel 20 after pruning it. The original convolution kernel 20 can be understood as the convolution kernel used for the convolution operation with the data to be convolved, and a sub-convolution kernel 210 can be understood as a convolution kernel obtained by splitting the original convolution kernel 20. The original convolution kernel 20 and the sub-convolution kernels 210 may be two-dimensional or three-dimensional convolution kernels.
In an example, a sub-convolution kernel 210 may include a plurality of channels, each channel corresponding to one weight plane 211. Each sub-convolution kernel 210 may include a plurality of weight planes 211, and each weight plane 211 may include a plurality of weight elements. The weight elements within a weight plane 211 may be stored in planar form in the storage module 120.
As shown in FIG. 3, for ease of description, the embodiments of the present application take an original convolution kernel of size R×S×CH and sub-convolution kernels of size R×S×C as an example, where CH is the total number of channels of the original convolution kernel, C is the number of channels of a sub-convolution kernel, and C≤CH, so that the number of sub-convolution kernels is K=CH/C. R and S are the numbers of weight elements in the width direction and the height direction of each weight plane, respectively. Further, R and S may be equal.
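To make the splitting concrete, the following Python sketch (an informal illustration only; the function name split_kernel and the use of NumPy arrays are assumptions for demonstration, not part of the claimed circuit) divides an R×S×CH kernel into K = CH/C channel blocks:

```python
import numpy as np

def split_kernel(kernel, C):
    """Split an (R, S, CH) kernel into K = CH // C sub-kernels of shape (R, S, C)."""
    R, S, CH = kernel.shape
    assert CH % C == 0, "C must divide the total channel count CH"
    K = CH // C
    # Each sub-kernel takes a contiguous block of C channels.
    return [kernel[:, :, k * C:(k + 1) * C] for k in range(K)]

kernel = np.random.randn(3, 3, 16)   # R = S = 3, CH = 16
subs = split_kernel(kernel, C=4)     # K = 4 sub-kernels of shape (3, 3, 4)
print(len(subs), subs[0].shape)      # 4 (3, 3, 4)
```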
Illustratively, suppose the first weight plane 211 includes four weight elements, e.g., (w00, w01, w02, w03). If the weight value of a weight element is 0, the weight element is a zero-valued weight; if the weight value is non-zero, the weight element is a non-zero weight. The processing unit 110 can identify the zero-valued weights and non-zero weights among the weight elements within a weight plane 211 and correspondingly obtain the weight type of the weight plane 211 according to the identification result, where the weight type is used to characterize the distribution pattern of zero-valued weights in each weight plane 211. The processing unit 110 can also configure the corresponding registers 121 according to the weight types of the weight planes 211; after configuration, each register 121 has corresponding configuration information.
Referring still to FIG. 1, the multiply-accumulate array 130 is connected to the processing unit 110 and the storage module 120, respectively, and is configured to perform, in response to the configuration information of the configured registers 121, multiply-accumulate processing on the non-zero weights in each weight plane 211 of the sub-convolution kernels 210 and the data to be convolved 30, where the number of channels of the data to be convolved 30 is the same as the number of channels of a sub-convolution kernel 210.
The data to be convolved 30 can be understood as an input feature map. The data to be convolved 30 may also include a plurality of channels, each channel corresponding to one input plane 310, i.e., one two-dimensional image. When the number of channels of the input feature map is greater than 1, the input feature map can be understood as a three-dimensional feature map in which the two-dimensional images of the channels are stacked together, with a depth equal to the number of channels. The number of channels C of the data to be convolved 30 equals the number of channels C of a sub-convolution kernel 210. For ease of description, the embodiments of the present application take data to be convolved 30 of size H×W×C as an example; that is, each input plane 310 may have a size of W×H, where W and H indicate the numbers of input elements of the input plane 310 in the width direction and the height direction, respectively.
As shown in FIG. 4, an output feature map can be generated by performing the convolution operation between the sub-convolution kernels and the input feature map. In the example of FIG. 4, the data of the sub-convolution kernels, the input feature map, and the output feature map can be stored in memory in planar form. For ease of description, the embodiments of the present application take a four-channel input feature map of 5×5 elements and four four-channel sub-convolution kernels of 3×3 elements as an example to illustrate the convolution operation that produces the output feature map, i.e., two-dimensional images.
The four 3×3×4 sub-convolution kernels move in turn over the 5×5×4 input feature map, producing a sliding window on the input feature map. The interval of each move is called the stride, and the stride is smaller than the shortest width of the convolution kernel data. Each time the window moves, a convolution operation of the size of the convolution kernel data is performed on the corresponding data within the window, and the final result is called the output feature value, or output feature map. The number of sub-convolution kernels equals the number of channels of the output feature map; that is, the output feature map may include four output planes, each output plane having the same size as an input plane. In other words, convolving the input feature map with one sub-convolution kernel yields one two-dimensional image.
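The sliding-window computation described above can be sketched as follows (a simplified reference model without padding, so the output plane here is 3×3; with "same" padding the output plane would match the input plane size as stated above; the name conv_plane and all shapes are illustrative assumptions):

```python
import numpy as np

def conv_plane(inputs, sub_kernel, stride=1):
    """Slide one (R, S, C) sub-kernel over an (H, W, C) input to get one output plane."""
    H, W, C = inputs.shape
    R, S, _ = sub_kernel.shape
    out_h = (H - R) // stride + 1
    out_w = (W - S) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            window = inputs[y * stride:y * stride + R, x * stride:x * stride + S, :]
            out[y, x] = np.sum(window * sub_kernel)  # one kernel-sized multiply-accumulate
    return out

inputs = np.random.randn(5, 5, 4)                      # H = W = 5, C = 4
sub_kernels = [np.random.randn(3, 3, 4) for _ in range(4)]
planes = [conv_plane(inputs, k) for k in sub_kernels]  # four output planes, one per sub-kernel
print(planes[0].shape)                                 # (3, 3) without padding
```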
During the convolution operation, the multiply-accumulate array 130 can, in response to the configuration information of the registers 121, determine according to the configuration information of each register 121 whether the weight elements within each weight plane 211 are zero, and according to the determination result, read only the non-zero weights in each weight plane 211 of each sub-convolution kernel 210 without reading the zero-valued weights, and then perform multiply-accumulate processing on the read non-zero weights and the corresponding input elements of the input feature map. Illustratively, if the elements of a certain weight plane 211 are (w00, w01, 0, 0), the multiply-accumulate array can, in response to the configuration information of the register 121 storing the weight type of this weight plane 211, read only the weight elements w00 and w01 without reading the zero-valued weights at the positions of w02 and w03, and then perform the multiply-add operation based on the read weight elements w00 and w01 and the corresponding input elements in the input plane 310 of the input feature map.
The above convolution operation circuit includes the processing unit 110, the storage module 120, and the multiply-accumulate array 130. The processing unit 110 can split the original convolution kernel 20 into at least one sub-convolution kernel 210, obtain the weight type of the weight plane 211 corresponding to each channel of each sub-convolution kernel 210, and configure the registers 121 in the storage module 120 according to the weight types, so that the multiply-accumulate array 130 can perform multiply-accumulate processing on the non-zero weights in each weight plane 211 of the sub-convolution kernels 210 and the data to be convolved 30 based on the configuration information of the registers 121. In this way, during the convolution operation, the multiply-accumulate array 130 can, in response to the configuration information of each register 121, read only the non-zero weights in each weight plane 211 of each sub-convolution kernel 210 without reading the zero-valued weights, and perform multiply-accumulate processing on the read non-zero weights and the input feature map. This reduces the data movement of zero-valued weight elements in the weight planes 211, realizes zero-skipping, and reduces power consumption. It also avoids the extra zero-skipping circuits added to the multiply-accumulate array 130 in the related art to determine whether a received weight element is a zero-valued weight, which further reduces the power overhead of the zero-skipping circuits themselves and, while saving power, further simplifies the structural design of the convolution operation circuit.
In one embodiment, the processing unit 110 can also obtain the position information of the zero-valued weights in each weight plane 211 of each sub-convolution kernel 210 and determine the weight type of the weight plane 211 according to the position information. The processing unit 110 may preset a mapping relationship between the position information of zero-valued weights in a weight plane 211 and the weight type.
For ease of description, the embodiments of the present application take a weight plane 211 including four weight elements w00, w01, w02, w03 as an example. Illustratively, if the weight elements of the weight plane 211 are (1, 3, 0, 0), the zero-valued weights are at the third and fourth positions, and the weight type of this weight plane 211 can be determined to be a first weight type. If the weight elements are (1, 0, 5, 0), the zero-valued weights are at the second and fourth positions, and the weight type can be determined to be a second weight type. If the weight elements are (0, 0, 5, 2), the zero-valued weights are at the first and second positions, and the weight type can be determined to be a third weight type. If the weight elements are (0, 3, 0, 8), the zero-valued weights are at the first and third positions, and the weight type can be determined to be a fourth weight type, and so on.
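One possible software realization of such a position-based classification (a hedged sketch: the patent does not prescribe an encoding, and the bitmask scheme and the name weight_type below are assumptions chosen for illustration) derives the weight type directly from the zero/non-zero mask of a flattened plane:

```python
def weight_type(plane):
    """Encode the zero-weight distribution of a flattened weight plane as a bitmask.

    Bit i is 1 when the i-th weight element is non-zero, so every distinct
    distribution of zero-valued weights maps to a distinct weight type.
    """
    mask = 0
    for i, w in enumerate(plane):
        if w != 0:
            mask |= 1 << i
    return mask

print(weight_type([1, 3, 0, 0]))  # 0b0011 -> type 3: zeros at positions 3 and 4
print(weight_type([1, 0, 5, 0]))  # 0b0101 -> type 5: zeros at positions 2 and 4
print(weight_type([0, 0, 5, 2]))  # 0b1100 -> type 12: zeros at positions 1 and 2
```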
It should be noted that the determination of the weight type of a weight plane 211 is associated with the positions and the number of zero-valued weights and is not limited to the above examples. The weight type of a weight plane 211 can also be determined according to the number of weight elements in the weight plane 211 and the distribution pattern of zero-valued weights among them. For example, when a weight plane 211 includes nine weight elements w00, w01, w02, ..., w08, its weight type can likewise be determined based on the number and position information of the zero-valued weights in the weight plane 211.
As shown in FIG. 5, in one embodiment, the multiply-accumulate array 130 includes m rows and n columns of sub-multiply-accumulate arrays 131, where m and n are both positive integers greater than or equal to 1. A sub-multiply-accumulate array 131 may include four 8-bit×8-bit multiply-add units, and each sub-multiply-accumulate array 131 can compute a vector inner product with a maximum of 256 elements. The number of rows m and the number of columns n of the sub-multiply-accumulate arrays 131 can be set according to the sizes of the data to be convolved 30 and the original convolution kernel 20: m is less than or equal to the number of channels C of a sub-convolution kernel 210, and n is less than or equal to the number K of sub-convolution kernels 210.
In the embodiments of the present application, for ease of description, m=n=C=K=4 is taken as an example. The multiply-accumulate array 130 then includes 4 rows and 4 columns of sub-multiply-accumulate arrays 131, and each sub-multiply-accumulate array 131 is denoted MAC(i,j), where 1≤i≤4 and 1≤j≤4.
In one embodiment, the sub-multiply-accumulate array 131 in the i-th row and j-th column can respectively receive the weight elements of the i-th weight plane 211 of the j-th sub-convolution kernel 210 and the input elements of the i-th input plane 310 of the data to be convolved 30. The sub-multiply-accumulate array 131 in the i-th row and j-th column can determine the weight type of the i-th weight plane 211 according to the configuration information of the register 121 connected to it, so as to determine the zero-valued weights and non-zero weights in the i-th weight plane 211, and then read only the non-zero weights of the i-th weight plane 211 to perform multiply-accumulate processing on the non-zero weights and the input elements of the i-th input plane 310. Illustratively, MAC(1,1) can respectively obtain the weight elements (w00, w01, w02, w03) of the first weight plane 211 of the first sub-convolution kernel 210 and the input elements (a00, a01, a02, a03) of the first input plane 310 of the data to be convolved 30, and MAC(1,1) can be assigned to perform the multiply-add matrix operation on (a00, a01, a02, a03) and (w00, w01, w02, w03).
The storage module 120 includes a plurality of registers 121, the number of registers 121 being less than or equal to the number of sub-multiply-accumulate arrays 131, and a sub-multiply-accumulate array 131 can be correspondingly connected to one register 121. For the register 121 connected to each sub-multiply-accumulate array 131, the processing unit 110 is further configured to configure that register 121 according to the weight type of the weight plane 211 that the sub-multiply-accumulate array 131 can receive; that is, the configuration information of the register 121 connected to a sub-multiply-accumulate array 131 can characterize the weight type of the weight plane 211 of the sub-convolution kernel 210 that the sub-multiply-accumulate array 131 can obtain. By configuring the registers 121 according to the weight types, the processing unit 110 can correspondingly configure the configuration information of each register 121. The configuration information can be characterized by the value of the register 121, which may be expressed in binary, octal, or hexadecimal. Taking binary representation as an example: if the weight type is pattern1 (w00, w01, 0, 0), the corresponding register 121 may be set to the value 01; if the weight type is pattern2 (0, 0, 0, 0), the corresponding register 121 may be set to 10; if the weight type is pattern3 (0, 0, w02, w03), the corresponding register 121 may be set to 11; and if the weight type is pattern4 (w00, 0, w02, 0), the corresponding register 121 may be set to 100. It should be noted that the values of the registers 121 are not limited to the above examples and may be other values.
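The pattern-to-register mapping above can be mirrored by a small configuration table (a sketch of one possible encoding: the binary values follow the pattern1-pattern4 examples in the text, only those four patterns are tabulated, and the dictionary lookup and the names PATTERN_TO_REG and configure_register are illustrative assumptions):

```python
# Zero-distribution patterns keyed by which positions hold non-zero weights.
# Values follow the binary register encodings given in the text above.
PATTERN_TO_REG = {
    (True, True, False, False): 0b01,    # pattern1: (w00, w01, 0, 0)
    (False, False, False, False): 0b10,  # pattern2: (0, 0, 0, 0)
    (False, False, True, True): 0b11,    # pattern3: (0, 0, w02, w03)
    (True, False, True, False): 0b100,   # pattern4: (w00, 0, w02, 0)
}

def configure_register(plane):
    """Return the register value encoding the plane's zero-weight pattern."""
    key = tuple(w != 0 for w in plane)
    return PATTERN_TO_REG[key]

print(bin(configure_register([7, 2, 0, 0])))  # 0b1   -> pattern1
print(bin(configure_register([3, 0, 5, 0])))  # 0b100 -> pattern4
```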
In one embodiment, the processing unit 110 is connected to each sub-multiply-accumulate array 131 and is further configured to pre-configure the zero-skipping work mode of the sub-multiply-accumulate arrays 131, so that a sub-multiply-accumulate array 131 can, in response to the configuration information of its register 121, perform zero-skipping on the weight elements of the weight plane 211, thereby reading only the non-zero weights of the weight plane 211 and performing multiply-accumulate processing on the non-zero weights and the input element block. The processing unit 110 can pre-configure the zero-skipping work mode of a sub-multiply-accumulate array 131 according to the weight type of the weight plane 211. The zero-skipping work mode can be understood as follows: during the multiply-add computation, the sub-multiply-accumulate array 131 does not read the zero-valued weights in the weight plane 211 and directly skips the multiplications involving zero-valued weights. The zero-skipping work mode corresponds to the zero-valued weights of the weight plane 211. Illustratively, if the weight type is the first weight type, the sub-multiply-accumulate array 131 reads only the non-zero weights at the first and second positions and performs multiply-accumulate processing only on those non-zero weights and the corresponding data to be convolved 30.
The zero-skipping work mode configured for a sub-multiply-accumulate array 131 has a mapping relationship with the configuration information of the register 121 connected to that sub-multiply-accumulate array 131, so that the sub-multiply-accumulate array 131 can, in response to the configuration information of the register 121, work in the corresponding zero-skipping work mode and perform zero-skipping on the obtained weight elements of the weight plane 211, so as to read the non-zero weights of the weight plane 211 and then perform multiply-accumulate processing on the non-zero weights and the input element block.
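Functionally, a sub-multiply-accumulate array operating in this zero-skipping work mode behaves like the following sketch (a behavioral model only, not a hardware description; the bitmask register encoding reuses the assumption of the earlier weight_type sketch):

```python
def mac_with_zero_skip(reg_mask, weights, inputs):
    """Behavioral model of a sub-MAC array in zero-skipping mode.

    reg_mask: register configuration, bit i set when weight i is non-zero.
    Only positions flagged non-zero are read and multiplied; zero-valued
    weights are never fetched, so no data movement occurs for them.
    """
    acc = 0
    for i, a in enumerate(inputs):
        if reg_mask & (1 << i):      # skip positions the register marks as zero
            acc += weights[i] * a    # read and accumulate non-zero weights only
    return acc

# Weight plane (w00, w01, 0, 0): register mask 0b0011.
print(mac_with_zero_skip(0b0011, [2, 3, 0, 0], [5, 7, 9, 11]))  # 2*5 + 3*7 = 31
```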
Referring still to FIG. 5, in one embodiment, the storage module 120 includes C rows and K columns of registers 121, where m is less than or equal to the number of channels C of a sub-convolution kernel 210, and n is less than or equal to the number K of sub-convolution kernels 210. Illustratively, the storage module 120 may include 4×4 registers 121, each denoted Reg(i,j), where 1≤i≤4 and 1≤j≤4. That is, each sub-multiply-accumulate array 131 is configured with an independent register 121, and the sub-multiply-accumulate array 131 in the i-th row and j-th column is connected to the register Reg(i,j). The value of Reg(i,j) can characterize the weight type of the i-th weight plane 211 of the j-th sub-convolution kernel 210.
In the embodiments of the present application, for ease of description, the sub-multiply-accumulate array MAC(1,1) is taken as an example. If the weight elements of the first weight plane 211 of the first sub-convolution kernel 210 are (w00, w01, 0, 0), the processing unit 110 can determine that the weight type of this weight plane 211 is the first weight type and set the value of register Reg(1,1) to 1. MAC(1,1) can obtain the weight type of the weight plane 211 from the value of Reg(1,1) and determine that the weight elements w02 and w03 are both zero-valued weights. As shown in FIG. 6, MAC(1,1) reads only the weight elements w00 and w01, not w02 and w03; meanwhile, MAC(1,1) also reads the input elements (a00, a01, a02, a03) of the first input plane 310 and performs the multiply-add operation on the input elements (a00, a01) corresponding to the non-zero weights (w00, w01). The computation result of MAC(1,1) is Y(1,1) = a00×w00 + a01×w01. The first output plane Y1 is the sum of the computation results of the sub-multiply-accumulate arrays MAC(1,1), MAC(2,1), MAC(3,1), and MAC(4,1).
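Continuing the example, the first output plane accumulates the zero-skipped partial sums of the four sub-arrays in the first column. The sketch below (illustrative values only, reusing the mac_with_zero_skip model defined above) shows this column-wise accumulation:

```python
# Per-row weight planes of the first sub-convolution kernel and their register masks.
planes = [[2, 3, 0, 0], [0, 1, 4, 0], [5, 0, 0, 6], [1, 1, 1, 1]]
masks  = [0b0011, 0b0110, 0b1001, 0b1111]
inputs = [[5, 7, 9, 11]] * 4         # input elements a00..a03 of each input plane

# Y1 = sum over the column of per-row zero-skipped multiply-accumulates.
Y1 = sum(mac_with_zero_skip(m, w, a) for m, w, a in zip(masks, planes, inputs))
print(Y1)  # 31 + 43 + 91 + 32 = 197
```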
As shown in FIG. 7 and FIG. 8, in one embodiment, the storage module 120 includes a plurality of registers 121, the number of registers 121 being less than the number of sub-multiply-accumulate arrays 131. If at least two weight planes 211 of the sub-convolution kernels 210 have the same weight type, the processing unit 110 is further configured to configure the same register 121 for the same sub-multiply-accumulate array group, where the weight planes 211 received by the same sub-multiply-accumulate array group have the same weight type and the same sub-multiply-accumulate array group includes at least two sub-multiply-accumulate arrays 131. That is, at least two sub-multiply-accumulate arrays 131 can be connected to the same register 121, realizing register sharing.
Referring still to FIG. 7, illustratively, if the first weight planes 211 of all the sub-convolution kernels 210 have the same weight type, the processing unit 110 can configure the value of a first register 121 according to the weight type of the first weight plane 211, and this first register 121 can be connected to the sub-multiply-accumulate arrays MAC(1,j) in the first row. That is, the sub-multiply-accumulate arrays MAC(1,j) of the first row form a sub-multiply-accumulate array group.
Referring still to FIG. 8, if all the weight planes 211 of the first sub-convolution kernel 210 have the same weight type, the processing unit 110 can configure the value of a second register 121 according to the weight type of those weight planes 211, and this second register 121 can be connected to the sub-multiply-accumulate arrays MAC(i,1) in the first column. That is, the sub-multiply-accumulate arrays MAC(i,1) of the first column form a sub-multiply-accumulate array group.
In one embodiment, the processor can also control the convolutional neural network to train the convolution kernel data so as to generate an original convolution kernel 20 and sub-convolution kernels 210 having preset weight types. Illustratively, the convolution kernel data can be trained into a plurality of sub-convolution kernels 210 in which all the weight planes 211 of each sub-convolution kernel 210 have the same weight type, and the processing unit 110 can configure one register 121 for the sub-multiply-accumulate arrays 131 of the same group, where the group includes the sub-multiply-accumulate arrays 131 in the same column; that is, the sub-multiply-accumulate arrays 131 in the same column are all connected to the same register 121. The processing unit 110 can configure that register 121 according to the weight type of each sub-convolution kernel 210, so that the sub-multiply-accumulate arrays 131 in the same column can all respond to the value of that register 121 and perform zero-skipping on the weight elements of the received weight planes 211.
Illustratively, the convolution kernel data can also be trained into a plurality of sub-convolution kernels 210 in which the k-th weight planes 211 of all the sub-convolution kernels 210 have the same weight type, and the processing unit 110 can configure one register 121 for the sub-multiply-accumulate arrays 131 of the same group, where the group includes the sub-multiply-accumulate arrays 131 in the same row; that is, the sub-multiply-accumulate arrays 131 in the same row are all connected to the same register 121. The processing unit 110 can configure that register 121 according to the weight type of the k-th weight plane 211, so that the sub-multiply-accumulate arrays 131 in the same row can all respond to the value of that register 121 and perform zero-skipping on the weight elements of the received weight planes 211.
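The row-wise and column-wise register sharing can be modeled as a simple index mapping (a sketch; the mode names per_row and per_column and the function name are assumptions for illustration):

```python
def shared_register_index(i, j, mode):
    """Map sub-array MAC(i, j) to its shared register index.

    mode "per_row": all sub-arrays in row i share one register (FIG. 7 case).
    mode "per_column": all sub-arrays in column j share one register (FIG. 8 case).
    """
    if mode == "per_row":
        return i
    if mode == "per_column":
        return j
    raise ValueError("unknown sharing mode")

# With per-row sharing, MAC(1,1)..MAC(1,4) all read register 1.
print([shared_register_index(1, j, "per_row") for j in range(1, 5)])  # [1, 1, 1, 1]
```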
In the embodiments of the present application, the values of the registers 121 can be configured based on the weight types of the weight planes 211 of the trained sub-convolution kernels 210, and at least two sub-multiply-accumulate arrays 131 that can each receive weight planes of the same weight type can be connected to the same register 121 to realize register sharing. In this way, the data movement of zero-valued weight elements in the weight planes 211 is reduced and power consumption is lowered, while the structure of the convolution operation circuit is further simplified and costs are saved.
In one embodiment, when the data to be convolved 30 changes, the processing unit 110 is further configured to update the weight types of the weight planes 211 of the sub-convolution kernels 210 so as to reconfigure the registers 121. The data to be convolved 30 may be a part of the data to be processed. Illustratively, if the data to be processed is an image to be processed, the data to be convolved 30 may be at least some image blocks of the image; the current image to be processed may be a foreground image block and the next may be a background image block. When the data to be convolved 30 switches from a foreground image block to a background image block, or vice versa, the original convolution kernel 20 changes accordingly. At this time, the processing unit 110 can correspondingly obtain the weight types of the weight planes 211 of the new sub-convolution kernels 210 and reconfigure the values of the registers 121 according to the new weight types; each sub-multiply-accumulate array 131 can respond to the value of the register 121 connected to it, perform zero-skipping on the weight elements of the weight planes 211 of the new sub-convolution kernels 210, and read the non-zero weights to perform multiply-accumulate operations with the new image block.
In this embodiment, when the data to be convolved 30 changes, the processing unit 110 can also adaptively adjust the original convolution kernel 20 data, obtain the weight types of the weight planes 211 of the new sub-convolution kernels 210, and reconfigure the values of the registers 121 according to the new weight types; each sub-multiply-accumulate array 131 can respond to the value of the register 121 connected to it, perform zero-skipping on the weight elements of the weight planes 211 of the new sub-convolution kernels 210, and read the non-zero weights to perform multiply-accumulate operations with the new image block. The circuit is thus applicable to feature extraction of multiple groups of different image blocks, and during the convolution operation the data movement of zero-valued weight elements in the weight planes 211 is reduced and power consumption is lowered.
In one embodiment, as shown in FIG. 9, a convolution operation method is provided, which can be applied to the convolution operation circuit of any of the above embodiments. The convolution operation method includes steps 902 to 906.
Step 902: split the original convolution kernel into a plurality of sub-convolution kernels, and obtain the weight type of the weight plane corresponding to each channel of each sub-convolution kernel; the weight type is used to characterize the distribution pattern of zero-valued weights in each weight plane.
The original convolution kernel can be understood as the convolution kernel used for the convolution operation with the data to be convolved, and a sub-convolution kernel can be understood as a convolution kernel obtained by splitting the original convolution kernel. In an example, a sub-convolution kernel may include a plurality of channels, each channel corresponding to one weight plane. Each sub-convolution kernel may include a plurality of weight planes, and each weight plane may include a plurality of weight elements. The weight elements within a weight plane may be stored in planar form in the storage module. If the weight value of a weight element is 0, the weight element is a zero-valued weight; if the weight value is non-zero, the weight element is a non-zero weight. The processing unit can identify the zero-valued weights and non-zero weights among the weight elements within a weight plane and correspondingly obtain the weight type of the weight plane according to the identification result, where the weight type is used to characterize the distribution pattern of zero-valued weights in each weight plane.
Step 904: configure the configuration information of the corresponding register in the storage module according to the weight type, so that the register correspondingly stores the weight type of each weight plane of the sub-convolution kernels.
The convolution operation circuit can configure the corresponding registers according to the weight types of the weight planes. After configuration, each register has corresponding configuration information.
Step 906: control the multiply-accumulate array to perform, in response to the configuration information of the registers, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
The data to be convolved can be understood as an input feature map. The data to be convolved may also include a plurality of channels, each channel corresponding to one input plane, i.e., one two-dimensional image. When the number of channels of the input feature map is greater than 1, the input feature map can be understood as a three-dimensional feature map in which the two-dimensional images of the channels are stacked together, with a depth equal to the number of channels. The number of channels of the data to be convolved equals the number of channels of a sub-convolution kernel.
During the convolution operation, the multiply-accumulate array is controlled to, in response to the configuration information of the registers, determine according to the configuration information of each register whether the weight elements within each weight plane are zero, and according to the determination result, read only the non-zero weights in each weight plane of each sub-convolution kernel without reading the zero-valued weights, and then perform multiply-accumulate processing on the read non-zero weights and the corresponding input elements of the input feature map.
The above convolution operation method includes splitting the original convolution kernel into a plurality of sub-convolution kernels and obtaining the weight type of the weight plane corresponding to each channel of each sub-convolution kernel; configuring the configuration information of the corresponding registers in the storage module according to the weight types, so that the registers correspondingly store the weight type of each weight plane of the sub-convolution kernels; and controlling the multiply-accumulate array to perform, in response to the configuration information of the registers, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved. In this way, during the convolution operation, the multiply-accumulate array can, in response to the configuration information of each register, read only the non-zero weights in each weight plane of each sub-convolution kernel without reading the zero-valued weights, and perform multiply-accumulate processing on the read non-zero weights and the input feature map. This reduces the data movement of zero-valued weight elements in the weight planes, realizes zero-skipping, reduces power consumption, and improves the efficiency of the convolution operation.
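As an end-to-end illustration of steps 902 to 906, the following driver composes the illustrative helpers defined in the earlier sketches (split_kernel, weight_type, and mac_with_zero_skip); it models the method's data flow for a single window position and is not a description of the hardware:

```python
import numpy as np

def convolve_with_zero_skip(kernel, inputs, C):
    """Steps 902-906 in miniature: split, classify, configure, then zero-skipped MAC."""
    outputs = []
    for sub in split_kernel(kernel, C):                   # step 902: split the kernel
        acc = 0.0
        for ch in range(C):
            plane = sub[:, :, ch].ravel()
            reg = weight_type(plane)                      # steps 902/904: classify and configure
            acc += mac_with_zero_skip(reg, plane,         # step 906: zero-skipped MAC
                                      inputs[:, :, ch].ravel())
        outputs.append(acc)
    return outputs  # one accumulated value per sub-kernel at this window position

kernel = np.random.randn(3, 3, 8).round()  # rounding creates exact zeros to skip
window = np.random.randn(3, 3, 4)          # one 3x3 input window, C = 4 channels
print(convolve_with_zero_skip(kernel, window, C=4))
```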
It should be noted that the convolution operation method of the embodiments of the present application can be applied in a variety of scenarios, for example, image recognition fields such as face recognition and license plate recognition, feature-extraction fields such as image feature extraction and speech feature extraction, the speech recognition field, the natural language processing field, and so on. An image, or an image converted from data in other forms, is input into a pre-trained convolutional neural network, and the convolutional neural network can then be used for computation to achieve classification, recognition, or feature extraction.
As shown in FIG. 10, in one embodiment, the multiply-accumulate array includes m rows and n columns of sub-multiply-accumulate arrays, the storage module includes a plurality of registers, and each sub-multiply-accumulate array is connected to one register. Controlling the multiply-accumulate array to perform, in response to the configuration information of the registers, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved includes:
Step 1002: control the sub-multiply-accumulate array to perform zero-skipping on the zero-valued weights of the weight plane according to the configuration information of the register, so as to read the non-zero weights of the weight plane.
Step 1004: perform multiply-accumulate processing on the read non-zero weights and the data to be convolved.
During the convolution operation, the multiply-accumulate array can be controlled to, in response to the configuration information of the registers, determine according to the configuration information of each register whether the weight elements within each weight plane are zero, and according to the determination result, read only the non-zero weights in each weight plane of each sub-convolution kernel without reading the zero-valued weights, and then perform multiply-accumulate processing on the read non-zero weights and the corresponding input elements of the input feature map. Illustratively, if the elements of a certain weight plane are (w00, w01, 0, 0), the multiply-accumulate array can, in response to the configuration information of the register storing the weight type of this weight plane, read only the weight elements w00 and w01 without reading the zero-valued weights at the positions of w02 and w03, and then perform the multiply-add operation based on the read weight elements w00 and w01 and the corresponding input elements in the input plane of the input feature map.
In one embodiment, before step 1002, the method further includes step 1000: configure the zero-skipping work mode of the sub-multiply-accumulate array according to the configuration information of the register.
The zero-skipping work mode can be understood as follows: during the multiply-add computation, the sub-multiply-accumulate array does not read the zero-valued weights in the weight plane and directly skips the multiplications involving zero-valued weights. The zero-skipping work mode corresponds to the zero-valued weights of the weight plane. Illustratively, if the weight type is the first weight type, the sub-multiply-accumulate array reads only the non-zero weights at the first and second positions and performs multiply-accumulate processing only on those non-zero weights and the corresponding data to be convolved. The zero-skipping work mode configured for a sub-multiply-accumulate array has a mapping relationship with the configuration information of the register connected to that sub-multiply-accumulate array, so that the sub-multiply-accumulate array can, in response to the configuration information of the register, work in the corresponding zero-skipping work mode and perform zero-skipping on the obtained weight elements of the weight plane, so as to read the non-zero weights of the weight plane and then perform multiply-accumulate processing on the non-zero weights and the input element block.
In one embodiment, obtaining the weight type of the weight plane corresponding to each channel of each sub-convolution kernel includes: obtaining the position information of the weight elements in each weight plane of each sub-convolution kernel, and determining the weight type of the weight plane according to the position information.
For ease of description, the embodiments of the present application take a weight plane including four weight elements w00, w01, w02, w03 as an example. Illustratively, if the weight elements of the weight plane are (1, 3, 0, 0), the zero-valued weights are at the third and fourth positions, and the weight type of the weight plane can be determined to be a first weight type. If the weight elements are (1, 0, 5, 0), the zero-valued weights are at the second and fourth positions, and the weight type can be determined to be a second weight type. If the weight elements are (0, 0, 5, 2), the zero-valued weights are at the first and second positions, and the weight type can be determined to be a third weight type. If the weight elements are (0, 3, 0, 8), the zero-valued weights are at the first and third positions, and the weight type can be determined to be a fourth weight type, and so on. It should be noted that the determination of the weight type of a weight plane is associated with the positions and the number of zero-valued weights and is not limited to the above examples. The weight type of a weight plane can also be determined according to the number of weight elements in the weight plane and the distribution pattern of zero-valued weights among them; for example, when a weight plane includes nine weight elements w00, w01, w02, ..., w08, its weight type can likewise be determined based on the number and position information of the zero-valued weights in the weight plane.
As shown in FIG. 11, in one embodiment, the convolution operation method includes steps 1102 to 1110.
Step 1102: split the original convolution kernel into a plurality of sub-convolution kernels, and obtain the weight type of the weight plane corresponding to each channel of each sub-convolution kernel; the weight type is used to characterize the distribution pattern of zero-valued weights in each weight plane.
Step 1104: configure the configuration information of the corresponding register in the storage module according to the weight type, so that the register correspondingly stores the weight type of each weight plane of the sub-convolution kernels.
Step 1106: control the multiply-accumulate array to perform, in response to the configuration information of the registers, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved.
Steps 1102 to 1106 correspond one-to-one to steps 902 to 906 of the foregoing embodiment, so the details of steps 1102 to 1106 are not repeated here.
Step 1108: when the data to be convolved changes, determine whether the weight types of the weight planes of the sub-convolution kernels need to be updated.
Step 1110: if the weight types of the weight planes of the sub-convolution kernels need to be updated, reconfigure the registers according to the updated weight types.
The data to be convolved may be a part of the data to be processed. Illustratively, if the data to be processed is an image to be processed, the data to be convolved may be at least some image blocks of the image; the current image to be processed may be a foreground image block and the next may be a background image block. When the data to be convolved switches from a foreground image block to a background image block, or vice versa, the original convolution kernel changes accordingly. At this time, the weight types of the weight planes of the new sub-convolution kernels can be correspondingly obtained and the values of the registers reconfigured according to the new weight types; step 1106 can then be executed again to control each sub-multiply-accumulate array to respond to the value of the register connected to it, perform zero-skipping on the weight elements of the weight planes of the new sub-convolution kernels, and read the non-zero weights to perform multiply-accumulate operations with the new image block.
With the convolution operation method of this embodiment, when the data to be convolved changes, the original convolution kernel data can also be adaptively adjusted, the weight types of the weight planes of the new sub-convolution kernels can be obtained, and the values of the registers can be reconfigured according to the new weight types; each sub-multiply-accumulate array is controlled to respond to the value of the register connected to it, perform zero-skipping on the weight elements of the weight planes of the new sub-convolution kernels, and read the non-zero weights to perform multiply-accumulate operations with the new image block. The method is thus applicable to feature extraction of multiple groups of different image blocks, and during the convolution operation the data movement of zero-valued weight elements in the weight planes is reduced and power consumption is lowered.
It should be understood that although the steps in the flowcharts of FIG. 9 to FIG. 11 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they may be executed in other orders. Moreover, at least some of the steps in FIG. 9 to FIG. 11 may include multiple sub-steps or stages; these sub-steps or stages are not necessarily completed at the same moment but may be executed at different moments, and their execution order is not necessarily sequential but may be alternated or interleaved with at least part of other steps or of the sub-steps or stages of other steps.
In one embodiment, a neural network processor is provided, including a data storage module and the convolution operation circuit of any of the foregoing embodiments. The data storage module stores the original convolution kernel and the data to be convolved, and the convolution operation circuit obtains the original convolution kernel and the data to be convolved through the data storage module.
As shown in FIG. 12, in one embodiment, a neural network accelerator is provided, including a data storage module 40 and the convolution operation circuit 10 of any of the foregoing embodiments. The data storage module 40 stores the original convolution kernel and the data to be convolved, and the convolution operation circuit 10 obtains the original convolution kernel and the data to be convolved through the data storage module 40. The data stored by the data storage module 40 may also be processing results; that is, the data stored by the data storage module 40 may be data obtained after the processing unit processes the data to be convolved. It should be noted that the data actually stored by the data storage module 40 is not limited to this, and the data storage module 40 may also store other data.
During the convolution processing of the data to be convolved by the convolution operation circuit 10, its multiply-accumulate array can, in response to the configuration information of the registers, read only the non-zero weights in each weight plane of each sub-convolution kernel according to the configuration information of each register without reading the zero-valued weights, and perform multiply-accumulate processing on the read non-zero weights and the input feature map. This reduces the data movement of zero-valued weight elements in the weight planes, which not only saves storage space in the data storage module 40 but also reduces accesses to the data storage module 40, lowering power consumption and improving the computational efficiency of the neural network processor or neural network accelerator.
The convolution operation circuit of any of the above embodiments can also be applied to a neural network processor having a MAC array. By lowering power consumption, the neural network processor can work in always-on mode, thereby meeting the design requirement that the overall current of the neural network processor be less than 5 mA.
In one embodiment, the convolution operation circuit of any of the above embodiments can also be applied to any neural network accelerator that takes a matrix operation unit as its basic unit, such as for convolution, matrix multiplication, and the like.
In one embodiment, the convolution operation circuit of any of the above embodiments can also be applied to a neural network accelerator having a systolic array.
As shown in FIG. 13, in one embodiment, an electronic device 100 is provided, including a system bus and the neural network accelerator or neural network processor of any of the foregoing embodiments, in which the data storage module 40 and the convolution operation circuit 10 are respectively connected to the system bus. It should be noted that the neural network processor or neural network accelerator of the embodiments of the present application may also be integrated with other processors, memories, and the like in a single chip.
The electronic device also includes a central processing unit 50 and an external memory 60 connected through the system bus. The central processing unit 50 provides computing and control capabilities and supports the operation of the entire electronic device. The external memory 60 may include a non-volatile storage medium and internal memory; the non-volatile storage medium stores an operating system and a computer program. The computer program can be executed by a processor to implement the convolution operation method provided in the foregoing embodiments. The electronic device may be any terminal device such as a mobile phone, tablet computer, PDA (Personal Digital Assistant), POS (Point of Sales) terminal, vehicle-mounted computer, or wearable device.
In one embodiment, an electronic device is provided, including a memory and a processor, the memory storing a computer program; the processor implements the convolution operation method of any of the above embodiments when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the convolution operation method of any of the above embodiments.
In one embodiment, a computer program product containing instructions is provided; when run on a computer, it causes the computer to execute the convolution operation method of any of the above embodiments.
Those of ordinary skill in the art can understand that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any reference to a memory, storage, database, or other medium used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical memory. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
In the description of this specification, descriptions referring to the terms "some embodiments", "other embodiments", "ideal embodiments", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, the schematic descriptions of the above terms do not necessarily refer to the same embodiment or example.
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features involve no contradiction, they should all be considered within the scope of this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. A convolution operation circuit, comprising:
    a storage module including at least one register;
    a processing unit configured to split an original convolution kernel into at least one sub-convolution kernel, obtain a weight type of a weight plane corresponding to each channel of each sub-convolution kernel, and configure the corresponding register according to the weight type, the weight type being used to characterize a distribution pattern of zero-valued weights in each weight plane; and
    a multiply-accumulate array, connected to the processing unit and the register respectively, and configured to perform, in response to configuration information of the configured register, multiply-accumulate processing on non-zero weights in each weight plane of the sub-convolution kernel and data to be convolved.
  2. The convolution operation circuit according to claim 1, wherein the multiply-accumulate array comprises m rows and n columns of sub-multiply-accumulate arrays, m and n both being positive integers greater than or equal to 1;
    the storage module comprises a plurality of registers, and the processing unit is further configured to configure the configuration information of the corresponding registers according to the weight types of the weight planes of the sub-convolution kernels; wherein
    each sub-multiply-accumulate array is connected to one register, and the configuration information of the register connected to a sub-multiply-accumulate array is used to characterize the weight type of the weight plane received by that sub-multiply-accumulate array; and
    the sub-multiply-accumulate array is configured to perform zero-skipping on the weights of the weight plane according to the configuration information of the register, so as to read the non-zero weights of the weight plane, and to perform multiply-accumulate processing on the non-zero weights and the data to be convolved.
  3. The convolution operation circuit according to claim 2, wherein the sub-multiply-accumulate array in the i-th row and j-th column is configured to respectively receive the weight elements of the i-th weight plane of the j-th sub-convolution kernel and the input elements of the i-th input plane of the data to be convolved, read the non-zero weights of the i-th weight plane according to the register connected to the sub-multiply-accumulate array, and perform multiply-accumulate processing on the non-zero weights and the input elements of the i-th input plane, where 1≤i≤m and 1≤j≤n.
  4. The convolution operation circuit according to claim 3, wherein the storage module comprises a plurality of registers, and if at least two weight planes of the sub-convolution kernels have the same weight type, the processing unit is further configured to configure the same register for a same sub-multiply-accumulate array group, wherein the weight planes received by the same sub-multiply-accumulate array group have the same weight type, and the same sub-multiply-accumulate array group includes at least two sub-multiply-accumulate arrays.
  5. The convolution operation circuit according to claim 4, wherein the same sub-multiply-accumulate array group includes the sub-multiply-accumulate arrays located in the same row.
  6. The convolution operation circuit according to claim 4, wherein the same sub-multiply-accumulate array group includes the sub-multiply-accumulate arrays located in the same column.
  7. The convolution operation circuit according to claim 2, wherein the storage module comprises C rows and K columns of registers, m is less than or equal to the number of channels C of the sub-convolution kernels, n is less than or equal to the number K of sub-convolution kernels, and each sub-multiply-accumulate array is configured with an independent register, wherein the configuration information of the register connected to the sub-multiply-accumulate array in the i-th row and j-th column is used to characterize the weight type of the i-th weight plane of the j-th sub-convolution kernel, where 1≤i≤m and 1≤j≤n.
  8. The convolution operation circuit according to claim 2, wherein the processing unit is further configured to obtain position information of the weight elements in each weight plane of each sub-convolution kernel and determine the weight type of the weight plane according to the position information.
  9. The convolution operation circuit according to claim 2, wherein the processing unit is further configured to configure a zero-skipping work mode of the sub-multiply-accumulate array according to the configuration information of the register, so that the sub-multiply-accumulate array is configured to perform, in response to the configuration information of the register, zero-skipping on the weight elements of the weight plane, so as to read the non-zero weights of the weight plane and perform multiply-accumulate processing on the non-zero weights and the data to be convolved.
  10. The convolution operation circuit according to claim 1, wherein when the data to be convolved changes, the processing unit is further configured to update the weight types of the weight planes of the sub-convolution kernels so as to reconfigure the configuration information of the registers.
  11. A convolution operation method, comprising:
    splitting an original convolution kernel into a plurality of sub-convolution kernels, and obtaining a weight type of a weight plane corresponding to each channel of each sub-convolution kernel, the weight type being used to characterize a distribution pattern of zero-valued weights in each weight plane;
    configuring configuration information of a corresponding register in a storage module according to the weight type, so that the register correspondingly stores the weight type of each weight plane of the sub-convolution kernels; and
    controlling a multiply-accumulate array to perform, in response to the configuration information of the register, multiply-accumulate processing on non-zero weights in each weight plane of the sub-convolution kernels and data to be convolved.
  12. The method according to claim 11, wherein the multiply-accumulate array comprises m rows and n columns of sub-multiply-accumulate arrays, the storage module comprises a plurality of registers, and each sub-multiply-accumulate array is connected to one register; wherein controlling the multiply-accumulate array to perform, in response to the configuration information of the register, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved comprises:
    controlling the sub-multiply-accumulate array to perform zero-skipping on the zero-valued weights of the weight plane according to the configuration information of the register, so as to read the non-zero weights of the weight plane, wherein the configuration information of the register connected to the sub-multiply-accumulate array is used to characterize the weight type of the weight plane received by the sub-multiply-accumulate array; and
    performing multiply-accumulate processing on the read non-zero weights and the data to be convolved.
  13. The method according to claim 12, wherein before controlling the multiply-accumulate array to perform, in response to the configuration information of the register, multiply-accumulate processing on the non-zero weights in each weight plane of the sub-convolution kernels and the data to be convolved, the method further comprises:
    configuring a zero-skipping work mode of the sub-multiply-accumulate array according to the configuration information of the register.
  14. The method according to claim 11, wherein obtaining the weight type of the weight plane corresponding to each channel of each sub-convolution kernel comprises:
    obtaining position information of the weight elements in each weight plane of each sub-convolution kernel; and
    determining the weight type of the weight plane according to the position information.
  15. The method according to any one of claims 11 to 14, further comprising:
    when the data to be convolved changes, determining whether the weight types of the weight planes of the sub-convolution kernels need to be updated; and
    if the weight types of the weight planes of the sub-convolution kernels need to be updated, reconfiguring the register according to the updated weight types.
  16. A neural network accelerator, comprising:
    a data storage module configured to store an original convolution kernel and input element blocks of data to be convolved; and
    the convolution operation circuit according to any one of claims 1 to 10, the convolution operation circuit obtaining the original convolution kernel and the input element blocks through the data storage module.
  17. An electronic device, comprising:
    a system bus; and
    the neural network accelerator according to claim 16, the neural network accelerator being connected to the system bus.
  18. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 11 to 15 when executing the computer program.
  19. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 11 to 15.
  20. A computer program product containing instructions which, when the computer program is run on a computer, cause the computer to execute the steps of the method according to any one of claims 11 to 15.
PCT/CN2022/113849 2021-09-03 2022-08-22 Convolution operation circuit and method, neural network accelerator and electronic device WO2023030061A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111030795.7A CN115759212A (zh) 2021-09-03 2021-09-03 Convolution operation circuit and method, neural network accelerator and electronic device
CN202111030795.7 2021-09-03

Publications (1)

Publication Number Publication Date
WO2023030061A1 true WO2023030061A1 (zh) 2023-03-09

Family

ID=85332904

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/113849 WO2023030061A1 (zh) 2021-09-03 2022-08-22 Convolution operation circuit and method, neural network accelerator and electronic device

Country Status (2)

Country Link
CN (1) CN115759212A (zh)
WO (1) WO2023030061A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116861973B (zh) * 2023-09-05 2023-12-15 Shenzhen MicroBT Electronics Technology Co., Ltd. Improved circuit, chip, device and method for convolution operation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200342294A1 (en) * 2019-04-26 2020-10-29 SK Hynix Inc. Neural network accelerating apparatus and operating method thereof
CN112633484A (zh) * 2019-09-24 2021-04-09 中兴通讯股份有限公司 神经网络加速器、卷积运算实现方法、装置及存储介质
CN112465110A (zh) * 2020-11-16 2021-03-09 中国电子科技集团公司第五十二研究所 一种卷积神经网络计算优化的硬件加速装置
CN113688976A (zh) * 2021-08-26 2021-11-23 哲库科技(上海)有限公司 一种神经网络加速方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN115759212A (zh) 2023-03-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863200

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE