US11468301B2 - Method and apparatus for performing operation of convolutional layer in convolutional neural network - Google Patents


Info

Publication number
US11468301B2
Authority
US
United States
Prior art keywords
folded
dimension
convolution
data
convolution kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US16/203,017
Other languages
English (en)
Other versions
US20190164045A1 (en)
Inventor
Delin Li
Kun Ling
Liang Chen
Jianjun Li
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Horizon Robotics Technology Co Ltd
Original Assignee
Nanjing Horizon Robotics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Horizon Robotics Technology Co Ltd filed Critical Nanjing Horizon Robotics Technology Co Ltd
Assigned to NANJING HORIZON ROBOTICS TECHNOLOGY CO., LTD. reassignment NANJING HORIZON ROBOTICS TECHNOLOGY CO., LTD. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: CHEN, LIANG, LI, DELIN, LI, JIANJUN, LING, Kun
Publication of US20190164045A1 publication Critical patent/US20190164045A1/en
Application granted granted Critical
Publication of US11468301B2 publication Critical patent/US11468301B2/en
Active legal-status Critical Current
Adjusted expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003Arrangements for executing specific machine instructions
    • G06F9/30007Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001Arithmetic instructions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0454
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C11/00Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor
    • G11C11/21Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements
    • G11C11/34Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices
    • G11C11/40Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors
    • G11C11/41Digital stores characterised by the use of particular electric or magnetic storage elements; Storage elements therefor using electric elements using semiconductor devices using transistors forming static cells with positive feedback, i.e. cells not needing refreshing or charge regeneration, e.g. bistable multivibrator or Schmitt trigger
    • G11C11/413Auxiliary circuits, e.g. for addressing, decoding, driving, writing, sensing, timing or power reduction

Definitions

  • the present disclosure relates generally to the technical field of convolutional neural networks, and particularly to a method and an apparatus for performing an operation of a convolutional layer in a convolutional neural network.
  • Deep learning technology based on a convolutional neural network has been widely applied to various fields such as image recognition, video analysis, natural language processing, auxiliary driving, and the like.
  • the amount of operations in a convolutional neural network is usually very large. It is expected that the operations in the convolutional neural network can be performed efficiently by hardware such as a general-purpose Central Processor (CPU) or Graphics Processor (GPU), or by a dedicated accelerator.
  • a method for performing an operation of a convolutional layer in a convolutional neural network may comprise padding unfolded-feature-data (or unfolded feature data) provided to the convolution layer according to a padding mode specified by the convolution layer, folding the padded unfolded-feature-data in at least one dimension of width and height so as to generate folded feature data, folding an original convolution kernel of the convolution layer in the at least one dimension so as to generate one or more folded convolution kernels corresponding to the original convolution kernel, and performing a convolution operation on the folded feature data by using the one or more folded convolution kernels.
  • the apparatus may comprise one or more processors configured to perform the above method.
  • the apparatus may comprise a pre-processing unit configured to pad unfolded-feature-data provided to the convolution layer according to a padding mode specified by the convolution layer, a first folding unit configured to fold the padded unfolded-feature-data in at least one dimension of width and height so as to generate folded feature data, a second folding unit configured to fold an original convolution kernel of the convolution layer in the at least one dimension so as to generate one or more folded convolution kernels corresponding to the original convolution kernel, and an arithmetic unit configured to perform a convolution operation on the folded feature data by using the one or more folded convolution kernels.
  • a non-temporary storage medium may have program instructions stored thereon for performing the above method when executed by a computing apparatus.
  • channel utilization may be improved, buffer footprint may be reduced, and operational efficiency may be improved.
  • FIG. 1 shows a flow chart of a method for performing an operation of a convolutional layer in a convolutional neural network according to an embodiment of the present disclosure.
  • FIG. 2 shows an example of folding unfolded-feature-data according to an embodiment of the present disclosure.
  • FIG. 3 shows an example of folding an original convolution kernel according to an embodiment of the present disclosure.
  • FIG. 4 shows an example of performing a convolution operation on folded feature data by using the folded convolution kernels according to an embodiment of the present disclosure.
  • FIG. 5 shows an example of performing a convolution operation on folded feature data by using the folded convolution kernels according to an embodiment of the present disclosure.
  • FIG. 6 shows a block diagram of an apparatus for performing an operation of a convolutional layer in a convolutional neural network according to an embodiment of the present disclosure.
  • FIG. 7 shows a block diagram of an apparatus for performing an operation of a convolutional layer in a convolutional neural network according to an embodiment of the present disclosure.
  • FIG. 8 shows an example of a device for performing a convolution operation on folded feature data according to an embodiment of the present disclosure.
  • FIGS. 9A and 9B show examples of how the feature data are stored in a static random access memory.
  • a feature data provided to a convolutional neural network may be regarded as a data cube, and may have a plurality of dimensions such as width, height, depth (i.e., different channels), and the like, wherein each data in the feature data may correspond to one point in the data cube, respectively. Accordingly, each convolution kernel of a weight parameter for a convolution operation in a convolutional neural network may also be regarded as a data cube.
  • a slice of the data cube in the first dimension corresponding to the dimension represented by the X-axis represents a result obtained by sampling the data in the data cube through using a plane orthogonal to the X-axis, which is a rectangular data on a two-dimensional plane represented by the Y-axis and the Z-axis.
  • that is, the slice at $x = i$ is $\{(y, z) \mid x = i,\ y \in [0, H),\ z \in [0, D)\}$, $i \in [0, W)$.
  • a slice with each contained data having a value of zero (or a value being equivalent to zero) may be called as a zero slice.
  • the term “slice” is also used herein for convenience of description when describing feature data or data of a convolution kernel in a certain dimension, for example, a slice in the dimension of width (called herein “a width slice” for short), a slice in the dimension of height (called herein “a height slice” for short), and so on.
  • a slice may include a plurality of pixels.
  • Padding or appending one or more zero slices in the first dimension (such as a dimension of width) of the data cube A may mean herein increasing a dimension value (such as width) of the first dimension of A by adding one or more zero slices at a certain boundary (for example, on left side or right side in width) in the first dimension of A, wherein each added zero slice has the same dimension values (for example, height value and depth value, respectively) as the original A in the other two dimensions (for example, the two dimensions of height and depth), respectively.
  • Padding or appending one or more zero slices in both the first dimension and the second dimension (such as both the dimensions of width and height) of the data cube A may mean herein increasing the dimension value (e.g., width) of the first dimension of A by adding one or more zero slices at a certain boundary (e.g., left or right in width) in the first dimension of A, each added zero slice having the same dimension values (e.g., height value and depth value) as the original A in the other two dimensions (e.g., the dimensions of height and depth), and then adding one or more zero slices at a certain boundary (e.g., the upper side or lower side in height) in the second dimension of a data cube A′ obtained after increasing the width, so as to increase the dimension value (e.g., height) of the second dimension of A′, each added zero slice having the same dimension values (e.g., width value and depth value) as A′ in the other two dimensions (e.g., the dimensions of width and depth).
  • Aligning each slice of the data cube A in depth may mean herein padding zero (or a value equivalent to zero) in depth for a slice of A without an expected depth value (which may be either a width slice or a height slice), so that each slice of A after the padding has the expected depth value.
  • Padding in the first dimension and/or the second dimension of the data cube A means herein that the number of padded zero slices may be zero or one or more, unless otherwise specified.
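  • As an illustrative sketch only, the padding of zero slices described above can be written in plain Python. The layout `cube[x][y][z]` (width, height, depth) and the helper names `pad_width` and `pad_height` are assumptions made for this example and do not come from the disclosure.

```python
def _zero_width_slice(height, depth):
    """One width slice (height x depth) filled with zeros."""
    return [[0] * depth for _ in range(height)]

def pad_width(cube, left, right):
    """Add `left`/`right` zero slices at the width boundaries of cube[x][y][z]."""
    height, depth = len(cube[0]), len(cube[0][0])
    kept = [[row[:] for row in width_slice] for width_slice in cube]
    return ([_zero_width_slice(height, depth) for _ in range(left)]
            + kept
            + [_zero_width_slice(height, depth) for _ in range(right)])

def pad_height(cube, top, bottom):
    """Add `top`/`bottom` zero slices at the height boundaries of cube[x][y][z]."""
    depth = len(cube[0][0])
    return [[[0] * depth for _ in range(top)]
            + [row[:] for row in width_slice]
            + [[0] * depth for _ in range(bottom)]
            for width_slice in cube]

# A cube with width 2, height 1, depth 3.
A = [[[1, 2, 3]], [[4, 5, 6]]]
A_prime = pad_width(A, left=1, right=0)          # width 2 -> 3
A_padded = pad_height(A_prime, top=0, bottom=1)  # height 1 -> 2
```

    Each added zero slice takes its height and depth (or width and depth) from the cube being padded, which mirrors the requirement that an added slice match the other two dimensions.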
  • the operation amount in the convolutional neural network is usually high, and it is expected that operations in a convolutional neural network can be performed efficiently by using hardware such as a universal Central Processor and Graphics Processor or a dedicated accelerator.
  • a memory supporting multiple channels may be designed for providing data to the adders and/or multipliers performing the convolution operation, or an arithmetic unit may be designed to support operations on multiple channels.
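  • the number of channels of a feature data provided to an input layer of the convolutional neural network may be small (usually 3 channels or 1 channel), and the number of channels of an input feature data of a convolutional layer near the front in the feed-forward inference direction of the convolutional neural network may also be small.
  • FIG. 1 shows an example method 100 for performing an operation of a convolutional layer in a convolutional neural network according to an embodiment of the present disclosure, and the method may include a step S101 of padding the unfolded-feature-data provided to the convolutional layer, a step S105 of folding the padded unfolded-feature-data in at least one dimension of width and height, a step S110 of folding the original convolution kernel of the convolutional layer in the at least one dimension, and a step S115 of performing a convolution operation on the folded feature data by using the one or more folded convolution kernels.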
  • the hardware design may be simplified, the utilization of channel or hardware resources may be improved, and/or the parallelism of operations may be improved.
  • an example method 100 for performing an operation of a convolutional layer in a convolutional neural network may start from the step S 101 for padding unfolded-feature-data provided to the convolutional layer in a padding mode specified by the convolutional layer.
  • a convolution kernel having the same number of channels (i.e. the same depth) as the original unfolded-feature-data provided to the convolutional layer is designed for the original unfolded-feature-data
  • the convolution kernel is enabled to slide over the original unfolded-feature-data in a stride of $S_x$ (being greater than or equal to 1) in width and in a stride of $S_y$ (being greater than or equal to 1) in height
  • the data of a portion in the original unfolded-feature-data corresponding to the sliding window is convolved so as to obtain an output feature data (or activation value) with the number of channels being 1.
  • a plurality of convolution kernels may be designed for the convolutional layer, these convolution kernels form a weight parameter of the convolutional layer, and a plurality of results obtained by using these convolution kernels correspond to the data on different channels of the output feature data of the convolutional layer, respectively.
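  • For reference, the ordinary (unfolded) sliding-window convolution described above can be sketched as follows. This is a plain, unoptimized illustration; the `cube[x][y][z]` layout and the names `convolve` and `conv_layer` are assumptions for this example, not part of the disclosure.

```python
def convolve(data, kernel, stride_x, stride_y):
    """Slide one kernel over data[x][y][z]; returns a single-channel map out[x][y]."""
    width, height = len(data), len(data[0])
    k_w, k_h, k_d = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for x0 in range(0, width - k_w + 1, stride_x):
        column = []
        for y0 in range(0, height - k_h + 1, stride_y):
            acc = 0
            # Multiply-accumulate over the window and over all channels.
            for dx in range(k_w):
                for dy in range(k_h):
                    for z in range(k_d):
                        acc += data[x0 + dx][y0 + dy][z] * kernel[dx][dy][z]
            column.append(acc)
        out.append(column)
    return out

def conv_layer(data, kernels, stride_x, stride_y):
    """One output channel per convolution kernel of the layer."""
    return [convolve(data, k, stride_x, stride_y) for k in kernels]

# 3 x 3 x 1 input of ones, two 2 x 2 x 1 kernels, stride 1 in both dimensions.
data = [[[1] for _ in range(3)] for _ in range(3)]
ones = [[[1] for _ in range(2)] for _ in range(2)]
twos = [[[2] for _ in range(2)] for _ in range(2)]
result = conv_layer(data, [ones, twos], 1, 1)
```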
  • zero slices may be padded around the two dimensions of both width and height (including a starting boundary and an ending boundary in width, and a starting boundary and an ending boundary in height) of the original unfolded-feature-data in a specified padding mode, and the number of padded zero slices depends on the specified padding mode and may be zero, one or more.
  • the weight parameters including number of convolution kernels, width, height, depth, and contained value of each convolution kernel used in each convolutional layer and the padding mode for the original unfolded-feature-data provided to the convolutional layer are always known. These configurations may be specified in advance by a designer of the convolutional neural network when designing the convolutional neural network, and may also be designed or adjusted through learning.
  • the received input feature data is firstly pre-processed in the step S 101 , i.e. padding the received input feature data according to the padding mode specified by the convolutional layer, including padding zero, one or more zero slices at the starting boundary in width (on the left side) and/or the ending boundary in width (on the right side) and/or the starting boundary in height (on the upper side) and/or the ending boundary in height (on the lower side).
  • the padding amount on the left side and/or the upper side of the received input feature data, i.e. the number of zero slices to be padded, may also be determined according to the padding mode specified by the convolutional layer; the padding amount on the right side and/or the lower side of the received input feature data is then inferred from the width and/or height of the expected output feature data, the width and/or height of the convolution kernel used for the convolution operation, and the stride of the convolution kernel in width and/or height, and padding is performed correspondingly.
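  • A minimal sketch of how the padding amount at the ending boundary might be inferred is given below. It assumes the commonly used relation output = floor((input + leading + trailing - kernel) / stride) + 1, which is not stated in the text above; the function name `trailing_padding` is hypothetical.

```python
def trailing_padding(in_size, kernel_size, stride, out_size, leading_pad):
    """Smallest number of zero slices at the ending boundary such that a kernel
    of `kernel_size` moving with `stride` yields `out_size` outputs."""
    # Extent covered by the last window: (out_size - 1) * stride + kernel_size.
    needed = (out_size - 1) * stride + kernel_size
    return max(0, needed - in_size - leading_pad)

# Six data slices, kernel width 3, stride 1, six expected outputs,
# one zero slice already padded on the left side.
right = trailing_padding(in_size=6, kernel_size=3, stride=1, out_size=6, leading_pad=1)
```

    With these numbers one zero slice is needed on the right side, which matches the layout of FD1 in FIG. 2 (one zero slice P on each side of S1 to S6).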
  • the method 100 then proceeds to the step S 105 for folding the padded (pre-processed) unfolded-feature-data in at least one dimension of width and height.
  • the padded unfolded-feature-data FD from the step S101 may be folded to generate FD′ in one dimension D1 of width and height by splicing each $N_x$ consecutive slices of FD in D1 together in depth ($N_x$ being also referred to herein as a splicing number in D1, or as a splicing number for short where the context is clear), so that the data of the $(i_{fx} \times N_x + j_{fx})$-th slice of FD in D1 on all $C_x$ channels correspond to the data of the $i_{fx}$-th slice of FD′ in D1 on the $C_x$ consecutive channels starting from the $(j_{fx} \times C_x)$-th channel, wherein $N_x$ is an integer greater than 1, $i_{fx}$ is an integer greater than or equal to 0, $j_{fx}$ is an integer greater than or equal to 0 and less than $N_x$, and $C_x$ is an integer greater than 0.
  • FD′ may be further folded to generate FD′′ in the other dimension D2 of width and height by splicing each $N_y$ consecutive slices of FD′ in D2 together in depth ($N_y$ being also referred to herein as a splicing number in D2, or as a splicing number for short where the context is clear), so that the data of the $(i_{fy} \times N_y + j_{fy})$-th slice of FD′ in D2 on all $C_y$ channels correspond to the data of the $i_{fy}$-th slice of FD′′ in D2 on the $C_y$ consecutive channels starting from the $(j_{fy} \times C_y)$-th channel, wherein $N_y$ is an integer greater than 1, $i_{fy}$ is an integer greater than or equal to 0, $j_{fy}$ is an integer greater than or equal to 0 and less than $N_y$, and $C_y$ is an integer greater than 0.
  • the top half of FIG. 2 shows an example of folding the padded unfolded-feature-data FD 1 in width.
  • the padded unfolded-feature-data FD1 includes original data slices S1 to S6, and includes one zero slice P for padding on the left side and one zero slice P for padding on the right side.
  • FD 1 may be folded in width and folded feature data FD 1 ′ may be generated.
  • the width of the folded feature data FD1′ becomes half of the width of the padded unfolded-feature-data FD1, and the depth (the number of channels) becomes twice the depth (the number of channels) of the padded unfolded-feature-data FD1, such that the channel utilization is increased and the amount of computation in the width direction is reduced.
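  • The width folding of FD1 can be illustrated with a short sketch. The `cube[x][y][z]` layout, the function name `fold_width`, and the numeric values standing in for S1 to S6 are assumptions for this example; only the splicing rule (slice j of each group occupies the channels starting at j × C) follows the description above.

```python
def fold_width(cube, n):
    """Splice every `n` consecutive width slices of cube[x][y][z] together in depth."""
    if len(cube) % n:
        raise ValueError("width must be a multiple of the splicing number")
    folded = []
    for i in range(0, len(cube), n):
        group = cube[i:i + n]
        height = len(group[0])
        # For each row, concatenate the channels of the n slices in order.
        folded.append([sum((s[y] for s in group), []) for y in range(height)])
    return folded

# FD1 = P, S1..S6, P with height 1 and 2 channels; slice Sk holds [10k, 10k + 1].
P = [[0, 0]]
FD1 = [P] + [[[10 * k, 10 * k + 1]] for k in range(1, 7)] + [P]
FD1_folded = fold_width(FD1, 2)
```

    The folded result has 4 width slices with 4 channels each (P+S1, S2+S3, S4+S5, S6+P), compared with 8 slices of 2 channels before folding.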
  • the lower half of FIG. 2 shows an example of folding the padded unfolded-feature-data FD 2 in height.
  • the padded unfolded-feature-data FD 2 includes original data slices S 1 to S 4 and one zero slice P for padding on the upper side.
  • FD 2 may be folded in height and folded feature data FD 2 ′ may be generated.
  • the total number of height slices of FD2 is 5, which is not an integer multiple of 2, such that no other height slice of FD2 can be spliced to the slice S4, resulting in an inconsistency in the number of channels among the height slices of the folded feature data FD2′.
  • the total number of height slices of FD 2 may be checked before folding. If the total number is not an integer multiple of the splicing number, one or more zero slices may be firstly appended on the lower side of FD 2 (not shown in FIG. 2 ), such that the total number of the height slices of FD 2 becomes an integer multiple of the splicing number.
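  • The check-and-append step can be sketched as below for the FD2 example. The `cube[x][y][z]` layout, the function names, and the values used for S1 to S4 are illustrative assumptions; the sketch appends the minimum number of zero slices, although more may be appended as needed.

```python
def pad_height_to_multiple(cube, n):
    """Append zero height slices at the lower side until the height is a multiple of n."""
    height, depth = len(cube[0]), len(cube[0][0])
    extra = (-height) % n
    return [[row[:] for row in column] + [[0] * depth for _ in range(extra)]
            for column in cube]

def fold_height(cube, n):
    """Splice every `n` consecutive height slices of cube[x][y][z] together in depth."""
    cube = pad_height_to_multiple(cube, n)
    return [[sum(column[y:y + n], []) for y in range(0, len(column), n)]
            for column in cube]

# FD2 = P, S1..S4 in height, with width 1 and 1 channel; Sk holds the value k.
FD2 = [[[0], [1], [2], [3], [4]]]
FD2_folded = fold_height(FD2, 2)
```

    Here 5 height slices become 6 after one zero slice is appended, and folding yields 3 height slices of 2 channels each, so every slice has the same number of channels.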
  • the number of appended zero slices may be smaller than the splicing number, or may be larger than the splicing number as needed, such that the convolutional sliding window is always within the folded feature data for example when performing convolution by using the folded convolution kernel as described below.
  • one or more additional zero slices may be appended to the obtained folded feature data after folding such that the channel of each height slice of the appended folded feature data is aligned.
  • the feature or processing capacity of the hardware may be directly used.
  • a channel which is not occupied by actual data may be automatically regarded by the hardware as having a zero value.
  • the channel of each slice in the folded feature data (for example, FD 2 ′ in FIG. 2 ) may be aligned automatically by hardware.
  • the number of channels of the last width slice in the folded feature data is also possibly inconsistent with the number of channels of other width slices in a case of folding in width.
  • the padded unfolded-feature-data or the obtained folded feature data may be processed in width before or during folding or after folding, or processed automatically in width by means of feature of hardware, such that the channel of each width slice in the finally obtained folded feature data is aligned.
  • the height of the folded feature data FD2′ becomes half of the height of the padded unfolded-feature-data FD2, and the depth (number of channels) becomes twice the depth (number of channels) of the padded unfolded-feature-data FD2, such that the channel utilization is improved and the amount of operations in the height direction is reduced.
  • the folded feature data FD1′ may be further folded in height, or the folded feature data FD2′ may be further folded in width.
  • the difference between the further folding and the initial folding is only that the dimension of the folding and the object of the folding are different, for which the descriptions are therefore omitted herein.
  • the method according to the embodiment of the present disclosure is not limited to the padding mode for the original unfolded-feature-data, the number of width slices or height slices of the original unfolded-feature-data, and the splicing numbers for width folding or height folding.
  • the splicing number $N_x$ or $N_y$ may be 3, 4, or any other integer greater than 1.
  • the splicing numbers $N_x$ and/or $N_y$ for width folding or height folding may be configured according to the number of channels supported by the hardware (for example, the memory or arithmetic unit supporting multiple channels).
  • the splicing number $N_x$ in the dimension D1 may be determined to be a certain value less than or equal to
  • the original convolution kernel of the convolution layer is folded in at least one dimension of width and height so as to generate one or more folded convolution kernels corresponding to the original convolution kernel.
  • a weight parameter of the convolutional layer may include one or more convolution kernels, each convolution kernel having the same width and height, and usually the same depth (i.e. the number of channels) as the feature data provided to the layer. Therefore, it will be appreciated that the following descriptions focus on any one of the original convolution kernels of the weight parameter. In other words, if the weight parameter of a convolutional layer includes a plurality of convolution kernels, each convolution kernel may be processed as below.
  • one or more transformed convolution kernels $K[k_x]$ corresponding to the original convolution kernel K may be generated in the step S110 by padding $k_x \times S_x$ zero slices at the starting boundary of the original convolution kernel K in D1, wherein $S_x$ is the stride of the original convolution kernel K in D1, and $k_x$ is an integer greater than or equal to 0.
  • three transformed convolution kernels corresponding to the original convolution kernel K may be generated by padding 0 zero slices, $S_x$ zero slices, and $2 \times S_x$ zero slices, respectively.
  • $E_x$ transformed convolution kernels $K[k_x]$ corresponding to the original convolution kernel K may be generated.
  • each transformed convolution kernel $K[k_x]$ may be respectively folded in D1 by splicing each $N_x$ consecutive slices in D1 together in depth so as to generate a corresponding folded convolution kernel $K'[k_x]$ for each transformed convolution kernel $K[k_x]$, such that the data of the $(i_{kx} \times N_x + j_{kx})$-th slice in D1 of each $K[k_x]$ on all $C_x$ channels correspond to the data of the $i_{kx}$-th slice in D1 of $K'[k_x]$ on the $C_x$ consecutive channels starting from the $(j_{kx} \times C_x)$-th channel, wherein $i_{kx}$ is an integer greater than or equal to 0, and $j_{kx}$ is an integer greater than or equal to 0 and less than $N_x$.
  • the generated transformed convolution kernels $K[k_x]$ may have different dimension values in D1 (for example, a width value in a case where D1 is width), or there may be one or more transformed convolution kernels $K[k_x]$ whose dimension value in D1 is not an integer multiple of $N_x$, with the result that the slices of the corresponding $K'[k_x]$ are not aligned in depth.
  • a manner similar to the padding or appending applied to the feature data before, during, or after the folding as described above may be adopted to process the transformed convolution kernels $K[k_x]$ before, during, or after the folding, such that all of the transformed convolution kernels $K[k_x]$ have the same dimension value in D1 and all of the slices of the folded convolution kernels $K'[k_x]$ are aligned in depth.
  • an expected dimension value $EV_x$ in D1 of each transformed convolution kernel $K[k_x]$ may also be determined based on $E_x$, $S_x$, $N_x$, and the dimension value $V_x$ in D1 of the original convolution kernel K.
  • $K[k_x]$ may be adjusted by appending zero slices at the ending boundary in D1 of the transformed convolution kernel $K[k_x]$, such that the dimension value in D1 of the adjusted transformed convolution kernel $K[k_x]$ is equal to $EV_x$, and the adjusted transformed convolution kernel $K[k_x]$ may then be folded in D1 to generate a corresponding folded convolution kernel $K'[k_x]$.
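  • The generation and adjustment of the transformed convolution kernels can be sketched as follows. The text above does not reproduce how $EV_x$ is computed, so this example assumes one plausible choice: the length of the longest transformed kernel, $(E_x - 1) \times S_x + V_x$, rounded up to a multiple of $N_x$. The function name `transformed_kernels`, the `kernel[x][y][z]` layout, and the sample values are likewise assumptions.

```python
def transformed_kernels(kernel, stride, n, count):
    """Build `count` transformed kernels: the k-th has k * stride leading zero
    slices, and all are extended with trailing zero slices to a common width."""
    height, depth = len(kernel[0]), len(kernel[0][0])

    def zero_slice():
        return [[0] * depth for _ in range(height)]

    longest = (count - 1) * stride + len(kernel)
    target = -(-longest // n) * n  # assumed EV_x: round up to a multiple of n
    result = []
    for k in range(count):
        body = [zero_slice() for _ in range(k * stride)]
        body += [[row[:] for row in s] for s in kernel]
        body += [zero_slice() for _ in range(target - len(body))]
        result.append(body)
    return result

# Kernel of width 3 (KS1..KS3), height 1, depth 1; stride 1, splicing number 2,
# and two transformed kernels, as in the width-folding example of FIG. 3.
K = [[[1]], [[2]], [[3]]]
Ka, Kb = transformed_kernels(K, stride=1, n=2, count=2)
```

    Under these assumptions both kernels end up with width 4: Ka is K followed by one zero slice, and Kb is one zero slice followed by K.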
  • each folded convolution kernel $K'[k_x]$ is folded in D2 in a manner similar to folding K in D1.
  • one or more transformed convolution kernels $K'[k_x, k_y]$ corresponding to $K'[k_x]$ may be generated by padding $k_y \times S_y$ zero slices at the starting boundary in D2 of $K'[k_x]$, respectively, wherein $S_y$ is the stride in D2 of the original convolution kernel K, and $k_y$ is an integer greater than or equal to 0.
  • the maximum value of $k_y$ may be determined so as to control the number of transformed convolution kernels.
  • $E_y$ transformed convolution kernels $K'[k_x, k_y]$ corresponding to $K'[k_x]$, or $E_x \times E_y$ transformed convolution kernels $K'[k_x, k_y]$ corresponding to the original convolution kernel K, may be generated.
  • each transformed convolution kernel $K'[k_x, k_y]$ may be respectively folded in D2 by splicing each $N_y$ consecutive slices in D2 together in depth so as to generate a corresponding folded convolution kernel $K''[k_x, k_y]$ for each transformed convolution kernel $K'[k_x, k_y]$, such that the data of the $(i_{ky} \times N_y + j_{ky})$-th slice in D2 of each $K'[k_x, k_y]$ on all $C_y$ channels correspond to the data of the $i_{ky}$-th slice in D2 of $K''[k_x, k_y]$ on the $C_y$ consecutive channels starting from the $(j_{ky} \times C_y)$-th channel, wherein $i_{ky}$ is an integer greater than or equal to 0, and $j_{ky}$ is an integer greater than or equal to 0 and less than $N_y$.
  • an expected dimension value $EV_y$ in D2 of each transformed convolution kernel $K'[k_x, k_y]$ may also be determined according to $E_y$, $S_y$, $N_y$, and the dimension value $V_y$ in D2 of the original convolution kernel K.
  • $K'[k_x, k_y]$ may be adjusted by appending zero slices at the ending boundary in D2 of the transformed convolution kernel $K'[k_x, k_y]$, such that the dimension value in D2 of the adjusted transformed convolution kernel $K'[k_x, k_y]$ is equal to $EV_y$, and then the adjusted transformed convolution kernel $K'[k_x, k_y]$ may be folded in D2 to generate a corresponding folded convolution kernel $K''[k_x, k_y]$.
  • FIG. 3 shows an example of folding the original convolution kernel K in width corresponding to the folded feature data FD 1 ′ in FIG. 2 .
  • since the width $V_x$ of the original convolution kernel K is 3 (including width slices KS1 to KS3) and the stride $S_x$ in width is 1, the number of transformed convolution kernels in the dimension of width corresponding to the original convolution kernel K can be determined to be 2.
  • a transformed convolution kernel Ka may be generated by padding or appending 0 zero slice on the left side of the original convolution kernel K
  • a transformed convolution kernel Kb may be generated by padding or appending 1 zero slice on the left side of the original convolution kernel K.
  • each of the transformed convolution kernels Ka and Kb is folded in width, and two folded convolution kernels Ka′ and Kb′ are generated.
  • each width slice of the folded convolution kernel Ka′ may be aligned in depth by supplementing the zero slice KA before folding or during or after the folding, or each width slice of the folded convolution kernel Ka′ may be aligned automatically in depth by means of hardware.
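  • As a small sketch of the kernel folding in FIG. 3, each width slice below is written as a flat list of channel values (height 1). The helper name `fold_slices` and the numeric weights standing in for KS1 to KS3 are assumptions for this example; the zero slice KA is appended to Ka before folding.

```python
def fold_slices(slices, n):
    """Splice every `n` consecutive slices together in depth (channel lists concatenated)."""
    if len(slices) % n:
        raise ValueError("slice count must be a multiple of the splicing number")
    return [sum(slices[i:i + n], []) for i in range(0, len(slices), n)]

KS1, KS2, KS3 = [1, -1], [2, -2], [3, -3]   # two channels per width slice
ZERO = [0, 0]

Ka = [KS1, KS2, KS3, ZERO]   # no zero slice on the left, KA appended on the right
Kb = [ZERO, KS1, KS2, KS3]   # one zero slice on the left

Ka_folded = fold_slices(Ka, 2)
Kb_folded = fold_slices(Kb, 2)
```

    Both folded kernels have width 2 and depth 4, so they can be applied to feature data folded with the same splicing number.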
  • FIG. 3 only shows an example of folding the original convolution kernel in width. Folding the original convolution kernel in height, further folding the two folded convolution kernels Ka′ and Kb′ in height, and folding the original convolution kernel in height and further folding the generated folded convolution kernels in width are similar to the example in FIG. 3 for which the details are omitted herein.
  • step S 110 is illustrated after the step S 105 in FIG. 1 , it will be appreciated that the step S 110 may be performed before the step S 105 or in parallel with the step S 110 .
  • the padded unfolded-feature-data FD may be folded in D 1 in the step S 105 to obtain folded feature data FD′, and the original convolution kernel K is folded in D 1 in the step S 110 to obtain, for example, E x folded convolution kernels K′[k x ] (0 ⁇ k x ⁇ E x ). Then, the example method 100 proceeds to a step S 115 , to perform a convolution operation on the folded feature data FD′ by using the generated E x folded convolution kernels K′[k x ].
  • the stride of each folded convolution kernel K′[k x ] in D 1 is 1; otherwise, the stride in D 1 of each folded convolution kernel K′[k x ] is S x . Further, the stride in the other dimension D 2 of width and height of each folded convolution kernel K′[k x ] is the stride S y in D 2 of the original convolution kernel K.
  • FD′ may be continually folded in D 2 in step S 105 to obtain folded feature data FD′′, and E x folded convolution kernels K′[k x ] are folded in D 1 in the step S 110 to obtain E x ⁇ E y folded convolution kernels K′′[k x , k y ] (0 ⁇ k y ⁇ E y ). Then, the example method 100 proceeds to the step S 115 for performing a convolution operation on the folded feature data FD′′ by using the generated E x ⁇ E y folded convolution kernels K′′[k x , k y ].
  • If the stride S x of the original convolution kernel K in D 1 is equal to N x , the stride in D 1 of each folded convolution kernel K′′[k x , k y ] is 1; otherwise, the stride in D 1 of each folded convolution kernel K′′[k x , k y ] is S x .
  • If the stride S y of the original convolution kernel K in D 2 is equal to N y , the stride in D 2 of each folded convolution kernel K′′[k x , k y ] is 1; otherwise, the stride in D 2 of each folded convolution kernel K′′[k x , k y ] is S y .
  • all of the folded convolution kernels may be moved in D 1 or D 2 by the stride of the folded convolution kernel in D 1 or the stride in D 2 so as to perform convolution on another portion of the folded feature data.
  • the final output feature data can be obtained after performing convolutions on all of the portions of the folded feature data.
  • a convolution may be performed on the P+S 1 slice and the S 2 +S 3 slice in the folded feature data FD 1 ′ in the example of FIG. 2 by using the folded convolution kernel Ka′ in the example of FIG. 3 so as to obtain a partial value O 1 in the output feature data FD 1 ′′, and a convolution may be performed on the P+S 1 slice and the S 2 +S 3 slice in the folded feature data FD 1 ′ by using the folded convolution kernel Kb′ so as to obtain a partial value O 2 in the output feature data FD 1 ′′.
  • the folded convolution kernels Ka′ and Kb′ are moved in width so as to perform a convolution on the S 2 +S 3 slice and the S 4 +S 5 slice in the folded feature data FD 1 ′ to obtain partial values O 3 and O 4 in the output feature data FD 1 ′′.
  • the partial values O 5 and O 6 in the output feature data FD 1 ′′ are continually obtained.
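  • The sliding procedure described above can be illustrated with a minimal Python sketch. It is not taken from the disclosure: it assumes a width-only layout, an original stride of 1, a fold factor `n`, and that the folded kernels are obtained by shifting the original kernel by 0 to n − 1 zero slices before folding, which is only one reading of the zero-slice alignment described for FIG. 3. All function names are hypothetical.

```python
def fold(slices, n):
    """Splice every n consecutive width slices together in depth,
    appending all-zero slices first if the count is not a multiple of n."""
    depth = len(slices[0])
    padded = slices + [[0] * depth] * ((-len(slices)) % n)
    return [sum(padded[i:i + n], []) for i in range(0, len(padded), n)]


def folded_kernels(kernel, n):
    """Assumed construction for an original stride of 1: place `s` zero
    slices before the kernel and n - 1 - s after it, then fold."""
    zero = [0] * len(kernel[0])
    return [fold([zero] * s + kernel + [zero] * (n - 1 - s), n) for s in range(n)]


def dot(window, kernel):
    """Multiply-accumulate one kernel over one aligned window."""
    return sum(k * d for ks, ds in zip(kernel, window) for k, d in zip(ks, ds))


def conventional_conv(data, kernel):
    """Reference: unfolded convolution along width with stride 1."""
    kw = len(kernel)
    return [dot(data[x:x + kw], kernel) for x in range(len(data) - kw + 1)]


def folded_conv(data, kernel, n):
    """Apply every folded kernel to the same portion, then move all of
    them together by a stride of 1 in the folded width."""
    zero = [0] * len(data[0])
    valid = len(data) - len(kernel) + 1       # number of conventional outputs
    fd = fold(data + [zero] * (2 * n), n)     # extra zeros cannot change a valid output
    kernels = folded_kernels(kernel, n)
    kw = len(kernels[0])
    out = []
    for m in range(len(fd) - kw + 1):
        window = fd[m:m + kw]
        out.extend(dot(window, k) for k in kernels)   # e.g. O1 and O2, then O3 and O4
    return out[:valid]


# Seven width slices with 3 channels each, and a 3-slice kernel.
data = [[x + c for c in range(3)] for x in range(7)]
kernel = [[1, 0, -1], [2, 1, 0], [0, -1, 3]]
print(folded_conv(data, kernel, 2) == conventional_conv(data, kernel))
```

  • In this sketch each window of the folded data is read once and reused by all folded kernels, which is why the outputs come out already in their original order.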
  • convolutions may also be performed on the entire folded feature data by using each folded convolution kernel respectively. In such a case, it may not be necessary to modify the convolution instructions of the hardware. However, if one original convolution kernel corresponds to a plurality of folded convolution kernels, the partial results obtained by using each folded convolution kernel are distributed on different channels. The partial results distributed on different channels may be reorganized or expanded before providing the output feature data to the next layer of the convolutional neural network or taking the output feature data as the final output of the entire convolutional neural network, so as to obtain a complete output result on one channel.
  • a convolution may be performed on the entire folded feature data FD 1 ′ in the example of FIG. 2 by using the folded convolution kernel Ka′ in the example of FIG. 3 , and partial values O 1 , O 3 and O 5 in the output feature data FD 1 ′′ are obtained; then, a convolution is performed on the entire folded feature data FD 1 ′ by using the folded convolution kernel Kb′, and partial values O 2 , O 4 and O 6 in the output feature data FD 1 ′′ are obtained. Then, respective obtained partial values may be organized together to obtain a complete output result FD 1 ′′.
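  • The reorganization of partial results into one complete output can be sketched as follows. This is an illustrative helper only, assuming an original stride of 1 so that the s-th of n folded kernels yields the outputs at positions s, s + n, s + 2n and so on; the name `reorganize` and the optional `valid` parameter are hypothetical.

```python
def reorganize(partials, valid=None):
    """Interleave per-kernel partial results into one output sequence.

    partials[s][m] is assumed to hold the output for original position
    len(partials) * m + s, e.g. partials[0] = [O1, O3, O5] from Ka' and
    partials[1] = [O2, O4, O6] from Kb'.
    """
    merged = [value for group in zip(*partials) for value in group]
    return merged if valid is None else merged[:valid]


print(reorganize([["O1", "O3", "O5"], ["O2", "O4", "O6"]]))
# ['O1', 'O2', 'O3', 'O4', 'O5', 'O6']
```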
  • FIG. 4 and FIG. 5 only show example processes of performing convolutions in width.
  • the process of performing convolution in height is similar, for which the details are omitted herein.
  • a processing unit for example, a multiplier array for convolution operation
  • a convolution operation is to be performed on an RGB image (the number of channels being 3) of 720×1280 by using a 5×5 convolution kernel (with each stride in width and height being 1)
  • a comparison among the operation amounts of a conventional convolution i.e. performing the convolution on the original unfolded-feature-data by using the original convolution kernel
  • a widthwise folding-convolution i.e. folding the feature data and the original convolution kernel by every 2 slices in width and then performing convolution
  • width-height-wise folding convolution i.e. folding the feature data and the original convolution kernel by every 2 slices in width and height, respectively, and then performing convolution
  • the example data in Table 1 shows that the amount of operation may be significantly reduced (for example, the operation amount of the width-height-wise folding convolution is only 36% of the operation amount of the conventional convolution) and the rate of effective operations may be significantly improved (for example, the rate of effective operations of the width-height-wise folding convolution is improved by about 4 times compared with the conventional convolution) through folding the feature data and the convolution kernel and performing convolution operations by using the obtained folded feature data and the folded convolution kernel
  • FIG. 6 and FIG. 7 each show a block diagram of an apparatus for performing an operation of a convolutional layer in a convolutional neural network according to an embodiment of the present disclosure.
  • the example apparatus 600 may include one or more processors 610 .
  • the processor 610 may be a processing unit of any form capable of processing data and/or executing instructions, such as a common CPU and GPU or a dedicated processor or accelerator for a neural network.
  • the processor 610 may perform the method according to the embodiments of the present disclosure to fold the feature data and convolution kernel and perform convolution operations by using the folded feature data and the folded convolution kernel. Further, the processor 610 may also control other components in the apparatus 600 to perform expected functions.
  • the processor 610 may be connected to a memory 620 and an I/O interface 630 through a bus system and/or a connection mechanism in other forms (not shown).
  • the memory 620 may include a computer readable and writable storage medium in various forms, for example, a volatile memory and/or a non-volatile memory.
  • the volatile memory may include, for example, a random access memory (RAM) and/or a cache, etc.
  • the non-volatile memory may include, for example, a read only memory (ROM), a hard disk, a flash memory, etc.
  • the readable and writable storage medium may include, but is not limited to, an electric, a magnetic, an optical, an electromagnetic, an infrared, or a semiconductor system, apparatus, or device, or any combination of the above.
  • the memory 620 may also be a RAM on a chip carrying a dedicated processor.
  • the memory 620 may include program instructions for instructing the apparatus 600 to perform the method according to the embodiments of the present disclosure to fold the feature data and convolution kernel and perform convolution operations by using the folded feature data and the folded convolution kernel.
  • the I/O interface 630 may be configured to provide parameters or data to the processor 610 and output the result data processed by the processor 610 .
  • the example apparatus 700 may include a pre-processing unit 710 , a first folding unit 720 , a second folding unit 730 , and an arithmetic unit 740 .
  • the pre-processing unit 710 may be configured to pad the unfolded-feature-data provided to the convolutional layer according to the padding mode specified by the convolutional layer. In one embodiment, for example, the pre-processing unit 710 may be configured to perform the step S 101 in the example method 100 as shown in FIG. 1 .
  • the first folding unit 720 can be configured to fold the padded unfolded-feature-data in at least one dimension of width and height to generate folded feature data.
  • the first folding unit 720 may be configured to perform the step S 105 in the example method 100 as shown in FIG. 1 .
  • the second folding unit 730 may be configured to fold the original convolution kernel of the convolution layer in the at least one dimension to generate one or more folded convolution kernels corresponding to the original convolution kernel.
  • the second folding unit 730 may be configured to perform the step S 110 in the example method 100 as shown in FIG. 1 .
  • the arithmetic unit 740 may be configured to perform a convolution operation on the generated folded feature data by using the generated one or more folded convolution kernels.
  • the arithmetic unit 740 may be configured to perform the step S 115 in the example method 100 as shown in FIG. 1 .
  • apparatus 600 and apparatus 700 shown in FIG. 6 and FIG. 7 are only examples but not limiting.
  • the apparatus according to the embodiment of the present disclosure may include other components and/or structure as needed.
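  • As a rough software analogue of the unit structure of the example apparatus 700, the sketch below wires four pluggable callables in the order of the steps S 101, S 105, S 110 and S 115. The class name, field names and the stub functions in the usage example are hypothetical and only illustrate the data flow between the units.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ConvLayerApparatus:
    """Hypothetical software analogue of the four units of apparatus 700."""
    pre_process: Callable[[Any], Any]            # unit 710: pad the feature data (S101)
    fold_features: Callable[[Any], Any]          # unit 720: fold the padded data (S105)
    fold_kernel: Callable[[Any], List[Any]]      # unit 730: fold the kernel (S110)
    convolve: Callable[[Any, List[Any]], Any]    # unit 740: folded convolution (S115)

    def run(self, feature_data: Any, kernel: Any) -> Any:
        padded = self.pre_process(feature_data)
        folded_data = self.fold_features(padded)
        folded_kernels = self.fold_kernel(kernel)   # does not depend on the data path
        return self.convolve(folded_data, folded_kernels)


# Stub units that only record the order in which they are applied.
trace = ConvLayerApparatus(
    pre_process=lambda fd: f"pad({fd})",
    fold_features=lambda fd: f"fold({fd})",
    fold_kernel=lambda k: [f"fold({k})"],
    convolve=lambda fd, ks: f"conv({fd}, {ks[0]})",
).run("FD", "K")
print(trace)  # conv(fold(pad(FD)), fold(K))
```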
  • FIG. 8 shows an example of a device for performing a convolution operation on folded feature data according to an embodiment of the present disclosure.
  • the device 1100 may include a host processor 1110 , a dynamic random access memory (DRAM) 1120 , and a convolution engine 1130 which may be interconnected with each other via a bus system 1101 .
  • the host processor 1110 may be an ARM processor, a general-purpose Central Processing Unit (CPU), or any other type of processor or controller, and can execute program instructions to control operations of other components in the device 1100 such as the DRAM 1120 and the convolution engine 1130 as described below.
  • the DRAM 1120 may be a DDR RAM or any other type of DRAM, and can temporarily store data read from a non-volatile storage such as a magnetic hard disk.
  • the above-mentioned unfolded-feature-data and original convolution kernel for a convolutional layer in a convolutional neural network, or program instructions to be executed by the host processor 1110 , may be temporarily stored in the DRAM 1120 .
  • the convolution engine 1130 may read the unfolded-feature-data and the original convolution kernel from the DRAM 1120 to perform any one of the methods disclosed above.
  • the convolution engine 1130 may be formed as a chip, and its components and operations will be discussed below in detail.
  • the convolution engine 1130 may include an input buffer 1131 , which may be a static random access memory (SRAM).
  • the unfolded-feature-data and the original convolution kernel may be read from the DRAM 1120 and stored in the SRAM 1131 .
  • the unfolded-feature-data and the original convolution kernel may be stored in either the same SRAM 1131 or separate SRAMs. Before or while being stored in the SRAM 1131 , the unfolded-feature-data and the original convolution kernel may be padded and folded as described above with reference to FIGS. 1-3 .
  • padding, folding and storing of the unfolded-feature-data may be performed in one step. For example, while the unfolded-feature-data read from the DRAM 1120 are being written into the SRAM 1131 , additional zero values may be inserted into a data stream of the unfolded-feature-data, and the padded unfolded-feature-data are stored in a predetermined format into the SRAM 1131 so that the feature data stored in the SRAM 1131 have been padded and folded.
  • FIGS. 9A and 9B show examples of how the feature data FD 1 and FD 1 ′ in FIG. 2 are stored in the SRAM 1131 , respectively.
  • the SRAM 1131 may include a plurality of memory units 1141 arranged in plural columns 1140 , and each column 1140 may also be called a “slice”.
  • Each memory unit 1141 may include a plurality of memory cells (not shown) for storing a plurality of bits, respectively.
  • each memory unit 1141 may store 8 bits, 16 bits or more.
  • the number of bits stored in each memory unit 1141 is also called the data width.
  • Each memory unit 1141 has an address, and the SRAM slice 1140 is continuously addressed in the column direction.
  • the plurality of memory cells in each memory unit 1141 may be read and written synchronously, and the plurality of SRAM slices 1140 may be read or written synchronously, so that the SRAM 1131 has a data width B*N where B is the data width of the slice 1140 (or the unit 1141 ) and N is the number of slices 1140 included in the SRAM 1131 .
  • For example, if each memory unit 1141 has a data width of 64 bits, it can store 8 data of 8 bits each. Because each pixel of the original feature data FD 1 includes 3 channels, only one pixel (3 data for 3 channels) is stored in each unit 1141 , and the remaining 40 (i.e., 64 − 3*8) bits of the unit 1141 are padded with 5 zero values, as shown in FIG. 9A.
  • For the folded feature data FD 1 ′, two pixels may be stored in each unit 1141 , with only one zero value padded at the end of each pixel, as shown in FIG. 9B. In another example, two zero values may instead be padded at the end of the second pixel.
  • If each memory unit 1141 has a larger data width, more pixels may be stored in each memory unit 1141 .
  • As a result, more data may be supplied in one period to the calculation circuit 1133 described below for performing a convolution operation than when only one pixel is stored in each unit 1141 , thereby improving the computation efficiency of the device 1100 .
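  • The two storage layouts described for FIGS. 9A and 9B can be sketched in Python as below. The sketch assumes 8-bit data, so that a 64-bit memory unit offers 8 data slots, and it implements only the layout in which each pixel is zero-padded at its own end; the function name `pack_units` and its parameters are hypothetical.

```python
def pack_units(pixels, pixels_per_unit, slots_per_unit=8):
    """Lay out pixels in fixed-size memory units, zero-padding each pixel.

    Each unit has `slots_per_unit` data slots (8 slots of 8 bits for a
    64-bit unit). Every pixel gets an equal share of the slots and is
    padded with zeros at its end.
    """
    share = slots_per_unit // pixels_per_unit
    channels = len(pixels[0])
    if channels > share:
        raise ValueError("pixel does not fit into its share of the unit")
    units = []
    for i in range(0, len(pixels), pixels_per_unit):
        unit = []
        for pixel in pixels[i:i + pixels_per_unit]:
            unit.extend(pixel + [0] * (share - channels))
        unit.extend([0] * (slots_per_unit - len(unit)))   # incomplete last group
        units.append(unit)
    return units


pixels = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
unfolded = pack_units(pixels, 1)   # one pixel per unit, 5 zeros each
folded = pack_units(pixels, 2)     # two pixels per unit, 1 zero after each
print(unfolded[0])  # [1, 2, 3, 0, 0, 0, 0, 0]
print(folded[0])    # [1, 2, 3, 0, 4, 5, 6, 0]
```

  • With these inputs the unfolded layout occupies four memory units and the folded layout two, so each read of a unit delivers twice as many useful data in the folded case.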
  • the original convolution kernel may be read from the DRAM 1120 and written in the SRAM 1131 , and it may be padded and folded as described above to generate one or more folded convolution kernels.
  • Storage of the one or more folded convolution kernels may be similar to that of the folded feature data as described above with reference to FIGS. 9A and 9B, except that they are stored in different SRAM slices 1140 . Details of storage of the kernels in the SRAM 1131 will be omitted herein. It is appreciated that as the SRAM 1131 has a smaller capacity than the DRAM 1120 , it may read only a portion of the feature data and a portion of the kernels at one time.
  • the folded feature data and the one or more folded convolution kernels may be read from the SRAM 1131 into a calculation circuit 1133 to perform a convolution operation.
  • the calculation circuit 1133 may include a plurality of multipliers and a plurality of adders for the convolution operation.
  • the calculation circuit 1133 may simultaneously calculate products of plural pixels in the folded feature data each with a corresponding pixel of plural folded convolution kernels. By doing so repeatedly, a same portion of the folded feature data may be convolved by all the folded convolution kernels. For example, if the calculation circuit 1133 includes 256 multipliers, it may simultaneously multiply 8 pixels (each having 4 channels, 32 data in total) of the folded feature data each with a corresponding pixel (also having 4 channels) in 8 kernels, generating 64 (8 pixels*8 channels) data. As compared with a conventional case where the feature data is not folded, calculation efficiency is greatly improved.
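  • The arithmetic of the 256-multiplier example can be checked with a small Python sketch. It models one assumed cycle in which each of 8 feature pixels (4 channels each) is multiplied with the corresponding pixel of each of 8 kernels; how the resulting 64 partial sums are accumulated afterwards is not modeled, and the function name `multiply_cycle` is hypothetical.

```python
def multiply_cycle(pixels, kernels):
    """Return (partial_sums, product_count) for one assumed cycle.

    partial_sums[p][k] is the channel-wise dot product of feature pixel p
    with the corresponding pixel p of kernel k.
    """
    products = 0
    partial_sums = []
    for p, pixel in enumerate(pixels):
        row = []
        for kernel in kernels:
            row.append(sum(a * b for a, b in zip(pixel, kernel[p])))
            products += len(pixel)        # one multiplier per channel
        partial_sums.append(row)
    return partial_sums, products


pixels = [[p + c for c in range(4)] for p in range(8)]        # 8 pixels x 4 channels
kernels = [[[1] * 4 for _ in range(8)] for _ in range(8)]     # 8 kernels x 8 pixels x 4 channels
partial, products = multiply_cycle(pixels, kernels)
print(products, sum(len(row) for row in partial))  # 256 64
```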
  • the calculation results from the calculation circuit 1133 may be stored in an output buffer (SRAM) 1135 .
  • the input buffer 1131 and the output buffer 1135 are equipped with a crossbar 1132 and a crossbar 1134 , respectively, to facilitate data transfer with the calculation circuit 1133 . If necessary, the calculation results may also be moved from the output buffer 1135 to the DRAM 1120 .
  • the wordings such as “comprise” and “include” are to be construed in an inclusive sense, as opposed to an exclusive or exhaustive sense, that is to say, in a sense of “including but not limited to”. Additionally, when used in the disclosure, the wordings of “herein”, “above”, “below” and similar wordings shall refer to the disclosure as a whole but not to any specific portion of the disclosure. When being permitted in the context, the wordings in singular or plural used in the above descriptions may also include the plural or singular, respectively.
  • the wording of “or” in reference to a list of two or more items covers all of the following interpretations of the wording: any of the items in the list, all of the items in the list, and any combination of the items in the list.

US16/203,017 2017-11-28 2018-11-28 Method and apparatus for performing operation of convolutional layer in convolutional neural network Active 2041-07-16 US11468301B2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711212080.7A CN107844827B (zh) 2017-11-28 2017-11-28 Method and apparatus for performing operation of convolutional layer in convolutional neural network
CN201711212080.7 2017-11-28

Publications (2)

Publication Number Publication Date
US20190164045A1 US20190164045A1 (en) 2019-05-30
US11468301B2 true US11468301B2 (en) 2022-10-11

Family

ID=61680547

Family Applications (1)

Application Number Title Priority Date Filing Date
US16/203,017 Active 2041-07-16 US11468301B2 (en) 2017-11-28 2018-11-28 Method and apparatus for performing operation of convolutional layer in convolutional neural network

Country Status (5)

Country Link
US (1) US11468301B2 (de)
EP (1) EP3489863A1 (de)
JP (1) JP6856609B2 (de)
KR (1) KR20190062305A (de)
CN (1) CN107844827B (de)

Also Published As

Publication number Publication date
CN107844827A (zh) 2018-03-27
JP2019125352A (ja) 2019-07-25
JP6856609B2 (ja) 2021-04-07
EP3489863A1 (de) 2019-05-29
KR20190062305A (ko) 2019-06-05
CN107844827B (zh) 2020-05-26
US20190164045A1 (en) 2019-05-30

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED

AS Assignment

Owner name: NANJING HORIZON ROBOTICS TECHNOLOGY CO., LTD., CHINA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LI, DELIN;LING, KUN;CHEN, LIANG;AND OTHERS;REEL/FRAME:048858/0642

Effective date: 20190409

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: NON FINAL ACTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY