WO2022007266A1 - Method and apparatus for accelerating a convolutional neural network - Google Patents

Method and apparatus for accelerating a convolutional neural network

Info

Publication number
WO2022007266A1
WO2022007266A1 (PCT/CN2020/126196, CN2020126196W)
Authority
WO
WIPO (PCT)
Prior art keywords
weight
input data
input
sliding window
computing
Prior art date
Application number
PCT/CN2020/126196
Other languages
English (en)
French (fr)
Inventor
徐兵
张楠赓
Original Assignee
嘉楠明芯(北京)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 嘉楠明芯(北京)科技有限公司 filed Critical 嘉楠明芯(北京)科技有限公司
Priority to US18/015,308 priority Critical patent/US20230289230A1/en
Publication of WO2022007266A1 publication Critical patent/WO2022007266A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/80Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the invention belongs to the field of deep learning, and in particular relates to a method and device for accelerating a convolutional neural network.
  • Convolutional Neural Networks (CNN) are deep feed-forward artificial neural networks applied in many fields, such as image recognition.
  • Most current neural network processing units (NPU) are mainly designed to accelerate CNN computation.
  • a commonly used method is to optimize the convolution operation by using the Im2col function.
  • in the CNN training process, the whole picture is usually not processed at once; instead, the picture is first divided into multiple small patches, and each patch is rearranged by Im2col processing.
  • Embodiments of the present invention provide a method and device for accelerating a convolutional neural network. With this method and device, the above problems can be solved.
  • a method for accelerating a convolutional neural network includes: splitting the weight matrix of a convolutional layer by row into multiple weight segments, and caching the multiple weight segments respectively into multiple computing units of a computing unit array; reading the multiple input data streams respectively corresponding to the multiple weight segments, and feeding the multiple input data streams in parallel to multiple rows of computing units, where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer; and, inside each computing unit, performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer.
  • reading the multiple input data streams corresponding to the multiple weight segments further includes: for each weight segment, determining multiple rows of data in the input feature map according to the convolution stride of the convolutional layer, and reading the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
  • performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment further includes: inside each computing unit, using the cached weight segment as the sliding window and the convolution stride of the convolutional layer as the sliding step, performing the sliding-window operation on the input data stream fed to that computing unit, and performing multiply-accumulate operations on the weight segment and the data inside the window.
  • if the cache space of each computing unit is smaller than the full row length of the weight matrix, the method further includes: splitting each weight segment into multiple groups; in different time periods, caching different groups of each weight segment into the corresponding computing units, so that each computing unit repeats the sliding-window operation and convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in different time periods; and superimposing the obtained output feature sub-maps.
  • the method further includes: determining an index offset value of the sliding window operation according to the group currently buffered by each computing unit, where the index offset value is used to indicate the position of the initial sliding window.
  • an acceleration apparatus for a convolutional neural network includes: a logic control unit and a computing unit array, where each computing unit includes: a cache unit, a control unit, and a multiply-accumulate unit; the logic control unit is configured to: split the weight matrix of the convolutional layer by row into multiple weight segments, and cache the multiple weight segments respectively into multiple computing units of the computing array; and read the multiple input data streams respectively corresponding to the multiple weight segments and feed them in parallel to multiple rows of computing units, where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer.
  • the control unit is configured to perform a sliding window operation on the input data stream based on the buffered weight segment; the multiply-accumulate unit is configured to perform a multiply-accumulate operation.
  • the logic control unit is configured to: for each weight segment, determine multiple rows of data in the input feature map according to the first convolution stride of the convolutional layer, and read the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
  • inside each computing unit, the control unit is configured to: use the cached weight segment as the sliding window and the second convolution stride of the convolutional layer as the sliding step, and perform the sliding-window operation on the input data stream fed to that computing unit; the multiply-accumulate unit is configured to: perform multiply-accumulate operations on the weight segment and the data inside the window.
  • the logic control unit is configured to: split each weight segment into multiple groups; in different time periods, cache different groups of each weight segment into the corresponding computing units, so that each computing unit repeats the sliding-window operation and convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in different time periods; and superimpose the obtained output feature sub-maps.
  • the logic control unit is configured to: determine an index offset value for the sliding-window operation according to the group currently cached in each computing unit, where the index offset value is used to indicate the position of the initial sliding window.
  • at least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects: without using the Im2col function, the weight matrix of the convolutional layer is split, the weight segments obtained by the split are cached in the respective computing units, and multiple rows of data of the input feature map are concatenated to form the input data stream of each row of computing units.
  • Each computing unit performs sliding-window operations and multiply-accumulate operations on the input data stream based on its cached weight segment, thereby accelerating the convolution operation.
  • Fig. 1 is the schematic diagram of the convolution operation based on Im2col in the prior art
  • FIG. 2 is a schematic structural diagram of a convolutional neural network computing device
  • FIG. 3 is a schematic flowchart of a method for accelerating a convolutional neural network according to an embodiment of the present invention
  • Figure 4 is a schematic diagram of a three-dimensional CNN convolution
  • FIG. 5 is a schematic diagram of performing convolution using a computing unit (PE) that caches weight segments according to an embodiment of the present invention
  • FIG. 6 is a schematic diagram of performing a sliding-window operation using a computing unit (PE) that caches weight segments according to an embodiment of the present invention
  • FIG. 7 is a schematic structural diagram of a convolutional neural network acceleration apparatus according to an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a computing unit (PE) of a convolutional neural network acceleration device according to an embodiment of the present invention.
  • FIG. 2 shows a schematic structural diagram 20 of a convolutional neural network computing device. It includes a computing platform 21 and an external memory 22. The computing platform 21 at least includes a computing unit (PE) array 211 for performing convolution computations and an internal memory 212. The external memory 22 usually uses a low-cost storage medium, whose bandwidth is usually limited and whose read/write power consumption is relatively high.
  • The internal memory usually uses a storage medium with faster access, such as SRAM, which has higher bandwidth and lower read/write cost, but is usually more expensive, so its capacity is generally limited.
  • FIG. 3 shows a schematic flowchart of a method 300 for accelerating a convolutional neural network according to an embodiment of the present invention.
  • Various aspects of the acceleration method 300 of the convolutional neural network in FIG. 3 will be described in detail below with reference to the convolutional neural network computing device shown in FIG. 2 .
  • the method 300 may include:
  • Step 301 Split the weight matrix of the convolution layer into multiple weight segments by row, and cache the multiple weight segments to multiple computing units (PE) in the computing unit (PE) array respectively;
  • Step 302 Read the multiple input data streams respectively corresponding to the multiple weight segments, and feed the multiple input data streams in parallel to multiple rows of computing units (PE), where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer;
  • Step 303 Inside each computing unit (PE), perform sliding window operation and multiply-accumulate operation on the input data stream based on the buffered weight segment to obtain the output feature map of the convolutional layer.
  • Figure 4 is a schematic diagram of a three-dimensional CNN convolution. For any convolutional layer of the convolutional neural network, assume that the input feature map of the convolutional layer is matrix data composed of values D_abc, with size 6 (columns) × 6 (rows) × N (number of input channels), where the subscript a of D_abc denotes the input channel index, taking values 0, 1, ..., (N-1); the subscript b denotes the row index, taking values 0, 1, ..., 5; and the subscript c denotes the column index, taking values 0, 1, ..., 5. The input feature map is usually stored contiguously in the external memory 22, row by row and channel by channel, where N is a positive integer greater than 2.
  • The weight matrix of the convolutional layer is data composed of values W_mnpq, usually stored in the external memory 22, with size 3 (columns) × 3 (rows) × N (number of input channels) × 2 (number of output channels), where the subscript n of W_mnpq denotes the input channel index, taking values 0, 1, ..., (N-1); the subscript p denotes the row index, taking values 0, 1, 2; the subscript q denotes the column index, taking values 0, 1, 2; and the subscript m denotes the output channel index, taking values 0 and 1, corresponding to filter 0 and filter 1 in Figure 4, respectively.
  • The output feature map is matrix data composed of values P_xyz, with size 4 (columns) × 4 (rows) × 2 (number of output channels), where the subscript x of P_xyz denotes the output channel index, taking values 0, 1; the subscript y denotes the row index, taking values 0, 1, 2, 3; and the subscript z denotes the column index, taking values 0, 1, 2, 3. In this embodiment, the desired convolution operation is the operation that maps the input feature map and the weight matrix to the output feature map.
  • the weight matrix of the convolutional layer can be read from the external memory 22 into the internal memory 212, and the weight matrix is then split by row to obtain multiple weight segments.
  • the weight matrix shown in FIG. 4 can be split into 3*N*2 weight segments, each weight segment corresponding to one row of weight values of the weight matrix, with size 3 (columns) × 1 (row) × 1 (input channel) × 1 (output channel);
  • FIG. 5 is a schematic diagram of performing convolution using computing units (PE) that cache weight segments according to this embodiment. As shown in FIG. 5, the multiple weight segments are cached into multiple computing units (PE) of the computing unit (PE) array; for example, (W_0000, W_0001, W_0002) is cached into one computing unit (PE), (W_0010, W_0011, W_0012) is cached into another computing unit (PE), and so on.
  • in the weight matrix, the multiple weight segments corresponding to the same output channel are cached in the same column of the computing unit (PE) array; for example, two columns of computing units (PE) can be used to cache the weight segments of filter 0 and filter 1, respectively. Further, if different weight segments correspond to the same input data stream, these weight segments are arranged in the same row of the computing unit (PE) array.
  • in step 302, the multiple input data streams respectively corresponding to the multiple weight segments are read, and the multiple input data streams are fed in parallel to multiple rows of computing units (PE).
  • among the convolution windows determined by the convolution of the weight matrix with the input feature map, the data D_abc faced by each weight segment is different; taking a convolution stride of k=2 as an example, the weight segment (W_0000, W_0001, W_0002) only needs to be convolved with the input data of rows b = 0 and 2.
  • therefore, for each weight segment the relevant rows of the input feature map can be determined and concatenated into its input data stream, and during the convolution computation the input data streams are read and fed in parallel to the rows of computing units (PE).
  • the above step 302 further includes: for each weight segment, determining multiple rows of data in the input feature map according to the convolution stride of the convolutional layer, and reading the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment. For example, assuming the stride of the convolutional layer is k, the input data stream corresponding to the weight segment (W_0000, W_0001, W_0002) is formed by concatenating rows 0, k, ..., (E-1)k of the corresponding input channel of the input feature map of Figure 4.
  • the input data stream corresponding to the weight segment (W_0010, W_0011, W_0012) is formed by concatenating rows 1, k+1, ..., (E-1)k+1 of the corresponding input channel of the input feature map in FIG. 4.
  • the input data stream corresponding to the weight segment (W_0020, W_0021, W_0022) is formed by concatenating rows 2, k+2, ..., (E-1)k+2, and so on.
  • in step 303, inside each computing unit (PE), a sliding-window operation and multiply-accumulate operations are performed on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer.
  • the above step 303 further includes: inside each computing unit (PE), using the cached weight segment as the sliding window and the convolution stride of the convolutional layer as the sliding step, performing the sliding-window operation on the input data stream fed to that computing unit (PE), and performing multiply-accumulate operations on the weight segment and the data inside the window.
  • FIG. 6 shows a schematic diagram of performing the sliding-window operation using computing units (PE) that cache weight segments. As shown in FIG. 6, for the weight segment (W_0000, W_0001, W_0002), the corresponding input data stream is the data of input channel a=0 and rows b = 0, k, ..., (E-1)k of the input feature map D_abc of FIG. 4. At time T_1, the weight segment (W_0000, W_0001, W_0002) slides on the input data stream to (D_000, D_001, D_002) and performs the multiply-add operation W_0000×D_000 + W_0001×D_001 + W_0002×D_002.
  • at the same time, every other computing unit (PE) also slides synchronously on its corresponding input data stream based on its cached weight segment and performs multiply-add operations on the weight segment and the data inside the window; for example, the weight segment (W_0010, W_0011, W_0012) slides to (D_010, D_011, D_012).
  • the multiply-accumulate result of every computing unit (PE) at time T_1 is thus obtained; adding all the multiply-accumulate results at time T_1 of the column of computing units (PE) corresponding to filter 0 gives the output value P_000 of one output channel, and adding all the multiply-accumulate results at time T_1 of the column of computing units (PE) corresponding to filter 1 gives the output value P_100 of the other output channel. At time T_2, the weight segment (W_0000, W_0001, W_0002) slides by k values on the input data stream; assuming k=1, it slides to (D_001, D_002, D_003), and the multiply-add operation W_0000×D_001 + W_0001×D_002 + W_0002×D_003 is performed, and so on.
  • finally, every value of the output feature map in Figure 4 can be obtained.
  • the data of the input feature map can be read from the external memory in whole rows and fed in parallel, in the form of data streams, into each row of computing units (PE), without frequent cross-row or cross-column reads of the input feature map that is stored contiguously in a single storage direction in the external memory; therefore, convolution operations of different sizes can be supported without any special design of the memory layout.
  • if the cache space of each computing unit (PE) is smaller than the full row length of the weight matrix, the method further includes: splitting each weight segment into multiple groups; in different time periods, caching different groups of each weight segment into the corresponding computing units (PE), so that each computing unit (PE) repeats the sliding-window operation and convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in different time periods; and superimposing the obtained output feature sub-maps.
  • for example, if the weight matrix of the convolutional layer has size 10 (columns) × 10 (rows) × N (input channels) × 2 (output channels), a single weight segment has 10 values; if each computing unit (PE) can only cache 5 weight values, each weight segment can be divided into multiple groups.
  • for instance, a weight segment (W_0000, W_0001, ..., W_0009) can be divided into a first group (W_0000, W_0001, ..., W_0004) and a second group (W_0005, W_0006, ..., W_0009).
  • in a first time period, the first groups of the multiple weight segments can be read from the external memory and cached into the corresponding computing units (PE), and the input data stream corresponding to each weight segment is read from the external memory and fed into the corresponding computing unit (PE); each computing unit (PE) performs a first sliding-window operation and multiply-accumulate operations on the input data stream based on the cached first group to obtain a first output feature sub-map.
  • in a second time period, the second groups of the multiple weight segments are read from the external memory and cached into the corresponding computing units (PE) to replace the original first groups; the input data stream corresponding to each weight segment is read again from the external memory and fed into the corresponding computing unit (PE), and each computing unit (PE) performs a second sliding-window operation and multiply-accumulate operations on the input data stream based on the cached second group to obtain a second output feature sub-map; further, matrix accumulation is performed on the first output feature sub-map and the second output feature sub-map to output the output feature map of the convolutional layer.
  • the method further includes: determining an index offset value of the sliding window operation according to the group currently buffered by each computing unit (PE), where the index offset value is used to indicate the position of the initial sliding window.
  • after a weight segment has been split, each group has a different sliding start position on the input data stream during the sliding-window operation.
  • for example, the sliding start position of the first group (W_0000, W_0001, ..., W_0004) in each row of data of the input data stream needs no offset, whereas the sliding start position of the second group (W_0005, W_0006, ..., W_0009) in each row of data of the input data stream needs to be offset by 5 values, the 5 values being the number of values preceding the second group in each weight segment.
  • for uniform grouping, the index offset value may be s*L, where s indicates the group index, taking values 0, 1, 2, ..., and L indicates the group size.
  • based on the same or a similar technical concept, an embodiment of the present invention further provides an acceleration apparatus for a convolutional neural network, which includes: a logic control unit and a computing unit (PE) array, where each computing unit (PE) includes: a cache unit, a control unit, and a multiply-accumulate (MAC) unit.
  • the logic control unit 71 and the computing unit (PE) array 211 may be provided on the computing platform 21 of the convolutional neural network computing device 20 shown in FIG. 2; the logic control unit 71 is configured to: split the weight matrix of the convolutional layer by row into multiple weight segments, and cache the multiple weight segments respectively into multiple computing units (PE) of the computing array; and read the multiple input data streams respectively corresponding to the multiple weight segments and feed them in parallel to multiple rows of computing units (PE), where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer.
  • inside each computing unit (PE), the cache unit is configured to cache the weight segment; the control unit is configured to perform the sliding-window operation on the input data stream based on the cached weight segment; and the multiply-accumulate unit is configured to perform multiply-accumulate operations.
  • the logic control unit 71 is configured to: for each weight segment, determine multiple rows of data in the input feature map according to the first convolution stride of the convolutional layer, and read the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
  • inside each computing unit (PE), the control unit is configured to: use the cached weight segment as the sliding window and the second convolution stride of the convolutional layer as the sliding step, and perform the sliding-window operation on the input data stream fed to that computing unit (PE); the multiply-accumulate unit is configured to: perform multiply-accumulate operations on the weight segment and the data inside the window.
  • if the cache space of each computing unit (PE) is smaller than the full row length of the weight matrix, the logic control unit is configured to: split each weight segment into multiple groups; in different time periods, cache different groups of each weight segment into the corresponding computing units (PE), so that each computing unit (PE) repeats the sliding-window operation and convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in different time periods; and superimpose the obtained output feature sub-maps.
  • the logic control unit is configured to: determine an index offset value for the sliding-window operation according to the group currently cached in each computing unit (PE), where the index offset value is used to indicate the position of the initial sliding window.
  • the apparatus and the method provided in the embodiments of the present application correspond to each other one to one; therefore, the apparatus also has beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the apparatus are not repeated here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

A method and apparatus for accelerating a convolutional neural network. The method includes: splitting the weight matrix of a convolutional layer by row into multiple weight segments, and caching the multiple weight segments into multiple computing units of a computing unit array (step 301); reading the multiple input data streams respectively corresponding to the multiple weight segments, and feeding the multiple input data streams in parallel to multiple rows of computing units (step 302), where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer; and, inside each computing unit, performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer (step 303). With this method, the data entering each row of computing units is read contiguously, whole rows at a time, with no reads across rows or columns, so no special memory layout is required, convolutions of different sizes are supported, and the Im2col function does not need to be implemented, which reduces complexity.

Description

Method and apparatus for accelerating a convolutional neural network
Technical Field
The present invention belongs to the field of deep learning, and in particular relates to a method and an apparatus for accelerating a convolutional neural network.
Background Art
This section is intended to provide a background or context for the embodiments of the invention set forth in the claims. The description here is not admitted to be prior art merely by virtue of its inclusion in this section.
A convolutional neural network (Convolutional Neural Network, CNN) is a deep feed-forward artificial neural network that has been applied in many fields, such as image recognition. A convolutional neural network performs fairly complex computations during processing, mainly including convolution, batch normalization and activation. At present, most neural network processing units (NPU) are mainly designed to solve the computation problem of CNNs and to accelerate CNN computation.
In the prior art, a commonly used method is to optimize the convolution operation with the Im2col function. As shown in FIG. 1, during CNN training, the whole picture is usually not processed at once; instead, the picture is first divided into multiple small patches, and each patch is rearranged by Im2col, which unfolds the three-dimensional patch into a one-dimensional vector, so that the convolution can be converted into a two-dimensional matrix multiplication C = D × W, where D is the input image matrix and W is the weight matrix.
In the above scheme, computing one convolution requires accessing data in multiple rows and multiple columns at the same time. Taking a 3×3 convolution as an example, the nine values needed by one convolution are spread over three rows and three columns. It should be understood that the read bandwidth can only be guaranteed when contiguous data is read; to access those nine values simultaneously, the layout of the internal memory has to be specially designed, for example by partitioning the internal memory to increase the parallelism of memory accesses. However, since an NPU usually needs to support convolutions of different sizes, making the design general for different convolution kernels requires splitting the memory into many small blocks to be compatible with all settings, which on the one hand increases the area of the internal memory and on the other hand increases the complexity of the data-access logic. Therefore, designing an acceleration method for convolutional neural networks with high generality and low complexity is a technical problem that urgently needs to be solved.
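For illustration only, the Im2col-based scheme described above can be sketched in Python as follows; the sketch assumes numpy, an input with 4 channels of size 6×6, 3×3 kernels, stride 1 and two output channels, and is not part of the claimed method:

```python
import numpy as np

def im2col(x, kh, kw, stride=1):
    """Unfold a (C, H, W) input into the matrix D: one flattened patch per row."""
    c, h, w = x.shape
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    cols = np.empty((out_h * out_w, c * kh * kw), dtype=x.dtype)
    idx = 0
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            cols[idx] = x[:, i:i + kh, j:j + kw].ravel()   # 3-D patch -> 1-D row of D
            idx += 1
    return cols, out_h, out_w

x = np.random.rand(4, 6, 6)              # assumed: N=4 input channels, 6x6 feature map
w = np.random.rand(2, 4, 3, 3)            # assumed: 2 filters, each 4x3x3
d, oh, ow = im2col(x, 3, 3)                # D has shape (16, 36)
c = d @ w.reshape(2, -1).T                 # C = D x W, shape (16, 2)
out = c.T.reshape(2, oh, ow)               # output feature map, shape (2, 4, 4)
```

Note that each row of D gathers values from three rows and three columns of the input, which is exactly the scattered access pattern discussed above.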
Summary of the Invention
In view of the poor generality and high complexity of the convolution operation in the prior art described above, embodiments of the present invention provide a method and an apparatus for accelerating a convolutional neural network. With this method and apparatus, the above problems can be solved.
The embodiments of the present invention provide the following solutions.
In a first aspect, a method for accelerating a convolutional neural network is provided. The method includes: splitting the weight matrix of a convolutional layer by row into multiple weight segments, and caching the multiple weight segments respectively into multiple computing units of a computing unit array; reading the multiple input data streams respectively corresponding to the multiple weight segments, and feeding the multiple input data streams in parallel to multiple rows of computing units, where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer; and, inside each computing unit, performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer.
In a possible implementation, reading the multiple input data streams respectively corresponding to the multiple weight segments further includes: for each weight segment, determining multiple rows of data in the input feature map according to the convolution stride of the convolutional layer, and reading the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
In a possible implementation, performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment further includes: inside each computing unit, using the cached weight segment as the sliding window and the convolution stride of the convolutional layer as the sliding step, performing the sliding-window operation on the input data stream fed to that computing unit, and performing multiply-accumulate operations on the weight segment and the data inside the window.
In a possible implementation, if the cache space of each computing unit is smaller than the full row length of the weight matrix, the method further includes: splitting each weight segment into multiple groups; in different time periods, caching different groups of each weight segment into the corresponding computing unit, so that each computing unit repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimposing the obtained output feature sub-maps.
In a possible implementation, the method further includes: determining an index offset value for the sliding-window operation according to the group currently cached in each computing unit, the index offset value being used to indicate the position of the initial sliding window.
In a second aspect, an apparatus for accelerating a convolutional neural network is provided, including a logic control unit and a computing unit array, where each computing unit includes a cache unit, a control unit and a multiply-accumulate unit. The logic control unit is configured to: split the weight matrix of a convolutional layer by row into multiple weight segments, and cache the multiple weight segments respectively into multiple computing units of the computing array; and read the multiple input data streams respectively corresponding to the multiple weight segments and feed them in parallel to multiple rows of computing units, where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer. Inside each computing unit, the cache unit is configured to cache the weight segment; the control unit is configured to perform the sliding-window operation on the input data stream based on the cached weight segment; and the multiply-accumulate unit is configured to perform the multiply-accumulate operations.
In a possible implementation, the logic control unit is configured to: for each weight segment, determine multiple rows of data in the input feature map according to a first convolution stride of the convolutional layer, and read the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
In a possible implementation, inside each computing unit, the control unit is configured to: use the cached weight segment as the sliding window and a second convolution stride of the convolutional layer as the sliding step, and perform the sliding-window operation on the input data stream fed to that computing unit; the multiply-accumulate unit is configured to: perform multiply-accumulate operations on the weight segment and the data inside the window.
In a possible implementation, if the cache space of each computing unit is smaller than the full row length of the weight matrix, the logic control unit is configured to: split each weight segment into multiple groups; in different time periods, cache different groups of each weight segment into the corresponding computing unit, so that each computing unit repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimpose the obtained output feature sub-maps.
In a possible implementation, the logic control unit is configured to: determine an index offset value for the sliding-window operation according to the group currently cached in each computing unit, the index offset value being used to indicate the position of the initial sliding window.
At least one of the above technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects: without using the Im2col function, the weight matrix of the convolutional layer is split, the weight segments obtained by the split are cached in the respective computing units, multiple rows of data of the input feature map are concatenated to form the input data stream of each row of computing units, and each computing unit performs sliding-window operations and multiply-accumulate operations on the input data stream based on its cached weight segment, thereby accelerating the convolution operation. With this solution, the data entering each row of computing units is read contiguously, whole rows at a time, with no reads across rows or columns, so no special memory layout is required, convolution operations of different sizes can be supported, and the Im2col function does not need to be implemented separately, which reduces complexity.
It should be understood that the above description is only an overview of the technical solutions of the present invention, provided so that the technical means of the present invention can be understood more clearly and implemented according to the contents of the specification. To make the above and other objects, features and advantages of the present invention more apparent and understandable, specific embodiments of the present invention are set out below by way of example.
Brief Description of the Drawings
By reading the following detailed description of the exemplary embodiments, those of ordinary skill in the art will appreciate the advantages and benefits described herein as well as other advantages and benefits. The drawings are only for the purpose of illustrating the exemplary embodiments and are not to be considered as limiting the present invention. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
FIG. 1 is a schematic diagram of a convolution operation based on Im2col in the prior art;
FIG. 2 is a schematic structural diagram of a convolutional neural network computing device;
FIG. 3 is a schematic flowchart of a method for accelerating a convolutional neural network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a three-dimensional CNN convolution;
FIG. 5 is a schematic diagram of performing convolution with computing units (PE) that cache weight segments according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of performing a sliding-window operation with computing units (PE) that cache weight segments according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an apparatus for accelerating a convolutional neural network according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computing unit (PE) of an apparatus for accelerating a convolutional neural network according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals denote the same or corresponding parts.
Detailed Description of the Embodiments
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
In the present invention, it should be understood that terms such as "including" or "having" are intended to indicate the presence of the features, numbers, steps, actions, components, parts or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more other features, numbers, steps, actions, components, parts or combinations thereof exist.
It should also be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments may be combined with one another. The present invention will be described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
FIG. 2 shows a schematic structural diagram 20 of a convolutional neural network computing device. It includes a computing platform 21 and an external memory 22. The computing platform 21 at least includes a computing unit (PE) array 211 for performing convolution computations and an internal memory 212. The external memory 22 usually uses a low-cost storage medium, whose bandwidth is usually limited and whose read/write power consumption is relatively high. The internal memory usually uses a storage medium with faster access, such as SRAM, which has higher bandwidth and lower read/write cost, but is usually more expensive, so its capacity is generally limited.
FIG. 3 shows a schematic flowchart of a method 300 for accelerating a convolutional neural network according to an embodiment of the present invention. Various aspects of the acceleration method 300 of FIG. 3 are described in detail below in conjunction with the convolutional neural network computing device shown in FIG. 2.
As shown in FIG. 3, the method 300 may include:
Step 301: splitting the weight matrix of a convolutional layer by row into multiple weight segments, and caching the multiple weight segments respectively into multiple computing units (PE) of a computing unit (PE) array;
Step 302: reading the multiple input data streams respectively corresponding to the multiple weight segments, and feeding the multiple input data streams in parallel to multiple rows of computing units (PE);
Step 303: inside each computing unit (PE), performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer.
FIG. 4 is a schematic diagram of a three-dimensional CNN convolution. For any convolutional layer of a convolutional neural network, assume that the input feature map of the convolutional layer is matrix data composed of values D_abc, with size 6 (columns) × 6 (rows) × N (number of input channels), where the subscript a of D_abc denotes the input channel index, taking values 0, 1, ..., (N-1); the subscript b denotes the row index, taking values 0, 1, ..., 5; and the subscript c denotes the column index, taking values 0, 1, ..., 5. The input feature map is usually stored contiguously in the external memory 22, row by row and channel by channel, where N is a positive integer greater than 2. The weight matrix of the convolutional layer is data composed of values W_mnpq, usually stored in the external memory 22, with size 3 (columns) × 3 (rows) × N (number of input channels) × 2 (number of output channels), where the subscript n of W_mnpq denotes the input channel index, taking values 0, 1, ..., (N-1); the subscript p denotes the row index, taking values 0, 1, 2; the subscript q denotes the column index, taking values 0, 1, 2; and the subscript m denotes the output channel index, taking values 0 and 1, corresponding to filter 0 and filter 1 in FIG. 4, respectively. The output feature map is matrix data composed of values P_xyz, with size 4 (columns) × 4 (rows) × 2 (number of output channels), where the subscript x of P_xyz denotes the output channel index, taking values 0, 1; the subscript y denotes the row index, taking values 0, 1, 2, 3; and the subscript z denotes the column index, taking values 0, 1, 2, 3. In this embodiment, the desired convolution operation is the operation that maps the input feature map and the weight matrix to the output feature map.
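As a small illustrative check of this storage order (a sketch only; numpy and the concrete value N=4 are assumed), the following snippet shows that one row of one input channel occupies a single contiguous stretch of the row-by-row, channel-by-channel layout:

```python
import numpy as np

N, H, W = 4, 6, 6                           # assumed N; FIG. 4 uses a 6x6xN input
D = np.arange(N * H * W).reshape(N, H, W)   # D[a, b, c]: channel a, row b, column c
flat = D.ravel()                            # channel-by-channel, row-by-row storage

a, b = 1, 3                                 # any channel/row pair
start = (a * H + b) * W                     # offset of that row in linear memory
assert np.array_equal(flat[start:start + W], D[a, b, :])   # one contiguous burst
```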
In the embodiment of the present invention, taking the three-dimensional CNN convolution shown in FIG. 4 as an example, first, as in step S301, the weight matrix of the convolutional layer can be read from the external memory 22 into the internal memory 212, and the weight matrix is then split by row to obtain multiple weight segments. For example, the weight matrix shown in FIG. 4 can be split into 3*N*2 weight segments, each weight segment corresponding to one row of weight values of the weight matrix, with size 3 (columns) × 1 (row) × 1 (input channel) × 1 (output channel). Further, FIG. 5 is a schematic diagram of performing convolution with computing units (PE) that cache weight segments according to this embodiment. As shown in FIG. 5, the multiple weight segments are cached into multiple computing units (PE) of the computing unit (PE) array; for example, (W_0000, W_0001, W_0002) is cached into one computing unit (PE), (W_0010, W_0011, W_0012) is cached into another computing unit (PE), and so on.
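A minimal Python sketch of this splitting and caching of step 301 is given below; it is an illustration only and assumes numpy, the concrete value N=4, and a weight tensor laid out as (output channels, input channels, kernel rows, kernel columns); the name pe_cache is introduced here purely for the example:

```python
import numpy as np

M, N, KH, KW = 2, 4, 3, 3                     # 2 filters, N=4 input channels, 3x3 kernels
weights = np.random.rand(M, N, KH, KW)        # assumed layout: (out ch, in ch, row, col)

# One weight segment = one kernel row of one filter for one input channel (3 values).
# Segments of the same output channel share a PE column; segments that consume the
# same input data stream (same input channel and kernel row) share a PE row.
pe_rows, pe_cols = N * KH, M
pe_cache = [[None] * pe_cols for _ in range(pe_rows)]
for m in range(M):                            # output channel -> PE column
    for n in range(N):                        # input channel
        for p in range(KH):                   # kernel row; (n, p) -> PE row
            pe_cache[n * KH + p][m] = weights[m, n, p, :]

print(pe_cache[0][0])                         # e.g. the segment (W_0000, W_0001, W_0002)
```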
As shown in FIG. 5, the multiple weight segments of the weight matrix that correspond to the same output channel are cached in the same column of the computing unit (PE) array; for example, two columns of computing units (PE) can be used to cache the weight segments of filter 0 and filter 1, respectively. Further, if different weight segments correspond to the same input data stream, these weight segments are arranged in the same row of the computing unit (PE) array.
Further, as in step 302, the multiple input data streams respectively corresponding to the multiple weight segments are read, and the multiple input data streams are fed in parallel to multiple rows of computing units (PE).
Among the convolution windows determined by the convolution of the weight matrix with the input feature map, the data D_abc faced by each weight segment is different. Taking a convolution stride of k=2 as an example, the weight segment (W_0000, W_0001, W_0002) only needs to be convolved with the input data of rows b = 0 and 2. Therefore, for each weight segment, the relevant rows of the input feature map can be determined and concatenated into the input data stream corresponding to that weight segment; during the convolution computation, the multiple input data streams respectively corresponding to the multiple weight segments are read and fed in parallel to the multiple rows of computing units (PE).
In some possible implementations, the above step 302 further includes: for each weight segment, determining multiple rows of data in the input feature map according to the convolution stride of the convolutional layer, and reading the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment. For example, assuming the stride of the convolutional layer is k, the input data stream corresponding to the weight segment (W_0000, W_0001, W_0002) is formed by concatenating rows 0, k, ..., (E-1)k of the corresponding input channel of the input feature map of FIG. 4; assuming k=1, this is (D_000, D_001, D_002, D_003, D_004, D_005, D_010, D_011, ...). The input data stream corresponding to the weight segment (W_0010, W_0011, W_0012) is formed by concatenating rows 1, k+1, ..., (E-1)k+1 of the corresponding input channel of the input feature map in FIG. 4, the input data stream corresponding to the weight segment (W_0020, W_0021, W_0022) is formed by concatenating rows 2, k+2, ..., (E-1)k+2 of the corresponding input channel of the input feature map in FIG. 4, and so on.
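The construction of these input data streams in step 302 can be sketched as follows; this is an illustration only, assuming numpy, N=4 and k=1, with the helper input_stream introduced purely for the example:

```python
import numpy as np

N, H, W, KH, stride = 4, 6, 6, 3, 1
E = (H - KH) // stride + 1                  # number of output rows (E = 4 here)
x = np.random.rand(N, H, W)                 # input feature map D[a][b][c]

def input_stream(x, n, p, stride, E):
    """Concatenate rows p, p+stride, ..., p+(E-1)*stride of input channel n."""
    return np.concatenate([x[n, p + e * stride, :] for e in range(E)])

# one stream per PE row, i.e. per (input channel, kernel row) pair
streams = {(n, p): input_stream(x, n, p, stride, E)
           for n in range(N) for p in range(KH)}
print(streams[(0, 0)].shape)                # (24,): rows 0..3 of channel 0, spliced
```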
Further, as in step 303, inside each computing unit (PE), a sliding-window operation and multiply-accumulate operations are performed on the input data stream based on the cached weight segment, to obtain the output feature map of the convolutional layer.
In some possible implementations, the above step 303 further includes: inside each computing unit (PE), using the cached weight segment as the sliding window and the convolution stride of the convolutional layer as the sliding step, performing the sliding-window operation on the input data stream fed to that computing unit (PE), and performing multiply-accumulate operations on the weight segment and the data inside the window. For example, FIG. 6 shows a schematic diagram of performing the sliding-window operation with computing units (PE) that cache weight segments. As shown in FIG. 6, for the weight segment (W_0000, W_0001, W_0002), the corresponding input data stream is the data of input channel a=0 and rows b = 0, k, ..., (E-1)k of the input feature map D_abc of FIG. 4, i.e. (D_000, D_001, D_002, D_003, D_004, D_005, D_010, D_011, ...). On this basis, at time T_1, the weight segment (W_0000, W_0001, W_0002) slides on the input data stream to (D_000, D_001, D_002) and performs the multiply-add operation on the weight segment and the input data inside the window: W_0000×D_000 + W_0001×D_001 + W_0002×D_002. At the same time, the other computing units (PE) also slide synchronously on their corresponding input data streams based on their cached weight segments and perform multiply-add operations on the weight segments and the data inside the windows; for example, the weight segment (W_0010, W_0011, W_0012) slides to (D_010, D_011, D_012), the weight segment (W_0020, W_0021, W_0022) slides to (D_020, D_021, D_022), and the weight segments of the other input channels do likewise, so that the multiply-accumulate result of every computing unit (PE) at time T_1 is obtained. Then all the multiply-accumulate results at time T_1 of the column of computing units (PE) corresponding to filter 0 are added to obtain the output value P_000 of one output channel, and all the multiply-accumulate results at time T_1 of the column of computing units (PE) corresponding to filter 1 are added to obtain the output value P_100 of the other output channel. At time T_2, the weight segment (W_0000, W_0001, W_0002) slides by k values on the input data stream; assuming k=1, it slides to (D_001, D_002, D_003), and the multiply-add operation is performed on the weight segment and the input data inside the window: W_0000×D_001 + W_0001×D_002 + W_0002×D_003, and so on. In the end, every value of the output feature map in FIG. 4 can be obtained.
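The complete flow of steps 301 to 303 for the example of FIG. 4 can be sketched end to end as follows; the sketch is illustrative only, assumes numpy, N=4, k=1 and random data, and cross-checks the result against a direct convolution of the same tensors. The three nested loops over m, n and p stand in for the PE columns and rows of FIG. 5:

```python
import numpy as np

M, N, KH, KW, H, W, stride = 2, 4, 3, 3, 6, 6, 1
E = (H - KH) // stride + 1                         # 4 output rows (and columns here)
x = np.random.rand(N, H, W)                         # input feature map
weights = np.random.rand(M, N, KH, KW)              # filter 0 and filter 1

out = np.zeros((M, E, E))
for m in range(M):                                  # PE column <-> output channel
    for n in range(N):
        for p in range(KH):                         # PE row <-> (input channel, kernel row)
            segment = weights[m, n, p, :]           # step 301: cached weight segment
            stream = np.concatenate(                # step 302: spliced input data stream
                [x[n, p + e * stride, :] for e in range(E)])
            for y in range(E):                      # step 303: sliding-window MAC
                for z in range(E):
                    win = stream[y * W + z * stride: y * W + z * stride + KW]
                    out[m, y, z] += np.dot(segment, win)

# cross-check against a direct convolution of the same tensors
ref = np.zeros_like(out)
for m in range(M):
    for y in range(E):
        for z in range(E):
            ref[m, y, z] = np.sum(weights[m] *
                                  x[:, y * stride:y * stride + KH,
                                       z * stride:z * stride + KW])
assert np.allclose(out, ref)
```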
In this embodiment, by adopting the above method, the data of the input feature map can be read from the external memory in whole rows and fed in parallel, in the form of data streams, into each row of computing units (PE), without frequent cross-row or cross-column reads of the input feature map that is stored contiguously in a single storage direction in the external memory. Therefore, convolution operations of different sizes can be supported without any special design of the memory layout. In addition, there is no need to additionally implement the Im2col function in the computing platform, which saves hardware resources and computing power.
In some possible implementations, if the cache space of each computing unit (PE) is smaller than the full row length of the weight matrix, the method further includes: splitting each weight segment into multiple groups; in different time periods, caching different groups of each weight segment into the corresponding computing unit (PE), so that each computing unit (PE) repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimposing the obtained output feature sub-maps.
For example, assume that the weight matrix of the convolutional layer has size 10 (columns) × 10 (rows) × N (number of input channels) × 2 (number of output channels), so that a single weight segment has size 10 (columns) × 1 (row) × 1 (input channel) × 1 (output channel). If the cache space of the computing units (PE) is limited, for example each computing unit (PE) can only cache 5 weight values, this implementation can split each weight segment into multiple groups; for instance, a weight segment (W_0000, W_0001, ..., W_0009) can be divided into a first group (W_0000, W_0001, ..., W_0004) and a second group (W_0005, W_0006, ..., W_0009). Then, in a first time period, the first groups of the multiple weight segments are read from the external memory and cached into the corresponding computing units (PE), the input data stream corresponding to each weight segment is read from the external memory and fed into the corresponding computing unit (PE), and each computing unit (PE) performs a first sliding-window operation and multiply-accumulate operations on the input data stream based on the cached first group, obtaining a first output feature sub-map. After the operations involving the first groups are completed, in a second time period, the second groups of the multiple weight segments are read from the external memory and cached into the corresponding computing units (PE) to replace the original first groups, the input data stream corresponding to each weight segment is read again from the external memory and fed into the corresponding computing unit (PE), and each computing unit (PE) performs a second sliding-window operation and multiply-accumulate operations on the input data stream based on the cached second group, obtaining a second output feature sub-map. Further, matrix accumulation is performed on the first output feature sub-map and the second output feature sub-map to output the output feature map of the convolutional layer.
In this implementation, by grouping the weight segments, convolutions with large weight data can be handled without enlarging the on-chip cache space, which further improves the generality with respect to convolution operations of various sizes.
In some possible implementations, the method further includes: determining an index offset value for the sliding-window operation according to the group currently cached in each computing unit (PE), the index offset value being used to indicate the position of the initial sliding window.
Here, after a weight segment has been split, each group has a different sliding start position on the input data stream during the sliding-window operation. For example, the sliding start position of the first group (W_0000, W_0001, ..., W_0004) in each row of data of the input data stream needs no offset, whereas the sliding start position of the second group (W_0005, W_0006, ..., W_0009) in each row of data of the input data stream needs to be offset by 5 values, the 5 values being the number of values preceding the second group in each weight segment. For the case of uniform grouping, the index offset value may be s*L, where s indicates the group index, taking values 0, 1, 2, ..., and L indicates the group size.
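The grouping and the index offset s*L can be illustrated for a single computing unit as follows; this is a sketch only, assuming numpy, a weight segment of 10 values, a cache holding L=5 values and a stride of 1:

```python
import numpy as np

W, KW, L, stride = 16, 10, 5, 1             # 10-value segment, PE cache holds only L=5
F = (W - KW) // stride + 1                  # output values per row of the stream
row = np.random.rand(W)                     # one row of the input data stream
segment = np.random.rand(KW)                # full weight segment, cached group by group

out = np.zeros(F)
for s in range(KW // L):                    # one time period per group
    group = segment[s * L:(s + 1) * L]      # group currently cached in the PE
    offset = s * L                          # index offset of the initial sliding window
    for z in range(F):
        win = row[offset + z * stride: offset + z * stride + L]
        out[z] += np.dot(group, win)        # superimpose the output feature sub-maps

# the superimposed result equals a sliding MAC with the full 10-value segment
ref = np.array([np.dot(segment, row[z * stride:z * stride + KW]) for z in range(F)])
assert np.allclose(out, ref)
```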
Based on the same or a similar technical concept, an embodiment of the present invention further provides an apparatus for accelerating a convolutional neural network, which includes: a logic control unit and a computing unit (PE) array, where each computing unit (PE) includes: a cache unit, a control unit and a multiply-accumulate (MAC) unit.
In this embodiment, as shown in FIG. 7 and FIG. 8, the logic control unit 71 and the computing unit (PE) array 211 may be provided on the computing platform 21 of the convolutional neural network computing device 20 shown in FIG. 2. The logic control unit 71 is configured to: split the weight matrix of a convolutional layer by row into multiple weight segments, and cache the multiple weight segments respectively into multiple computing units (PE) of the computing array; and read the multiple input data streams respectively corresponding to the multiple weight segments and feed them in parallel to multiple rows of computing units (PE), where each input data stream is formed by concatenating multiple rows of data of the input feature map of the convolutional layer. Inside each computing unit (PE), the cache unit is configured to cache the weight segment; the control unit is configured to perform the sliding-window operation on the input data stream based on the cached weight segment; and the multiply-accumulate unit is configured to perform the multiply-accumulate operations.
In some possible implementations, the logic control unit 71 is configured to: for each weight segment, determine multiple rows of data in the input feature map according to a first convolution stride of the convolutional layer, and read the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
In some possible implementations, inside each computing unit (PE), the control unit is configured to: use the cached weight segment as the sliding window and a second convolution stride of the convolutional layer as the sliding step, and perform the sliding-window operation on the input data stream fed to that computing unit (PE); the multiply-accumulate unit is configured to: perform multiply-accumulate operations on the weight segment and the data inside the window.
In some possible implementations, if the cache space of each computing unit (PE) is smaller than the full row length of the weight matrix, the logic control unit is configured to: split each weight segment into multiple groups; in different time periods, cache different groups of each weight segment into the corresponding computing unit (PE), so that each computing unit (PE) repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimpose the obtained output feature sub-maps.
In a possible implementation, the logic control unit is configured to: determine an index offset value for the sliding-window operation according to the group currently cached in each computing unit (PE), the index offset value being used to indicate the position of the initial sliding window.
The embodiments of this application are described in a progressive manner; for identical or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatus embodiment is substantially similar to the method embodiments, its description is simplified, and reference may be made to the relevant parts of the description of the method embodiments.
The apparatus and the method provided in the embodiments of the present application correspond to each other one to one; therefore, the apparatus also has beneficial technical effects similar to those of the corresponding method. Since the beneficial technical effects of the method have been described in detail above, the beneficial technical effects of the apparatus are not repeated here.
Although the spirit and principles of the present invention have been described with reference to several specific embodiments, it should be understood that the present invention is not limited to the specific embodiments disclosed, and the division into aspects does not mean that features in these aspects cannot be combined to advantage; such division is only for convenience of presentation. The present invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

  1. A method for accelerating a convolutional neural network, characterized in that the method comprises:
    splitting a weight matrix of a convolutional layer by row into multiple weight segments, and caching the multiple weight segments respectively into multiple computing units of a computing unit array;
    reading multiple input data streams respectively corresponding to the multiple weight segments, and feeding the multiple input data streams in parallel to multiple rows of computing units, wherein each input data stream is formed by concatenating multiple rows of data of an input feature map of the convolutional layer; and
    inside each of the computing units, performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment, to obtain an output feature map of the convolutional layer.
  2. The method according to claim 1, characterized in that reading the multiple input data streams respectively corresponding to the multiple weight segments further comprises:
    for each weight segment, determining multiple rows of data in the input feature map according to a convolution stride of the convolutional layer, and reading the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
  3. The method according to claim 1, characterized in that performing a sliding-window operation and multiply-accumulate operations on the input data stream based on the cached weight segment further comprises:
    inside each of the computing units, using the cached weight segment as a sliding window and the convolution stride of the convolutional layer as a sliding step, performing the sliding-window operation on the input data stream fed to that computing unit, and performing multiply-accumulate operations on the weight segment and the data inside the window.
  4. The method according to claim 1, characterized in that, if a cache space of each of the computing units is smaller than a full row length of the weight matrix, the method further comprises:
    splitting each weight segment into multiple groups; in different time periods, caching different groups of each weight segment respectively into the corresponding computing unit, so that each computing unit repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimposing the obtained output feature sub-maps.
  5. The method according to claim 4, characterized by further comprising:
    determining an index offset value for the sliding-window operation according to the group currently cached in each computing unit, the index offset value being used to indicate a position of an initial sliding window.
  6. An apparatus for accelerating a convolutional neural network, characterized by comprising: a computing platform and an external memory, the computing platform at least comprising a computing unit array for performing convolution computations, an internal memory, and a logic control unit, wherein each computing unit of the computing unit array comprises: a cache unit, a control unit and a multiply-accumulate unit;
    the logic control unit is configured to: split a weight matrix of a convolutional layer by row into multiple weight segments, and cache the multiple weight segments respectively into multiple computing units of the computing array; and read multiple input data streams respectively corresponding to the multiple weight segments, and feed the multiple input data streams in parallel to multiple rows of computing units, wherein each input data stream is formed by concatenating multiple rows of data of an input feature map of the convolutional layer;
    inside each of the computing units, the cache unit is configured to cache the weight segment; the control unit is configured to perform a sliding-window operation on the input data stream based on the cached weight segment; and the multiply-accumulate unit is configured to perform multiply-accumulate operations.
  7. The apparatus according to claim 6, characterized in that the logic control unit is configured to:
    for each weight segment, determine multiple rows of data in the input feature map according to a first convolution stride of the convolutional layer, and read the multiple rows of data in turn to concatenate them into the input data stream corresponding to that weight segment.
  8. The apparatus according to claim 6, characterized in that, inside each of the computing units, the control unit is configured to: use the cached weight segment as a sliding window and a second convolution stride of the convolutional layer as a sliding step, and perform the sliding-window operation on the input data stream fed to that computing unit; the multiply-accumulate unit is configured to: perform multiply-accumulate operations on the weight segment and the data inside the window.
  9. The apparatus according to claim 6, characterized in that, if a cache space of each of the computing units is smaller than a full row length of the weight matrix, the logic control unit is configured to:
    split each weight segment into multiple groups; in different time periods, cache different groups of each weight segment respectively into the corresponding computing unit, so that each computing unit repeats the sliding-window operation and the convolution operation on the input data stream based on the currently cached group, thereby obtaining different output feature sub-maps in the different time periods; and superimpose the obtained output feature sub-maps.
  10. The apparatus according to claim 9, characterized in that the logic control unit is configured to:
    determine an index offset value for the sliding-window operation according to the group currently cached in each computing unit, the index offset value being used to indicate a position of an initial sliding window.
PCT/CN2020/126196 2020-07-08 2020-11-03 一种卷积神经网络的加速方法及装置 WO2022007266A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/015,308 US20230289230A1 (en) 2020-07-08 2020-11-03 Method and apparatus for accelerating convolutional neural network

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010652622.8 2020-07-08
CN202010652622.8A CN113919477A (zh) 2020-07-08 2020-07-08 一种卷积神经网络的加速方法及装置

Publications (1)

Publication Number Publication Date
WO2022007266A1 true WO2022007266A1 (zh) 2022-01-13

Family

ID=79231616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/126196 WO2022007266A1 (zh) 2020-07-08 2020-11-03 一种卷积神经网络的加速方法及装置

Country Status (3)

Country Link
US (1) US20230289230A1 (zh)
CN (1) CN113919477A (zh)
WO (1) WO2022007266A1 (zh)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202071A (zh) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 一种基于数据流模式的深度卷积神经网络推理加速方法
CN114565501A (zh) * 2022-02-21 2022-05-31 格兰菲智能科技有限公司 用于卷积运算的数据加载方法及其装置
CN114595748A (zh) * 2022-02-21 2022-06-07 南昌大学 一种用于跌倒防护系统的数据分割方法
CN115600652A (zh) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) 卷积神经网络处理装置、高速目标探测方法以及设备
CN116306823A (zh) * 2023-04-27 2023-06-23 北京爱芯科技有限公司 为mac阵列提供数据的方法、装置和芯片
CN117009859A (zh) * 2023-09-26 2023-11-07 深圳市魔数智擎人工智能有限公司 一种基于内存计算的特征拼接方法及系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114254740B (zh) * 2022-01-18 2022-09-30 长沙金维信息技术有限公司 卷积神经网络加速计算方法、计算系统、芯片及接收机

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108537330A (zh) * 2018-03-09 2018-09-14 中国科学院自动化研究所 应用于神经网络的卷积计算装置及方法
CN108629406A (zh) * 2017-03-24 2018-10-09 展讯通信(上海)有限公司 用于卷积神经网络的运算装置
CN109272113A (zh) * 2018-09-13 2019-01-25 深思考人工智能机器人科技(北京)有限公司 一种卷积神经网络的建立装置及方法
CN110705687A (zh) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 卷积神经网络硬件计算装置及方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3444757B1 (en) * 2016-04-15 2021-07-07 Cambricon Technologies Corporation Limited Discrete data representation supported device and method for forward operation of artificial neural network
WO2018214913A1 (zh) * 2017-05-23 2018-11-29 上海寒武纪信息科技有限公司 处理方法及加速装置
US10678508B2 (en) * 2018-03-23 2020-06-09 Amazon Technologies, Inc. Accelerated quantized multiply-and-add operations
CN109543830B (zh) * 2018-09-20 2023-02-03 中国科学院计算技术研究所 一种用于卷积神经网络加速器的拆分累加器

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108629406A (zh) * 2017-03-24 2018-10-09 展讯通信(上海)有限公司 用于卷积神经网络的运算装置
CN108537330A (zh) * 2018-03-09 2018-09-14 中国科学院自动化研究所 应用于神经网络的卷积计算装置及方法
CN109272113A (zh) * 2018-09-13 2019-01-25 深思考人工智能机器人科技(北京)有限公司 一种卷积神经网络的建立装置及方法
CN110705687A (zh) * 2019-09-05 2020-01-17 北京三快在线科技有限公司 卷积神经网络硬件计算装置及方法

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202071A (zh) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 一种基于数据流模式的深度卷积神经网络推理加速方法
CN114202071B (zh) * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 一种基于数据流模式的深度卷积神经网络推理加速方法
CN114565501A (zh) * 2022-02-21 2022-05-31 格兰菲智能科技有限公司 用于卷积运算的数据加载方法及其装置
CN114595748A (zh) * 2022-02-21 2022-06-07 南昌大学 一种用于跌倒防护系统的数据分割方法
CN114595748B (zh) * 2022-02-21 2024-02-13 南昌大学 一种用于跌倒防护系统的数据分割方法
CN114565501B (zh) * 2022-02-21 2024-03-22 格兰菲智能科技有限公司 用于卷积运算的数据加载方法及其装置
CN115600652A (zh) * 2022-11-29 2023-01-13 深圳市唯特视科技有限公司(Cn) 卷积神经网络处理装置、高速目标探测方法以及设备
CN116306823A (zh) * 2023-04-27 2023-06-23 北京爱芯科技有限公司 为mac阵列提供数据的方法、装置和芯片
CN116306823B (zh) * 2023-04-27 2023-08-04 北京爱芯科技有限公司 为mac阵列提供数据的方法、装置和芯片
CN117009859A (zh) * 2023-09-26 2023-11-07 深圳市魔数智擎人工智能有限公司 一种基于内存计算的特征拼接方法及系统
CN117009859B (zh) * 2023-09-26 2024-01-09 深圳市魔数智擎人工智能有限公司 一种基于内存计算的特征拼接方法及系统

Also Published As

Publication number Publication date
CN113919477A (zh) 2022-01-11
US20230289230A1 (en) 2023-09-14

Similar Documents

Publication Publication Date Title
WO2022007266A1 (zh) 一种卷积神经网络的加速方法及装置
CN109886400B (zh) 基于卷积核拆分的卷积神经网络硬件加速器系统及其计算方法
Qiao et al. AtomLayer: A universal ReRAM-based CNN accelerator with atomic layer computation
WO2021109699A1 (zh) 人工智能加速器、设备、芯片及数据处理方法
CN111445012B (zh) 一种基于fpga的分组卷积硬件加速器及其方法
US11775430B1 (en) Memory access for multiple circuit components
CN109409511B (zh) 一种用于动态可重构阵列的卷积运算数据流调度方法
WO2022007265A1 (zh) 一种膨胀卷积加速计算方法及装置
CN106846235B (zh) 一种利用NVIDIA Kepler GPU汇编指令加速的卷积优化方法及系统
CN110222818B (zh) 一种用于卷积神经网络数据存储的多bank行列交织读写方法
CN110738308B (zh) 一种神经网络加速器
CN112668708B (zh) 一种提高数据利用率的卷积运算装置
CN110580519B (zh) 一种卷积运算装置及其方法
CN112905530A (zh) 片上架构、池化计算加速器阵列、单元以及控制方法
CN111340198A (zh) 基于fpga的数据高度复用的神经网络加速器
JP7332722B2 (ja) データ処理方法、装置、記憶媒体及び電子機器
CN115423081A (zh) 一种基于fpga的cnn_lstm算法的神经网络加速器
CN110414672B (zh) 卷积运算方法、装置及系统
WO2022062391A1 (zh) 一种加速rnn网络的系统、方法及存储介质
CN110766136B (zh) 一种稀疏矩阵与向量的压缩方法
CN111860819A (zh) 一种可拼接、可分段的全连接神经网络推理加速器及其加速方法
CN111191774B (zh) 面向精简卷积神经网络的低代价加速器架构及其处理方法
CN113610221B (zh) 一种基于fpga的可变膨胀卷积运算硬件系统
CN115204373A (zh) 一种卷积神经网络的快速卷积及缓存模式的设计方法
CN113627587A (zh) 一种多通道式卷积神经网络加速方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20944043

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20944043

Country of ref document: EP

Kind code of ref document: A1