WO2020119318A1 - Self-adaptive selection and design method for convolutional-layer hardware accelerator - Google Patents

Self-adaptive selection and design method for convolutional-layer hardware accelerator

Info

Publication number
WO2020119318A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
input
convolutional layer
accelerator
channels
Prior art date
Application number
PCT/CN2019/114910
Other languages
French (fr)
Chinese (zh)
Inventor
秦华标
曹钦平
Original Assignee
华南理工大学
Application filed by 华南理工大学 (South China University of Technology)
Publication of WO2020119318A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed is a self-adaptive selection and design method for a convolutional-layer hardware accelerator, comprising the following steps: (1) analyzing convolutional layer structures, designing four different hardware accelerator solutions for different kinds of convolutional layer structures, and storing the four different hardware accelerator solutions in an accelerator solution pool; and (2) obtaining a convolutional layer structure and a convolutional layer parameter from an input source, selecting, according to the convolutional layer structure, an optimal accelerator solution from the accelerator solution pool, and constructing a corresponding convolutional-layer accelerator on the basis of the optimal accelerator solution. The invention is employed to design a solution pool of convolution-layer accelerators, self-adaptively select an optimal solution, and generate a hardware accelerator, thereby enabling more flexible hardware design, while also reducing resource consumption and increasing parallel operation speeds of convolution layers.

Description

An Adaptive Convolutional-Layer Hardware Accelerator Design Method
Technical field
The invention relates to the design of hardware accelerators for convolutional neural networks and belongs to the technical field of integrated-circuit hardware acceleration. Specifically, it is a method that adaptively selects the optimal hardware acceleration scheme for a given convolutional layer structure and generates the corresponding hardware accelerator.
Background
In recent years, convolutional neural networks have been widely used in image classification, object detection, and speech recognition. However, while they achieve very high accuracy, they also require more computing and memory resources, so many applications based on convolutional neural networks must rely on large servers. On resource-constrained embedded platforms, applying deep learning techniques such as convolutional neural networks has nevertheless become the general trend. Convolutional neural networks usually contain a large number of convolutional layers that can be computed in parallel, so designing hardware accelerators for convolutional layers is a necessary direction for future development.
In current research on hardware acceleration of convolutional layers, the main approach ignores the convolutional layer structure and accelerates every layer with the same hardware circuit architecture. This approach is not optimized for different structures, so it consumes more hardware resources and also lowers the parallel computing speed. Current hardware designs mainly provide a hardware interface, which gives very poor flexibility for circuits such as convolutional layers that have many parameters and a complex structure.
Given these limitations, a corresponding accelerator scheme can be designed for each kind of convolutional layer structure. The storage area that holds all accelerator schemes is called the accelerator solution pool. The convolutional layer structure is obtained from the input source, the optimal scheme is selected from the accelerator solution pool, and the hardware accelerator is then generated. A search of the prior-art literature found no report of designing different accelerator schemes for different convolutional layer structures and adaptively selecting the optimal one.
Summary of the invention
The present invention overcomes the deficiencies of existing hardware acceleration techniques for convolutional layers and proposes an adaptive convolutional-layer hardware accelerator design method.
By designing four different accelerator schemes, adaptively selecting the optimal scheme, and generating a hardware accelerator, the invention improves the flexibility of hardware design while reducing resource consumption and increasing computation speed.
The object of the present invention is achieved by at least one of the following technical solutions.
An adaptive convolutional-layer hardware accelerator design method comprising the following steps:
(1) Analyze the convolutional layer structure, design four different hardware accelerator schemes for the different convolutional layer structures, and store the four schemes in the accelerator solution pool;
(2) Obtain the convolutional layer structure and convolutional layer parameters from the input source, select the optimal accelerator scheme from the accelerator solution pool according to the convolutional layer structure, and construct the corresponding convolutional-layer accelerator from that scheme.
Further, in step (1), the user specifies an input-channel threshold N_i and an output-channel threshold N_o, which divide convolutional layer structures into the following four types: fewer than N_i input channels and fewer than N_o output channels; fewer than N_i input channels and more than N_o output channels; more than N_i input channels and fewer than N_o output channels; more than N_i input channels and more than N_o output channels.
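The threshold test above can be sketched as a small function. This is only an illustration: the function name `classify_layer` is hypothetical, and because the text does not say how a count exactly equal to a threshold is handled, the sketch assumes such a count is treated as "many".

```python
def classify_layer(n_in: int, n_out: int, n_i: int, n_o: int) -> int:
    """Return the structure type (1 to 4) of a convolutional layer.

    n_in, n_out: number of input and output channels of the layer.
    n_i, n_o:    user-specified thresholds N_i and N_o.
    """
    # Assumption: a count equal to the threshold is treated as "many".
    many_in = n_in >= n_i
    many_out = n_out >= n_o
    if not many_in and not many_out:
        return 1  # few input channels, few output channels
    if not many_in and many_out:
        return 2  # few input channels, many output channels
    if many_in and not many_out:
        return 3  # many input channels, few output channels
    return 4      # many input channels, many output channels
```

For example, with both thresholds set to 64, a layer with 3 input channels and 128 output channels falls into type 2.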
The hardware acceleration schemes are:
Parallel acceleration scheme 1: the output channels are computed in parallel, and the input channels and the convolution windows are each pipelined;
Parallel acceleration scheme 2: the output channels and the input channels are computed in parallel, and the convolution windows are pipelined;
Parallel acceleration scheme 3: the input channels are computed in parallel, and the output channels and the convolution windows are each pipelined;
Parallel acceleration scheme 4: a subset of the input channels and the output channels are computed in parallel, and the input-channel subsets and the convolution windows are each pipelined.
The four hardware accelerator schemes are kept in a storage area called the accelerator solution pool.
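One way to picture the accelerator solution pool is as a lookup table that records, for each scheme, which dimensions are computed in parallel and which are pipelined. The sketch below is only a software illustration of that bookkeeping; the names `Scheme` and `SCHEME_POOL` are hypothetical and the source does not describe how the pool is actually stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    name: str
    parallel: tuple   # dimensions computed in parallel
    pipelined: tuple  # dimensions streamed through a pipeline


# Hypothetical in-memory model of the four schemes listed in the text.
SCHEME_POOL = {
    1: Scheme("parallel acceleration scheme 1",
              ("output channels",),
              ("input channels", "convolution windows")),
    2: Scheme("parallel acceleration scheme 2",
              ("output channels", "input channels"),
              ("convolution windows",)),
    3: Scheme("parallel acceleration scheme 3",
              ("input channels",),
              ("output channels", "convolution windows")),
    4: Scheme("parallel acceleration scheme 4",
              ("output channels", "input channels within one group"),
              ("input-channel groups", "convolution windows")),
}
```

All four schemes pipeline the convolution windows; they differ only in how the channel dimensions are split between parallel hardware and pipeline stages.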
Further, adaptively selecting the optimal scheme and generating the hardware accelerator comprises the following steps: first, obtain the convolutional layer structure and convolutional layer parameters from the input source; then select the optimal accelerator scheme from the accelerator solution pool according to the convolutional layer structure; finally, generate the final hardware accelerator from the optimal accelerator scheme and the convolutional layer parameters.
Further, parallel acceleration scheme 4 proceeds as follows: the input channels are divided into several equal groups, and for each group, one convolution window of its input channels is convolved with all convolution kernels; the groups of input channels are then pipelined to obtain the convolution output of one convolution window over all input channels; finally the convolution windows are pipelined to obtain the convolution output over all input channels.
Further, obtaining the convolutional layer parameters includes: obtaining from the input source the height and width of the input feature map and its number of input channels; obtaining the height and width of the convolution kernel, the number of convolution kernels, and the width stride and height stride; and obtaining the values of the input feature map, weights, and biases. The hardware resources consumed and the clock cycles required by each acceleration scheme are estimated from the convolutional layer parameters. These estimates are combined with the constraints the user imposes for the task to select the optimal accelerator scheme, from which the convolutional-layer hardware accelerator is generated.
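The estimate-then-select idea can be sketched with the multiplier counts and clock-cycle counts that the detailed description states for each scheme. The selection rule used here (fewest cycles within a multiplier budget) is an assumption made for illustration, as are the names `estimate` and `pick_scheme` and the choice u = ceil(N / Q); adder counts are omitted because they are not available in the text.

```python
import math


def estimate(scheme, N, M, Wk, Hk, G, T, Q=1):
    """Return (multipliers, clock cycles) for one scheme.

    N, M: input and output channel counts; Wk, Hk: kernel width and height;
    G: convolution windows per channel; T: cycles for one window;
    Q: number of input-channel groups (used by scheme 4 only).
    """
    u = math.ceil(N / Q)  # assumption: channels per group in scheme 4
    if scheme == 1:
        return M * Wk * Hk, T + N + G
    if scheme == 2:
        return N * M * Wk * Hk, T + G
    if scheme == 3:
        return N * Wk * Hk, T + M + G
    if scheme == 4:
        return u * M * Wk * Hk, T + Q + G
    raise ValueError("unknown scheme")


def pick_scheme(N, M, Wk, Hk, G, T, Q, max_multipliers):
    """Assumed rule: fewest cycles among schemes that fit the multiplier budget."""
    candidates = []
    for s in (1, 2, 3, 4):
        mult, cycles = estimate(s, N, M, Wk, Hk, G, T, Q)
        if mult <= max_multipliers:
            candidates.append((cycles, mult, s))
    if not candidates:
        return None  # no scheme satisfies the constraint
    return min(candidates)[2]
```

With N = M = 64, a 3x3 kernel, G = 100, T = 10 and Q = 4, a budget of 10000 multipliers rules out scheme 2 and leaves scheme 4 as the fastest remaining option.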
Further, according to the relationship between the number of input channels and the number of output channels, convolutional layer structures are divided into the following four types:
Type 1: few input channels and few output channels;
Type 2: few input channels and many output channels;
Type 3: many input channels and few output channels;
Type 4: many input channels and many output channels.
Further, in step (2), the convolutional layer structure and parameters are obtained from the input source as follows:
1) Obtain the shape of the convolutional layer's weight tensor and parse from it the number of convolution kernels, the kernel size, and the stride;
2) Obtain the shape of the convolutional layer's input feature map tensor and parse from it the input feature map size and the number of input channels;
3) Quantize the values of the input feature map and of the convolutional layer's weights and biases, and convert them into hardware-format data files.
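The three steps above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the tensor layouts (kernels, channels, height, width) and (channels, height, width), the 16-bit fixed-point format with 8 fractional bits, and the function names. The stride is passed in separately because a weight tensor's shape alone does not encode it.

```python
def parse_structure(weight_shape, input_shape, stride):
    """Collect the layer structure from tensor shapes (assumed layouts)."""
    M, N, Hk, Wk = weight_shape   # assumed layout: (kernels, channels, height, width)
    N_in, H, W = input_shape      # assumed layout: (channels, height, width)
    if N != N_in:
        raise ValueError("kernel channels must match input channels")
    Hs, Ws = stride               # assumed to come from the layer definition
    return {"N": N, "H": H, "W": W, "M": M,
            "Hk": Hk, "Wk": Wk, "Hs": Hs, "Ws": Ws}


def to_fixed_hex(value, frac_bits=8, total_bits=16):
    """Quantize a float to signed fixed point and format it as hex text."""
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    scaled = max(lo, min(hi, scaled))            # saturate on overflow
    mask = (1 << total_bits) - 1
    return format(scaled & mask, "0{}x".format(total_bits // 4))
```

A list of such hex strings, one per line, is a common form for memory initialization files, though the source does not specify the hardware format it uses.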
Further, the optimal accelerator scheme is selected as follows:
1) Determine whether the layer has the first convolutional layer structure; if so, prefer the second acceleration scheme; otherwise go to 2);
2) Determine whether the layer has the second convolutional layer structure; if so, prefer the first or the second acceleration scheme; otherwise go to 3);
3) Determine whether the layer has the third convolutional layer structure; if so, prefer the third acceleration scheme; otherwise go to 4);
4) The layer must have the fourth convolutional layer structure; prefer the fourth acceleration scheme.
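The decision chain above amounts to a lookup from structure type to preferred scheme or schemes. A minimal sketch, assuming a count equal to a threshold is treated as "many" (the text leaves this case open); the names `PREFERRED` and `preferred_schemes` are hypothetical.

```python
# Structure type -> preferred acceleration scheme(s), as listed in the text.
PREFERRED = {1: (2,), 2: (1, 2), 3: (3,), 4: (4,)}


def preferred_schemes(n_in, n_out, n_i, n_o):
    """Return the preferred scheme numbers for a layer."""
    many_in = n_in >= n_i     # assumption: equality counts as "many"
    many_out = n_out >= n_o
    if not many_in and not many_out:
        kind = 1
    elif not many_in:
        kind = 2
    elif not many_out:
        kind = 3
    else:
        kind = 4
    return PREFERRED[kind]
```

When two schemes are listed, as for type 2, the final choice would depend on the resource and cycle estimates described elsewhere in the text.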
Further, in step (2), the final convolutional-layer hardware accelerator is generated as follows:
a. Obtain the convolutional layer structure and parameters from the input source, including a file containing the convolutional layer structure definition and data files containing the convolutional layer weights and biases;
b. Select the optimal accelerator scheme from the structural parameters of the convolutional layer, namely the convolution kernel size, input channel size, output channel size, and convolution stride, and generate the corresponding convolutional-layer accelerator.
Further, the convolutional layer parameters comprise the weights and biases, which are converted into hardware-format data files and stored in memory. The convolutional layer structure comprises the number of input channels of the input feature map, the width of the input feature map, the height of the input feature map, the number of convolution kernels (that is, the number of output channels), the kernel width, the kernel height, the kernel width stride, and the kernel height stride.
Compared with the prior art, the advantages and positive effects of the present invention are:
1. The invention designs four convolutional-layer accelerator schemes and divides convolutional layer structures into four types. Using the corresponding optimal accelerator scheme for each structure greatly reduces hardware resource consumption, and by combining parallel and pipelined operation it achieves comparable computing performance with lower hardware resource consumption.
2. The invention obtains the convolutional layer structure and parameters from the input source, adaptively selects the optimal scheme, and generates the hardware acceleration circuit, which greatly improves the flexibility and efficiency of hardware design.
3. By designing an accelerator solution pool, the method can select a different accelerator scheme for each convolutional layer structure, which both saves hardware resources and increases hardware parallel computing speed; by adaptively selecting the optimal scheme, it increases the flexibility of hardware design.
Brief description of the drawings
FIG. 1 is a schematic diagram of the output-channel parallel module and parallel acceleration scheme 1 in an embodiment of the present invention;
FIG. 2 is a schematic diagram of parallel acceleration scheme 2 in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the input-channel parallel module and parallel acceleration scheme 3 in an embodiment of the present invention;
FIG. 4 is a schematic diagram of parallel acceleration scheme 4 in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the input-channel pipeline in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the convolution-kernel pipeline in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the input-channel pipeline after splitting in an embodiment of the present invention;
FIG. 8 is a schematic diagram of the convolution-window pipeline in an embodiment of the present invention;
FIG. 9 is a schematic diagram of the adaptive accelerator design flow in an embodiment of the present invention;
FIG. 10 is a schematic flow chart of adaptively selecting the optimal scheme from the convolutional layer structure in an embodiment of the present invention.
Detailed description
Specific embodiments of the present invention are further described below with reference to the drawings. The embodiments of the present invention are not limited to these.
An adaptive convolutional-layer hardware accelerator design method. The convolutional layer structure is analyzed and divided into four types according to the number of input channels and the number of convolution kernels. Four different hardware accelerator schemes are designed for the four structure types, and all accelerator schemes are kept in a storage area called the accelerator solution pool.
Let N be the number of channels of the input feature map (the number of input channels), W the width of the input feature map, H the height of the input feature map, M the number of convolution kernels (the number of output channels), W_k the kernel width, H_k the kernel height, W_s the kernel width stride, and H_s the kernel height stride. The output feature map width W_o, the output feature map height H_o, and the number G of convolution windows generated by each input channel satisfy:
[Equation (1): image PCTCN2019114910-appb-000001]
[Equation (2): image PCTCN2019114910-appb-000002]
G = W_o * H_o    (3)
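Equations (1) and (2) survive here only as image references. A minimal sketch of the window count, assuming the standard no-padding relations W_o = floor((W - W_k) / W_s) + 1 and H_o = floor((H - H_k) / H_s) + 1; that form is an assumption and is not confirmed by the text, while G = W_o * H_o is equation (3) as stated. The function name `output_dims` is hypothetical.

```python
def output_dims(W, H, Wk, Hk, Ws, Hs):
    """Return (W_o, H_o, G) for a convolution without padding (assumed)."""
    Wo = (W - Wk) // Ws + 1   # assumed form of equation (1)
    Ho = (H - Hk) // Hs + 1   # assumed form of equation (2)
    G = Wo * Ho               # equation (3): windows per input channel
    return Wo, Ho, G
```

For a 5x5 input, a 3x3 kernel and stride 1 in both directions, this gives a 3x3 output and therefore 9 convolution windows per channel.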
The convolution operation is given by:
[Equation (4): image PCTCN2019114910-appb-000003]
where [image PCTCN2019114910-appb-000004] denotes the output of the g-th window of the m-th output channel, [image PCTCN2019114910-appb-000005] denotes the value at row j, column i of the g-th window of the n-th input channel of the input feature map, w_{mnij} denotes the weight at row j, column i of the n-th channel of the m-th convolution kernel, and b_m denotes the bias of the m-th convolution kernel.
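The symbol definitions above describe a direct multiply-accumulate over input channel n, kernel row j and kernel column i, plus the bias b_m. The following is a plain software reference for one convolution window, written from those definitions rather than from the equation image; the function name `conv_window` and the nested-list data layout are assumptions.

```python
def conv_window(x_g, w, b):
    """Compute all M output-channel values for one convolution window g.

    x_g[n][j][i]: value at row j, column i of window g in input channel n.
    w[m][n][j][i]: weight at row j, column i of channel n of kernel m.
    b[m]:          bias of kernel m.
    """
    out = []
    for m in range(len(w)):
        acc = b[m]
        for n in range(len(x_g)):
            for j in range(len(x_g[n])):
                for i in range(len(x_g[n][j])):
                    acc += x_g[n][j][i] * w[m][n][j][i]
        out.append(acc)
    return out
```

The four acceleration schemes differ only in which of these loops are unrolled into parallel hardware and which are streamed through a pipeline.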
Symbol description:
[image PCTCN2019114910-appb-000006]
According to formula (4), the intermediate result [image PCTCN2019114910-appb-000007] of convolving the g-th convolution window of the n-th input channel with the n-th channel of the m-th convolution kernel is as follows, where ⊙ denotes the convolution operation:
[Equation (5): image PCTCN2019114910-appb-000008]
The m-th channel [image PCTCN2019114910-appb-000009] of the convolution output of the g-th convolution window is then computed as:
[Equation (6): image PCTCN2019114910-appb-000010]
Define the matrix A^(g), whose entry in row m, column n is [image PCTCN2019114910-appb-000011]:
[Equation (7): image PCTCN2019114910-appb-000012]
The n-th column vector [image PCTCN2019114910-appb-000013] of matrix A^(g) is
[Equation (8): image PCTCN2019114910-appb-000014]
The m-th row vector [image PCTCN2019114910-appb-000015] of matrix A^(g) is
[Equation (9): image PCTCN2019114910-appb-000016]
Define the convolutional layer bias vector b, whose m-th entry b_m is the bias of the m-th output channel:
[Equation (10): image PCTCN2019114910-appb-000017]
Define the output feature map vector C^(g), whose m-th entry [image PCTCN2019114910-appb-000018] is the value of the m-th output channel for the g-th convolution window:
[Equation (11): image PCTCN2019114910-appb-000019]
From (7) and (8), the intermediate convolution result matrix A^(g) of the g-th convolution window satisfies
[Equation (12): image PCTCN2019114910-appb-000020]
From (7) and (9),
[Equation (13): image PCTCN2019114910-appb-000021]
From formula (6) and definitions (7), (8), (10), and (11),
[Equation (14): image PCTCN2019114910-appb-000022]
From formula (6) and definition (9),
[Equation (15): image PCTCN2019114910-appb-000023]
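The matrix view can be illustrated in software: A^(g) has one row per output channel and one column per input channel, it can be filled column by column (one input channel against all kernels) or row by row (one kernel against all input channels), and the output vector is the row sums plus the bias. The sketch below follows the verbal definitions; the exact forms of equations (12) to (15) are not available, so the function names and the data layout are assumptions.

```python
def dot(x_n, w_mn):
    """Multiply-accumulate of one window channel with one kernel channel."""
    return sum(x_n[j][i] * w_mn[j][i]
               for j in range(len(x_n)) for i in range(len(x_n[j])))


def build_A_by_columns(x_g, w):
    """Fill A column by column: input channel n against all M kernels."""
    M, N = len(w), len(x_g)
    A = [[0] * N for _ in range(M)]
    for n in range(N):
        for m in range(M):  # these M products would run in parallel in hardware
            A[m][n] = dot(x_g[n], w[m][n])
    return A


def build_A_by_rows(x_g, w):
    """Fill A row by row: kernel m against all N input channels."""
    return [[dot(x_g[n], w[m][n]) for n in range(len(x_g))]
            for m in range(len(w))]


def output_vector(A, b):
    """C[m] = sum over n of A[m][n], plus bias b[m]."""
    return [sum(row) + b_m for row, b_m in zip(A, b)]
```

Both fill orders produce the same matrix, which is why the same result can be reached by parallelizing over output channels or over input channels.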
Step 1: As shown in FIG. 1, [image PCTCN2019114910-appb-000024] and each element of w_{·n} are combined using (5) and (8) to obtain [image PCTCN2019114910-appb-000025]. This process computes the M output channels in parallel and is therefore called output-channel parallelism. The structure of FIG. 1 is encapsulated as the output-channel parallel module, whose inputs are [image PCTCN2019114910-appb-000026] and w_{·n} and whose output is [image PCTCN2019114910-appb-000027].
Step 2: As shown in FIG. 5, the output-channel parallel module from Step 1 is used to pipeline the N input channels, with one input channel entering per clock cycle. A^(g) is obtained from (12), and the values C^(g) of all convolution output channels are then obtained from (14). As shown in FIG. 8, the G convolution windows are then pipelined to obtain the complete convolution output feature map. This is called parallel acceleration scheme 1. Assuming one convolution window takes T clock cycles to compute and additions are performed with an adder tree, the whole convolution takes T+N+G clock cycles. It uses M*W_k*H_k multipliers, and the number of adders is [image PCTCN2019114910-appb-000028].
Step 3: As shown in FIG. 2, all N input channels pass through the output-channel parallel module of FIG. 1, and (12) gives A^(g); (14) then gives the values C^(g) of all convolution output channels. As shown in FIG. 8, the G convolution windows are then pipelined to obtain the complete convolution output feature map. This is called parallel acceleration scheme 2. Assuming one convolution window takes T clock cycles to compute and additions are performed with an adder tree, the whole convolution takes T+G clock cycles. It uses N*M*W_k*H_k multipliers, and the number of adders is [image PCTCN2019114910-appb-000029].
Step 4: As shown in FIG. 3, x^(g) and w_{m·} are combined using (5) and (9) to obtain [image PCTCN2019114910-appb-000030], and (6) is then used to obtain [image PCTCN2019114910-appb-000031]. This process computes the N input channels in parallel and is therefore called input-channel parallelism. The structure of FIG. 3 is encapsulated as the input-channel parallel module, whose inputs are x^(g), w_{m·}, and b_m and whose output is [image PCTCN2019114910-appb-000032].
Step 5: As shown in FIG. 6, the input-channel parallel module from Step 4 is used to pipeline the M convolution kernels, with one output channel entering per clock cycle. A^(g) is obtained from (13), and the values C^(g) of all convolution output channels are then obtained from (11) and (15). As shown in FIG. 8, the G convolution windows are then pipelined to obtain the complete convolution output feature map. This is called parallel acceleration scheme 3. Assuming one convolution window takes T clock cycles to compute and additions are performed with an adder tree, the whole convolution takes T+M+G clock cycles. It uses N*W_k*H_k multipliers, and the number of adders is [image PCTCN2019114910-appb-000033] [image PCTCN2019114910-appb-000034].
Step 6: Divide the N input channels into Q groups. So that every group involves the same amount of computation, let the number of input channels in each group be
[Equation (16): image PCTCN2019114910-appb-000035]
Each of the first Q-1 groups has u input channels, while the Q-th group has only N-uQ, so the Q-th group is padded with u(Q+1)-N input channels whose values are 0.
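The grouping with zero padding can be sketched as follows. Equation (16) is available only as an image reference, so this sketch assumes u = ceil(N / Q); under that assumption the last group holds N - u(Q-1) channels and is padded with uQ - N zero channels, which may not match the counts as printed in the text. Each channel is represented by a single number purely for brevity, and the name `split_channels` is hypothetical.

```python
import math


def split_channels(channels, Q):
    """Split a list of channels into Q equal groups, zero-padding the tail."""
    N = len(channels)
    u = math.ceil(N / Q)                          # assumed form of equation (16)
    padded = list(channels) + [0] * (u * Q - N)   # zero-valued filler channels
    return [padded[q * u:(q + 1) * u] for q in range(Q)]
```

Because the filler channels are zero, they add nothing to the accumulated sums, so every group can use identical hardware without changing the result.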
From (12), A^(g) is divided into Q parts, so the channel indices of the q-th share span the range [(q-1)u+1, qu], and the intermediate convolution output of the q-th share for the g-th convolution window is
Figure PCTCN2019114910-appb-000036
Let the intermediate value of the m-th convolution output channel for the g-th window of the q-th share,
Figure PCTCN2019114910-appb-000037
be
Figure PCTCN2019114910-appb-000038
Then, from formula (6),
Figure PCTCN2019114910-appb-000039
Let the output feature map for the g-th window of the q-th share,
Figure PCTCN2019114910-appb-000040
be
Figure PCTCN2019114910-appb-000041
Then, from formulas (19) and (20),
Figure PCTCN2019114910-appb-000042
As shown in Figure 4, passing all u input channels of the q-th share through the output-channel parallel module of Figure 1 yields the quantity in (17),
Figure PCTCN2019114910-appb-000043
Then (18) and (20) give
Figure PCTCN2019114910-appb-000044
As shown in Figure 7, the Q shares of input channels are then pipelined, and the values C^(g) of all convolution output channels are computed from (21). As shown in Figure 8, the G convolution windows are then pipelined to obtain all convolution output feature maps. This scheme is called Parallel Acceleration Scheme 4. Assuming that one convolution window takes T clock cycles to compute and that additions are performed with an adder tree, the analysis above shows that the whole convolution takes T+Q+G clock cycles. The number of multipliers consumed is u*M*W_k*H_k, and the number of adders is
Figure PCTCN2019114910-appb-000045
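The channel-splitting idea behind Parallel Acceleration Scheme 4 can be sketched in software as follows. The defining formula for u is available here only as a figure, so this sketch assumes u = ceil(N/Q) and pads the tail with all-zero channels until there are u*Q channels in total; under that assumption the padding count in the code may differ from the expression given in the text. All function names and the nested-list layout are hypothetical. The point of the sketch is that accumulating the per-share partial sums over the Q shares gives the same result as summing over all N channels at once, because zero channels contribute nothing.

```python
import math


def split_channels(window, Q):
    """Split N channels into Q shares of u channels, zero-padding the tail.

    Assumes u = ceil(N / Q); this is an assumption, not the patent's formula.
    """
    N = len(window)
    u = math.ceil(N / Q)
    Hk, Wk = len(window[0]), len(window[0][0])
    zero = [[0] * Wk for _ in range(Hk)]
    padded = list(window) + [zero] * (u * Q - N)
    return [padded[q * u:(q + 1) * u] for q in range(Q)], u


def partial_dot(channels, kernel_channels):
    """Multiply-accumulate over matching channels, rows and columns.

    zip() stops at the shorter sequence, so zero-padded channels that have
    no kernel counterpart are skipped; they would contribute 0 anyway.
    """
    return sum(
        x * w
        for ch, k_ch in zip(channels, kernel_channels)
        for row, k_row in zip(ch, k_ch)
        for x, w in zip(row, k_row)
    )


def scheme4_window(window, kernels, Q):
    """Compute all M outputs for one window by accumulating Q partial sums."""
    groups, u = split_channels(window, Q)
    outputs = [0] * len(kernels)
    for q, group in enumerate(groups):        # pipelined over the Q shares
        for m, kernel in enumerate(kernels):  # all M channels in parallel in hardware
            outputs[m] += partial_dot(group, kernel[q * u:(q + 1) * u])
    return outputs


def scheme4_cost(N, M, G, Q, Wk, Hk, T):
    """Cycle and multiplier counts as stated in the text, with u = ceil(N / Q)."""
    u = math.ceil(N / Q)
    return {"cycles": T + Q + G, "multipliers": u * M * Wk * Hk}
```

Compared with processing all N channels at once, a larger Q lowers the multiplier count (through a smaller u) at the price of more pipeline stages in the T+Q+G cycle estimate.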
Step 8: As shown in Figure 9, the convolutional layer structure and convolutional layer parameters are obtained from the input source. The optimal accelerator scheme is selected from the accelerator scheme pool according to the convolutional layer structure, the corresponding convolutional-layer accelerator is built from that scheme, and it is combined with the network parameters to generate the final hardware accelerator. The convolutional layer parameters comprise the weights w and the biases b, which are converted into hardware-format data files and stored in memory. The convolutional layer structure comprises the number of input channels N of the input feature map, the width W and height H of the input feature map, the number of convolution kernels (that is, the number of output channels) M, the kernel width W_k, the kernel height H_k, the kernel stride along the width W_s, and the kernel stride along the height H_s; these values are stored in a parameter file. The optimal acceleration scheme is selected from the accelerator scheme pool based on these parameters, and the convolution hardware accelerator is finally generated.
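One possible way to hold the structure values listed in Step 8, and to turn weights or biases into a hardware-style data format, is sketched below. Everything concrete here is an assumption for illustration: the class and function names, the window-count formula (which assumes no padding of the feature map), and the choice of 8-bit signed fixed point with 4 fractional bits written as two's-complement hexadecimal. The text does not specify the actual file format or quantization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvLayerStructure:
    """Structure values named in Step 8 (field names follow the text's symbols)."""
    N: int   # number of input channels
    W: int   # input feature map width
    H: int   # input feature map height
    M: int   # number of kernels, i.e. output channels
    Wk: int  # kernel width
    Hk: int  # kernel height
    Ws: int  # stride along the width
    Hs: int  # stride along the height

    def num_windows(self) -> int:
        """Number of convolution windows G, assuming no padding (an assumption)."""
        across = (self.W - self.Wk) // self.Ws + 1
        down = (self.H - self.Hk) // self.Hs + 1
        return across * down


def to_hex_lines(values, frac_bits=4, width=8):
    """Quantize floats to signed fixed point and format each as hex.

    Hypothetical format: `width`-bit two's complement with `frac_bits`
    fractional bits, saturating at the representable range.
    """
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    mask = (1 << width) - 1
    lines = []
    for v in values:
        q = max(lo, min(hi, round(v * (1 << frac_bits))))
        lines.append(format(q & mask, f"0{width // 4}x"))
    return lines
```

A layer described this way can feed directly into cycle estimates such as T+M+G, since G is derived from the stored structure values rather than entered by hand.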
Step 9: As shown in Figure 10, the optimal accelerator scheme is selected as follows, after first specifying the input-channel count threshold N_i and the output-channel count threshold N_o:
1. Check whether a scheme has been specified manually. If so, select that scheme and finish; otherwise go to 2.
2. Check whether hardware resource consumption and speed requirements have been given. If so, compute the hardware resource consumption and running speed of every scheme in the accelerator scheme pool and go to 3; otherwise go to 4.
3. Check whether any scheme meets the requirements. If so, select that scheme and finish; otherwise go to 7.
4. If the number of input channels is less than N_i, go to 5; otherwise go to 6.
5. If the number of output channels is less than N_o, select the second hardware accelerator scheme; otherwise select the first or the second hardware accelerator scheme. Then use that scheme and go to 7.
6. If the number of output channels is less than N_o, select the third hardware accelerator scheme; otherwise select the fourth hardware accelerator scheme. Then use that scheme and go to 7.
7. Decide whether to continue. If so, go to 1; otherwise finish.
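The selection flow above can be sketched as a single function. The function name, the argument names, and the shape of the `requirement` and `estimates` dictionaries are hypothetical; in particular, the text does not say how a scheme is judged to "meet the requirements", so this sketch simply compares estimated resources and cycles against caller-supplied upper limits and takes the lowest-numbered scheme that fits. Where the text allows either the first or the second scheme, the sketch picks the first. The "continue" decision in item 7 is left to the caller.

```python
def select_scheme(n_in, n_out, n_i, n_o, forced=None, requirement=None, estimates=None):
    """Return a scheme number 1..4, or None if no scheme meets the requirement.

    n_in, n_out : input / output channel counts of the layer
    n_i, n_o    : user-specified thresholds N_i and N_o
    forced      : manually specified scheme, if any (item 1)
    requirement : hypothetical dict with "max_resources" and "max_cycles"
    estimates   : hypothetical dict {scheme: {"resources": ..., "cycles": ...}}
    """
    if forced is not None:  # item 1: a manually chosen scheme wins
        return forced
    if requirement is not None:  # items 2-3: filter by estimated cost
        for scheme, est in sorted((estimates or {}).items()):
            if (est["resources"] <= requirement["max_resources"]
                    and est["cycles"] <= requirement["max_cycles"]):
                return scheme
        return None  # no scheme fits; the caller decides whether to retry
    if n_in < n_i:  # items 4-5: few input channels
        return 2 if n_out < n_o else 1  # text allows 1 or 2; 1 chosen here
    return 3 if n_out < n_o else 4  # item 6: many input channels
```

For example, with thresholds N_i = 16 and N_o = 32, a layer with 3 input channels and 8 output channels falls into the "second scheme" branch, while a layer with 64 input and 64 output channels falls into the "fourth scheme" branch.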

Claims (10)

  1. A design method for an adaptive convolutional-layer hardware accelerator, characterized by comprising the following steps:
    (1) analyzing the convolutional layer structure, designing four different hardware accelerator schemes for different convolutional layer structures, and storing the four hardware accelerator schemes in an accelerator scheme pool;
    (2) obtaining the convolutional layer structure and convolutional layer parameters from an input source, then selecting the optimal accelerator scheme from the accelerator scheme pool according to the convolutional layer structure, and constructing the corresponding convolutional-layer accelerator from that scheme.
  2. The hardware accelerator design method according to claim 1, characterized in that the accelerator scheme pool contains the following hardware accelerator schemes:
    Parallel Acceleration Scheme 1, in which the output channels are computed in parallel, and the input channels and the convolution windows are each pipelined;
    Parallel Acceleration Scheme 2, in which the output channels and the input channels are computed in parallel, and the convolution windows are pipelined;
    Parallel Acceleration Scheme 3, in which the input channels are computed in parallel, and the output channels and the convolution windows are each pipelined;
    Parallel Acceleration Scheme 4, in which a subset of the input channels and the output channels are computed in parallel, and the input-channel subsets and the convolution windows are each pipelined;
    the four hardware accelerator schemes being stored in a storage area referred to as the accelerator scheme pool.
  3. The hardware accelerator design method according to claim 1, characterized in that the hardware accelerator is generated from the optimal accelerator scheme and the convolutional layer parameters.
  4. The hardware accelerator design method according to claim 2, characterized in that Parallel Acceleration Scheme 4 proceeds by dividing the input channels into several equal shares; for each share, convolving one convolution window of its input channels with all convolution kernels; then pipelining over the shares of input channels to obtain the convolution output of one convolution window across all input channels; and then pipelining over the convolution windows to obtain the convolution output of all input channels.
  5. The hardware accelerator design method according to claim 3, characterized in that step (2) specifically comprises: obtaining from the input source the height and width of the input feature map and its number of input channels; obtaining the height and width of the convolution kernel, the number of convolution kernels, and the strides along the width and the height; obtaining the values of the input feature map, the convolutional layer weights, and the convolutional layer biases; estimating, from the convolutional layer parameters, the hardware resources consumed and the clock cycles required by each acceleration scheme; and combining these estimates with the constraints the user imposes for the task to select the optimal accelerator scheme, thereby generating the convolutional-layer accelerator.
  6. The hardware accelerator design method according to claim 1, characterized in that, using a user-specified input-channel count threshold N_i and output-channel count threshold N_o, convolutional layer structures are divided into the following four types: the number of input channels is less than N_i and the number of output channels is less than N_o; the number of input channels is less than N_i and the number of output channels is greater than N_o; the number of input channels is greater than N_i and the number of output channels is less than N_o; the number of input channels is greater than N_i and the number of output channels is greater than N_o.
  7. The hardware accelerator design method according to claim 1, characterized in that the convolutional layer structure and parameters are obtained from the input source by the following specific steps:
    1) obtaining the shape of the weight tensor of the convolutional layer, and from it deriving the number of convolution kernels, the kernel size, and the stride;
    2) obtaining the shape of the input feature map tensor of the convolutional layer, and from it deriving the size of the input feature map and the number of input channels;
    3) quantizing the values of the input feature map, the convolutional layer weights, and the convolutional layer biases, and converting them into hardware-format data files.
  8. The hardware accelerator design method according to claim 5, characterized in that the optimal accelerator scheme is selected by the following steps:
    1. determining whether the layer belongs to the first type of convolutional layer structure; if so, preferring the second acceleration scheme; otherwise going to 2;
    2. determining whether the layer belongs to the second type of convolutional layer structure; if so, preferring the first or the second acceleration scheme; otherwise going to 3;
    3. determining whether the layer belongs to the third type of convolutional layer structure; if so, preferring the third acceleration scheme; otherwise going to 4;
    4. the layer necessarily belongs to the fourth type of convolutional layer structure, and the fourth acceleration scheme is preferred.
  9. The hardware accelerator design method according to claim 1, characterized in that step (2) specifically comprises:
    a. obtaining the convolutional layer structure and parameters from the input source, including a file containing the definition of the convolutional layer structure and data files containing the convolutional layer weights and convolutional layer biases;
    b. selecting the optimal accelerator scheme from the convolutional layer structure parameters, namely the kernel size, the number of input channels, the number of output channels, and the convolution stride, and generating the corresponding convolutional-layer accelerator.
  10. The hardware accelerator design method according to claim 1, characterized in that the convolutional layer parameters comprise weights and biases, and the convolutional layer structure comprises the number of input channels of the input feature map, the width of the input feature map, the height of the input feature map, the number of convolution kernels (that is, the number of output channels), the kernel width, the kernel height, the kernel stride along the width, and the kernel stride along the height.
PCT/CN2019/114910 2018-12-15 2019-10-31 Self-adaptive selection and design method for convolutional-layer hardware accelerator WO2020119318A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811537915.0A CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator
CN201811537915.0 2018-12-15

Publications (1)

Publication Number Publication Date
WO2020119318A1 true WO2020119318A1 (en) 2020-06-18

Family

ID=66360373

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114910 WO2020119318A1 (en) 2018-12-15 2019-10-31 Self-adaptive selection and design method for convolutional-layer hardware accelerator

Country Status (2)

Country Link
CN (1) CN109740731B (en)
WO (1) WO2020119318A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950656A (en) * 2021-03-09 2021-06-11 北京工业大学 Block convolution method for pre-reading data according to channel based on FPGA platform

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110263923B (en) * 2019-08-12 2019-11-29 上海燧原智能科技有限公司 Tensor convolutional calculation method and system
CN110503201A (en) * 2019-08-29 2019-11-26 苏州浪潮智能科技有限公司 A kind of neural network distributed parallel training method and device
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111047010A (en) * 2019-11-25 2020-04-21 天津大学 Method and device for reducing first-layer convolution calculation delay of CNN accelerator
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN111931909B (en) * 2020-07-24 2022-12-20 北京航空航天大学 Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN114186677A (en) * 2020-09-15 2022-03-15 中兴通讯股份有限公司 Accelerator parameter determination method and device and computer readable medium
CN115145839B (en) * 2021-03-31 2024-05-14 广东高云半导体科技股份有限公司 Depth convolution accelerator and method for accelerating depth convolution

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869117A (en) * 2016-03-28 2016-08-17 上海交通大学 Method for accelerating GPU directed at deep learning super-resolution technology
US20180157969A1 (en) * 2016-12-05 2018-06-07 Beijing Deephi Technology Co., Ltd. Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN207993065U (en) * 2017-01-04 2018-10-19 意法半导体股份有限公司 Configurable accelerator frame apparatus and the system for depth convolutional neural networks
CN108805267A (en) * 2018-05-28 2018-11-13 重庆大学 The data processing method hardware-accelerated for convolutional neural networks
CN108875915A (en) * 2018-06-12 2018-11-23 辽宁工程技术大学 A kind of depth confrontation network optimized approach of Embedded application
CN109740731A (en) * 2018-12-15 2019-05-10 华南理工大学 A kind of adaptive convolutional layer hardware accelerator design method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN207440765U (en) * 2017-01-04 2018-06-01 意法半导体股份有限公司 System on chip and mobile computing device
EP3346427B1 (en) * 2017-01-04 2023-12-20 STMicroelectronics S.r.l. Configurable accelerator framework, system and method
CN107993186B (en) * 2017-12-14 2021-05-25 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108280514B (en) * 2018-01-05 2020-10-16 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN108133270B (en) * 2018-01-12 2020-08-04 清华大学 Convolutional neural network acceleration method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAO, HAO ET AL.: "Design of FPGA Hardware Accelerator for Convolutional Neural Network", INDUSTRIAL CONTROL COMPUTER, vol. 31, no. 6, 30 June 2018 (2018-06-30) *

Also Published As

Publication number Publication date
CN109740731A (en) 2019-05-10
CN109740731B (en) 2023-07-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19896413

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19896413

Country of ref document: EP

Kind code of ref document: A1