CN109740731B - Design method of self-adaptive convolution layer hardware accelerator - Google Patents


Info

Publication number
CN109740731B
CN109740731B (application CN201811537915.0A)
Authority
CN
China
Prior art keywords
convolution
convolution layer
accelerator
scheme
input
Prior art date
Legal status
Active
Application number
CN201811537915.0A
Other languages
Chinese (zh)
Other versions
CN109740731A (en
Inventor
秦华标 (Qin Huabiao)
曹钦平 (Cao Qinping)
Current Assignee
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201811537915.0A priority Critical patent/CN109740731B/en
Publication of CN109740731A publication Critical patent/CN109740731A/en
Priority to PCT/CN2019/114910 priority patent/WO2020119318A1/en
Application granted granted Critical
Publication of CN109740731B publication Critical patent/CN109740731B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a design method for a self-adaptive convolution layer hardware accelerator, comprising: designing a pool of convolution layer accelerator schemes, adaptively selecting the optimal scheme, and generating the hardware accelerator. The method first analyzes the characteristics and parallelism of the convolution layer structure, then designs four different accelerator schemes by evaluating hardware resource consumption and running speed, storing all of them in a storage area referred to as the accelerator scheme pool. Finally, it acquires the convolution layer structure and parameters from an input source, selects the optimal accelerator scheme for the given structure, and generates the final hardware accelerator. By designing the accelerator scheme pool, adaptively selecting the optimal scheme, and generating the hardware accelerator, the method makes the hardware design more flexible, reduces resource consumption, and increases the parallel operation speed of the convolution layer.

Description

Design method of self-adaptive convolution layer hardware accelerator
Technical Field
The invention relates to the design of convolutional neural network hardware accelerators, belongs to the technical field of integrated circuit hardware acceleration, and particularly relates to a method for adaptively selecting the optimal hardware acceleration scheme and generating a hardware accelerator according to different convolution layer structures.
Background
In recent years, convolutional neural networks have been widely used in image classification, object detection, speech recognition, and related fields. However, while achieving very high accuracy, convolutional neural networks demand substantial computing and memory resources, so many applications based on them must rely on large servers. At the same time, deep learning techniques such as convolutional neural networks are increasingly deployed on embedded platforms with limited resources. A convolutional neural network typically contains a large number of convolution layers that can be computed in parallel, so hardware accelerator design for convolution layers is a necessary direction of future development.
Regarding the hardware acceleration of convolution layers, current research mainly accelerates every layer with the same hardware circuit architecture, without considering the structure of each convolution layer; because this approach is not optimized for different structures, it consumes more hardware resources and lowers the parallel computing speed. Current hardware designs also mainly expose a raw hardware interface with relatively many parameters and a complex structure for a convolution layer, so circuit flexibility is very poor. Given these limitations of the prior art, corresponding accelerator schemes can be designed for different convolution layer structures, with the storage area holding all accelerator schemes referred to as the accelerator scheme pool; the convolution layer structure is obtained from an input source, the optimal scheme is selected from the accelerator scheme pool, and the hardware accelerator is finally generated. A search of the prior literature found no report of designing different accelerator schemes for different convolution layer structures and adaptively selecting the optimal one.
Disclosure of Invention
The invention provides a design method of a self-adaptive convolution layer hardware accelerator, which overcomes the defects in the existing convolution layer hardware acceleration technology.
According to the invention, four different accelerator schemes are designed, the optimal scheme is selected in a self-adaptive mode, and the hardware accelerator is generated, so that not only is the flexibility of hardware design improved, but also the resource consumption is reduced, and the operation speed is improved.
The invention is realized by the following technical scheme. The design method of the self-adaptive convolution layer hardware accelerator comprises the following steps:
(1) Analyzing the characteristics of the convolution layer structure, dividing convolution layer structures into four types according to the number of input channels and the number of convolution kernels, designing four different hardware accelerator schemes for these four structures, and storing all accelerator schemes in a storage area referred to as the accelerator scheme pool.
(2) Acquiring the convolution layer structure and parameters from an input source, selecting the optimal scheme from the accelerator scheme pool according to the convolution layer structure, constructing the corresponding convolution layer accelerator from that scheme, and, combined with the network parameters, generating the final hardware accelerator.
Further, in step (1), the user specifies a threshold N_i on the number of input channels and a threshold N_o on the number of output channels, and the convolution layer structure is divided into the following four types: the number of input channels is less than N_i and the number of output channels is less than N_o; the number of input channels is less than N_i and the number of output channels is greater than N_o; the number of input channels is greater than N_i and the number of output channels is less than N_o; the number of input channels is greater than N_i and the number of output channels is greater than N_o.
The hardware acceleration schemes are as follows:
Parallel acceleration scheme one: parallel operation over the output channels, with pipelined operation over the input channels and over the convolution windows.
Parallel acceleration scheme two: parallel operation over the output channels and the input channels, with pipelined operation over the convolution windows.
Parallel acceleration scheme three: parallel operation over the input channels, with pipelined operation over the output channels and over the convolution windows.
Parallel acceleration scheme four: parallel operation over part of the input channels and over the output channels, with pipelined operation over the input-channel parts and over the convolution windows.
The four hardware accelerator schemes are stored in a memory region referred to as the accelerator scheme pool.
Further, adaptively selecting the optimal scheme and generating the hardware accelerator comprises the following steps: first, obtain the convolution layer structure and convolution layer parameters from an input source; then select the optimal accelerator scheme from the accelerator scheme pool according to the convolution layer structure; finally, generate the final hardware accelerator from the optimal accelerator scheme and the convolution layer parameters.
Further, parallel acceleration scheme four divides the input channels into several equal parts, and for each part convolves one convolution window of its input channels with all convolution kernels; the parts of input channels are then pipelined to obtain the convolution output of one window over all input channels; the convolution windows are then pipelined to obtain the convolution output of all input channels.
Further, obtaining the convolution layer parameters includes obtaining from the input source the height and width of the input feature map; the height and width of the convolution kernels, the number of convolution kernels, and the width stride and height stride over the input feature map; and the values of the input feature map, the weights, and the biases. The hardware resources consumed by each acceleration scheme and the clock cycles required are estimated from the convolution layer parameters, and these estimates are combined with the user's task constraints to select the optimal accelerator scheme and generate the convolution layer hardware accelerator.
Further, according to the relation between the number of input channels and the number of output channels, the convolution layer structure is divided into the following four types:
first kind: the number of input channels is small, and the number of output channels is small.
Second kind: the number of input channels is small, and the number of output channels is large.
Third kind: the number of input channels is more, and the number of output channels is less.
Fourth kind: the number of input channels is large, and the number of output channels is large.
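The four-way division above can be sketched as a small classification helper (an illustrative sketch, not code from the patent; the thresholds N_i and N_o are user-specified as in the text, and boundary cases equal to a threshold are grouped with "not less than" here since the text leaves them unspecified):

```python
def classify_layer(num_input_channels: int, num_output_channels: int,
                   n_i: int, n_o: int) -> int:
    """Return the convolution layer structure type (1-4) described in the text."""
    if num_input_channels < n_i:
        # few input channels: type 1 (few outputs) or type 2 (many outputs)
        return 1 if num_output_channels < n_o else 2
    # many input channels: type 3 (few outputs) or type 4 (many outputs)
    return 3 if num_output_channels < n_o else 4
```

For example, a first layer with 3 input channels and 16 kernels would fall into the first type under thresholds N_i = 8, N_o = 32.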
Further, in the step (2), the step of obtaining the convolution layer structure and parameters from the input source is as follows:
1) Acquiring the shape of the weight tensor of the convolution layer, so as to analyze the number of convolution kernels of the convolution layer, the size of the convolution kernels and the step length;
2) Acquiring the shape of the tensor of the input feature map of the convolution layer, analyzing the size of the input feature map of the convolution layer and the number of input channels;
3) Quantizing the values of the convolutional layer input feature map and the values of the convolutional layer weights and offsets, and converting the values into a hardware format data file;
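Steps 1) and 2) above can be sketched as shape parsing. This sketch assumes an (M, N, H_k, W_k) weight-tensor layout and an (N, H, W) input-feature-map layout; the actual input-source format is not specified in this excerpt, and the function and key names are illustrative:

```python
def parse_structure(weight_shape, input_shape, strides):
    """Recover the convolution layer structure from tensor shapes."""
    m, n, h_k, w_k = weight_shape      # kernel count, channels, kernel height/width
    n_in, h, w = input_shape           # input channels, feature-map height/width
    assert n == n_in, "weight channel count must match input channel count"
    h_s, w_s = strides                 # height stride, width stride
    return {"M": m, "N": n, "H_k": h_k, "W_k": w_k,
            "H": h, "W": w, "H_s": h_s, "W_s": w_s}
```

A 16-kernel 3x3 layer over a 3-channel 32x32 map would then parse to M = 16, N = 3, W_k = H_k = 3.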
further, the step of selecting the optimal accelerator scheme is specifically as follows:
1) Judging whether the layer belongs to the first convolution layer structure; if so, preferentially adopting acceleration scheme two, otherwise executing step 2);
2) Judging whether the layer belongs to the second convolution layer structure; if so, preferentially adopting acceleration scheme one or two, otherwise executing step 3);
3) Judging whether the layer belongs to the third convolution layer structure; if so, preferentially adopting acceleration scheme three, otherwise executing step 4);
4) The layer necessarily belongs to the fourth convolution layer structure, and acceleration scheme four is preferentially adopted.
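The selection steps above reduce to a lookup from structure type to preferred scheme (a sketch; for the second structure the text allows scheme one or two, and scheme one is returned here as an arbitrary choice):

```python
def preferred_scheme(structure_type: int) -> int:
    """Map a convolution layer structure type (1-4) to its preferred scheme."""
    # type 1 -> scheme two; type 2 -> scheme one (or two); type 3 -> three; type 4 -> four
    return {1: 2, 2: 1, 3: 3, 4: 4}[structure_type]
```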
Further, in the step (2), the step of generating the final convolution layer hardware accelerator is as follows:
a. acquiring the convolution layer structure and parameters from an input source, comprising a file defining the convolution layer structure and a data file containing the convolution layer weights and biases;
b. selecting the optimal accelerator scheme from the structural parameters of the convolution layer, namely the convolution kernel size, the number of input channels, the number of output channels, and the convolution stride, and generating the corresponding convolution layer accelerator.
Further, the convolution layer parameters comprise the weights and biases, which are converted into hardware-format data files and stored in memory; the convolution layer structure comprises the number of input channels of the input feature map, the width and height of the input feature map, the number of convolution kernels (i.e., the number of output channels), the width and height of the convolution kernels, and the width stride and height stride of the convolution kernels.
Compared with the prior art, the invention has the advantages and positive effects that:
1. the invention designs four convolution layer accelerator schemes and divides convolution layer structures into four types; by using the optimal accelerator scheme for each structure, hardware resource consumption can be greatly reduced, and by employing parallel and pipelined operations, similar computing performance can be achieved at lower hardware resource cost.
2. The invention can acquire the convolution layer structure and the convolution layer parameters from the input source, adaptively select the optimal scheme and generate the hardware accelerating circuit, thereby greatly improving the flexibility and the efficiency of hardware design.
3. According to the method, by designing the accelerator scheme pool, different accelerator schemes can be selected according to different convolution layer structures, so that hardware resources are saved, and hardware parallel computing speed is improved; and the flexibility of hardware design is improved by adaptively selecting an optimal scheme.
Drawings
FIG. 1 is a schematic diagram of the output channel parallel module and parallel acceleration scheme one in an embodiment of the present invention.
FIG. 2 is a schematic diagram of parallel acceleration scheme two in an embodiment of the present invention.
FIG. 3 is a schematic diagram of the input channel parallel module and parallel acceleration scheme three in an embodiment of the present invention.
FIG. 4 is a schematic diagram of parallel acceleration scheme four in an embodiment of the present invention.
FIG. 5 is a schematic diagram of an input channel pipeline in an embodiment of the invention.
FIG. 6 is a schematic diagram of a convolution kernel pipeline in an embodiment of the present invention.
FIG. 7 is a schematic diagram of a pipeline of input channels after segmentation in an embodiment of the present invention.
FIG. 8 is a schematic diagram of a convolution window pipeline in an embodiment of the present invention.
FIG. 9 is a schematic diagram of an adaptive accelerator design flow in an embodiment of the invention.
FIG. 10 is a flow chart of adaptive selection of an optimal scheme from a convolutional layer structure in an embodiment of the invention.
Detailed Description
The following describes the embodiments of the present invention further with reference to the drawings. Embodiments of the present invention are not limited thereto.
A design method of a self-adaptive convolution layer hardware accelerator. Let N be the number of input channels, W the width of the input feature map, H the height of the input feature map, M the number of convolution kernels (i.e., the number of output channels), W_k the convolution kernel width, H_k the convolution kernel height, W_s the convolution kernel width stride, and H_s the convolution kernel height stride. The output feature map width W_o, the output feature map height H_o, and the number G of convolution windows generated per input channel satisfy:

W_o = (W - W_k) / W_s + 1  (1)
H_o = (H - H_k) / H_s + 1  (2)
G = W_o * H_o  (3)
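As a quick check of equations (1)-(3), the output geometry can be computed directly (a sketch assuming no padding, which the text does not mention; the function name is illustrative):

```python
def output_geometry(w, h, w_k, h_k, w_s, h_s):
    """Output width W_o, height H_o, and window count G per equations (1)-(3)."""
    w_o = (w - w_k) // w_s + 1   # equation (1); integer division assumes an exact fit
    h_o = (h - h_k) // h_s + 1   # equation (2)
    return w_o, h_o, w_o * h_o   # G = W_o * H_o, equation (3)
```

For a 32x32 input with a 3x3 kernel and stride 1, this gives a 30x30 output and G = 900 convolution windows.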
the convolution operation formula is:
wherein the method comprises the steps ofOutput of g window representing mth output channel,/>The ith row and ith column values of the jth window, w, representing the nth input channel of the input profile mnij The ith row and column weights of the nth channel representing the mth convolution kernel, b m Representing the offset of the mth convolution kernel.
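The convolution of equation (4) can be sketched as a direct nested-loop evaluation over plain Python lists (illustrative reference code, not the hardware implementation; indexing follows the symbol definitions above):

```python
def conv_direct(x, w, b, w_k, h_k, w_s, h_s):
    """x[n][row][col]: input feature maps; w[m][n][i][j]: kernels; b[m]: biases."""
    n = len(x)
    h, w_in = len(x[0]), len(x[0][0])
    m = len(w)
    h_o = (h - h_k) // h_s + 1          # output height, as in equation (2)
    w_o = (w_in - w_k) // w_s + 1       # output width, as in equation (1)
    out = [[[0.0] * w_o for _ in range(h_o)] for _ in range(m)]
    for mm in range(m):                 # each output channel
        for r in range(h_o):
            for c in range(w_o):        # each convolution window
                acc = b[mm]             # bias b_m
                for nn in range(n):     # sum over input channels and kernel taps
                    for i in range(h_k):
                        for j in range(w_k):
                            acc += x[nn][r * h_s + i][c * w_s + j] * w[mm][nn][i][j]
                out[mm][r][c] = acc
    return out
```

A one-channel 3x3 input convolved with a single all-ones 2x2 kernel produces the four 2x2 block sums.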
Symbol description:
By equation (4), the intermediate result s_mn^(g) of convolving the g-th convolution window of the n-th input channel with the n-th channel of the m-th convolution kernel is

s_mn^(g) = Σ_{i=1..H_k} Σ_{j=1..W_k} x_nij^(g) * w_mnij  (5)

The output of the m-th channel for the g-th convolution window is then calculated as

y_m^(g) = Σ_{n=1..N} s_mn^(g) + b_m  (6)

Define the matrix A^(g), whose row-m, column-n entry is

A^(g)[m][n] = s_mn^(g)  (7)

the n-th column vector of A^(g),

a_.n^(g) = (s_1n^(g), s_2n^(g), ..., s_Mn^(g))^T  (8)

and the m-th row vector of A^(g),

a_m.^(g) = (s_m1^(g), s_m2^(g), ..., s_mN^(g))  (9)

Define the convolution layer bias vector b, whose m-th entry is the bias b_m of the m-th output channel,

b = (b_1, b_2, ..., b_M)^T  (10)

and the output feature map vector C^(g), whose m-th entry is the value of the m-th output channel for the g-th convolution window,

C^(g) = (y_1^(g), y_2^(g), ..., y_M^(g))^T  (11)

From (7) and (8), the result matrix A^(g) in the convolution of the g-th convolution window satisfies

A^(g) = (a_.1^(g), a_.2^(g), ..., a_.N^(g))  (12)

and from (7) and (9),

A^(g) = (a_1.^(g); a_2.^(g); ...; a_M.^(g))  (13)

From equation (6) and definitions (7), (8), (10), and (11) it can be deduced that

C^(g) = Σ_{n=1..N} a_.n^(g) + b  (14)

and from equation (6) and definition (9) it can be deduced that

y_m^(g) = a_m.^(g) · (1, 1, ..., 1)^T + b_m  (15)
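The decomposition in definitions (5)-(14) can be sketched for a single convolution window: each entry s[m][n] is the window/kernel-channel dot product of (5), and summing the columns of A^(g) and adding the bias gives C^(g) as in (14). A minimal illustration (function and variable names are assumptions):

```python
def window_outputs(xg, w, b):
    """xg[n][i][j]: g-th window of input channel n; w[m][n][i][j]; b[m] -> C^(g)."""
    m_k, n_ch = len(w), len(xg)
    # s[m][n] = s_mn^(g), equation (5); the full matrix is A^(g), equation (7)
    s = [[sum(xg[n][i][j] * w[m][n][i][j]
              for i in range(len(xg[n]))
              for j in range(len(xg[n][0])))
          for n in range(n_ch)]
         for m in range(m_k)]
    # C^(g): sum each row of A^(g) over the input channels and add b_m, per (6)/(14)
    return [sum(s[m]) + b[m] for m in range(m_k)]
```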
Step one: As shown in FIG. 1, each element of a_.n^(g) is obtained from x_n^(g) and w_.n by (5) and (8), where x_n^(g) is the g-th convolution window of the n-th input channel and w_.n denotes the n-th channel of each of the M convolution kernels. This process calculates the M output channels in parallel and is therefore called output channel parallelism. The structure of FIG. 1 is encapsulated as the output channel parallel module, whose inputs are x_n^(g) and w_.n and whose output is a_.n^(g).
Step two: As shown in FIG. 5, the output channel parallel module of step one is used to pipeline the N input channels, i.e., one input channel is fed per clock cycle; A^(g) is obtained according to (12), and the values C^(g) of all convolution output channels are obtained according to (14). As shown in FIG. 8, the G convolution windows are then pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme one. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, analysis shows that the whole convolution operation requires T + N + G clock cycles. The number of multipliers consumed is M * W_k * H_k, plus the adders of the corresponding addition tree.
Step three: As shown in FIG. 2, the output channel parallel module of FIG. 1 and (12) are applied to all N input channels simultaneously to obtain A^(g), and (14) then yields the values C^(g) of all convolution output channels. As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme two. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + G clock cycles. The number of multipliers consumed is N * M * W_k * H_k, plus the adders of the corresponding addition tree.
Step four: As shown in FIG. 3, a_m.^(g) is obtained from x^(g) and w_m by (5) and (9), and y_m^(g) is then obtained by (6), where x^(g) denotes the g-th convolution window of all N input channels and w_m denotes the m-th convolution kernel. This process calculates the N input channels in parallel, which is called input channel parallelism. The structure of FIG. 3 is encapsulated as the input channel parallel module, whose inputs are x^(g), w_m, and b_m and whose output is y_m^(g).
Step five: As shown in FIG. 6, the input channel parallel module of step four is used to pipeline the M convolution kernels, i.e., one output channel is produced per clock; A^(g) is obtained according to (13), and the values C^(g) of all convolution output channels are obtained according to (11) and (15). As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme three. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + M + G clock cycles. The number of multipliers consumed is N * W_k * H_k, plus the adders of the corresponding addition tree.
Step six: Divide the N input channels into Q parts with an equal number of channels per part,

u = ceil(N / Q)  (16)

The first Q - 1 parts each contain u input channels, while the Q-th part contains only N - u(Q - 1) input channels, so the Q-th part is padded with uQ - N input channels whose values are 0.
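The channel split of step six can be sketched as follows, assuming the consistent reading that the last part is padded with u*Q - N zero-valued channels (names are illustrative, and scalar zeros stand in for zero-valued feature maps):

```python
import math

def partition_channels(channels, q):
    """Split N input channels into q parts of u = ceil(N/q) channels, zero-padded."""
    n = len(channels)
    u = math.ceil(n / q)                      # channels per part, equation (16)
    padded = channels + [0] * (u * q - n)     # pad the last part with zero channels
    return [padded[p * u:(p + 1) * u] for p in range(q)]
```

Splitting 7 channels into Q = 3 parts gives u = 3, with two zero channels padding the last part.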
As can be seen from (12), A^(g) is divided into Q parts, such that the q-th part covers the input-channel indices [(q-1)u + 1, qu]; the convolution intermediate output of the q-th part corresponding to the g-th convolution window is

A_q^(g) = (a_.(q-1)u+1^(g), ..., a_.qu^(g))  (17)

Let the intermediate value z_mq^(g) of the m-th convolution output channel for the g-th window of the q-th part be

z_mq^(g) = Σ_{n=(q-1)u+1..qu} s_mn^(g)  (18)

Then, from equation (6),

y_m^(g) = Σ_{q=1..Q} z_mq^(g) + b_m  (19)

Let the output feature map contribution C_q^(g) of the q-th part for the g-th window be

C_q^(g) = (z_1q^(g), z_2q^(g), ..., z_Mq^(g))^T  (20)

Then, from formulas (19) and (20),

C^(g) = Σ_{q=1..Q} C_q^(g) + b  (21)
As shown in FIG. 4, the u input channels of the q-th part are processed by the output channel parallel module of FIG. 1 to obtain (17), and C_q^(g) is then obtained from (18) and (20). As shown in FIG. 7, the Q parts of input channels are pipelined, and the values C^(g) of all convolution output channels are calculated from (21). As shown in FIG. 8, the G convolution windows are pipelined to obtain all convolution output feature maps. This scheme is referred to as parallel acceleration scheme four. Assuming the computation of one convolution window requires T clock cycles and the additions use an addition tree, the whole convolution operation requires T + Q + G clock cycles. The number of multipliers consumed is u * M * W_k * H_k, plus the adders of the corresponding addition tree.
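The cycle and multiplier counts stated for the four schemes can be collected into a small cost model (a sketch; adder counts are left out because their formulas are not legible in this excerpt, and the parameter names are assumptions):

```python
import math

def scheme_costs(n, m, w_k, h_k, g, t, q=2):
    """Clock cycles and multiplier counts for parallel acceleration schemes 1-4."""
    u = math.ceil(n / q)                      # channels per part for scheme four
    return {
        1: {"cycles": t + n + g, "multipliers": m * w_k * h_k},
        2: {"cycles": t + g,     "multipliers": n * m * w_k * h_k},
        3: {"cycles": t + m + g, "multipliers": n * w_k * h_k},
        4: {"cycles": t + q + g, "multipliers": u * m * w_k * h_k},
    }
```

For N = 8, M = 16, a 3x3 kernel, G = 900, and T = 10, scheme two is fastest (910 cycles) but needs the most multipliers (1152), while scheme four trades a few cycles for half the multipliers.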
Step eight: As shown in FIG. 9, the convolution layer parameters and the convolution layer structure are obtained from an input source. The convolution layer parameters comprise the weights w and biases b, which are converted into hardware-format data files and stored in memory. The convolution layer structure comprises the number of input channels N of the input feature map, the input feature map width W and height H, the number of convolution kernels M (i.e., the number of output channels), the convolution kernel width W_k and height H_k, and the convolution kernel width stride W_s and height stride H_s, which are stored in a parameter file. From these parameters, the optimal acceleration scheme is selected from the accelerator scheme pool, and the convolution hardware accelerator is finally generated.
Step nine: As shown in FIG. 10, the flow for selecting the optimal accelerator scheme is as follows. First, the input channel threshold N_i and the output channel threshold N_o are specified. Then:
1. judge whether a scheme is manually specified; if so, select that scheme and end, otherwise execute 2;
2. judge whether hardware resource consumption and speed requirements are given; if so, calculate the hardware resource consumption and running speed of all schemes in the accelerator scheme pool and execute 3, otherwise execute 4;
3. judge whether a scheme meeting the requirements exists; if so, select that scheme and end, otherwise execute 7;
4. if the number of input channels is less than N_i, execute 5, otherwise execute 6;
5. if the number of output channels is less than N_o, select hardware accelerator scheme two, otherwise select hardware accelerator scheme one or two; then use that scheme and execute 7;
6. if the number of output channels is less than N_o, select hardware accelerator scheme three, otherwise select hardware accelerator scheme four; then use that scheme and execute 7;
7. judge whether to continue; if so, execute 1, otherwise end.
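The step-nine flow can be sketched as a single selection function: manual override first, then a constraint check over precomputed scheme costs, then the structure-based default of steps 4-6. The constraint and cost formats are illustrative assumptions, and falling through to the structure-based default when no scheme satisfies the constraints is a simplification of the flow:

```python
def select_scheme(n, m, n_i, n_o, manual=None, constraints=None, costs=None):
    """Pick an accelerator scheme (1-4) for a layer with n inputs and m outputs."""
    if manual is not None:                    # step 1: user-specified scheme wins
        return manual
    if constraints is not None and costs is not None:   # steps 2-3
        for scheme, cost in sorted(costs.items()):
            if (cost["multipliers"] <= constraints["max_multipliers"]
                    and cost["cycles"] <= constraints["max_cycles"]):
                return scheme
    if n < n_i:                               # steps 4-5: few input channels
        return 2 if m < n_o else 1
    return 3 if m < n_o else 4                # step 6: many input channels
```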

Claims (5)

1. The design method of the self-adaptive convolution layer hardware accelerator is characterized by comprising the following steps of:
(1) Analyzing the convolution layer structure, designing four different hardware accelerator schemes for different convolution layer structures, and storing them in an accelerator scheme pool; a threshold N_i on the number of input channels and a threshold N_o on the number of output channels are specified by the user, and the convolution layer structure is divided into the following four types: first, the number of input channels is less than N_i and the number of output channels is less than N_o; second, the number of input channels is less than N_i and the number of output channels is greater than N_o; third, the number of input channels is greater than N_i and the number of output channels is less than N_o; fourth, the number of input channels is greater than N_i and the number of output channels is greater than N_o;
(2) Acquiring a convolution layer structure and convolution layer parameters from an input source, selecting an optimal accelerator scheme from an accelerator scheme pool according to the convolution layer structure, and constructing a corresponding convolution layer accelerator by the accelerator scheme;
the accelerator scheme pool contains the following hardware accelerator schemes:
the first parallel acceleration scheme is to perform parallel operation on the output channel and pipeline operation on the input channel and the convolution window respectively;
a parallel acceleration scheme II carries out parallel operation on the output channel and the input channel and carries out pipeline operation on the convolution window;
a parallel acceleration scheme III, which carries out parallel operation on the input channel and respectively carries out pipeline operation on the output channel and the convolution window;
a parallel acceleration scheme IV, which carries out parallel operation on part of input channels and output channels and respectively carries out pipeline operation on part of input channels and convolution windows;
storing four hardware accelerator schemes in a memory area, referred to as an accelerator scheme pool;
the hardware accelerator is generated by an optimal accelerator scheme and convolutional layer parameters;
the parallel acceleration scheme IV comprises the steps of dividing an input channel into a plurality of equal parts, and carrying out convolution operation on a convolution window of the plurality of input channels of each part and all convolution kernels; then carrying out pipeline operation on a plurality of input channels so as to obtain a convolution window convolution output of all the input channels; then, carrying out pipeline operation on the convolution window to obtain convolution output of all input channels;
the optimal accelerator scheme is selected as follows:
1. judging whether the layer belongs to the first convolution layer structure; if so, preferentially adopting acceleration scheme two, otherwise executing 2;
2. judging whether the layer belongs to the second convolution layer structure; if so, preferentially adopting acceleration scheme one or two, otherwise executing 3;
3. judging whether the layer belongs to the third convolution layer structure; if so, preferentially adopting acceleration scheme three, otherwise executing 4;
4. the layer necessarily belongs to the fourth convolution layer structure, and acceleration scheme four is preferentially adopted.
2. The hardware accelerator design method of claim 1, wherein the step (2) specifically comprises: obtaining from the input source the height and width of the input feature map and its number of input channels; obtaining the height and width of the convolution kernels, the number of convolution kernels, and the width stride and height stride; obtaining the values of the input feature map, the convolution layer weights, and the convolution layer biases; estimating from the convolution layer parameters the hardware resources consumed by each acceleration scheme and the clock cycles required; and combining these estimates with the user's task constraints to select the optimal accelerator scheme and generate the convolution layer accelerator.
3. The hardware accelerator design method of claim 1, wherein the specific steps of acquiring the convolution layer structure and parameters from the input source are as follows:
1) acquiring the shape of the convolution layer weight tensor, and analyzing from it the number of convolution kernels, the convolution kernel size, and the stride;
2) acquiring the shape of the convolution layer input feature map tensor, and analyzing from it the input feature map size and the number of input channels;
3) quantizing the values of the convolution layer input feature map, weights, and biases, and converting them into a hardware-format data file.
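Steps 1)–3) can be sketched as below. The tensor layouts (`(kernels, channels, kh, kw)` for weights, `(channels, h, w)` for the feature map) and the Q8.8 fixed-point format are assumptions for illustration; the claim only states that shapes are read and values are quantized, not the layout or bit format.

```python
def parse_layer(weight_shape, input_shape):
    """Derive structure fields from tensor shapes.
    Assumed layouts: weights (kernels, channels, kh, kw),
    input feature map (channels, height, width)."""
    n_kernels, _, kh, kw = weight_shape
    in_ch, h, w = input_shape
    return {"kernels": n_kernels, "kernel_hw": (kh, kw),
            "in_ch": in_ch, "feature_hw": (h, w)}

def quantize_q8_8(x):
    """Toy quantization to signed 16-bit Q8.8 fixed point, one common
    hardware number format (an assumption, not the patent's format)."""
    v = int(round(x * 256))
    return max(-32768, min(32767, v))

info = parse_layer((64, 3, 3, 3), (3, 224, 224))
q = quantize_q8_8(0.5)
```

A separate tool pass would then write the quantized values out as the hardware-format data file mentioned in step 3).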
4. The hardware accelerator design method of claim 1, wherein step (2) specifically comprises:
a. acquiring the convolution layer structure and parameters from the input source, comprising a file defining the convolution layer structure and a data file containing the convolution layer weights and biases;
b. selecting the optimal accelerator scheme according to the structural parameters of the convolution layer, namely the convolution kernel size, the number of input channels, the number of output channels, and the convolution stride, and generating the corresponding convolution layer accelerator.
5. The hardware accelerator design method of claim 1, wherein the convolution layer parameters comprise weights and biases, and the convolution layer structure comprises the number of input channels of the input feature map, the input feature map width and height, the number of convolution kernels (i.e., the number of output channels), the convolution kernel width and height, and the convolution kernel width and height strides.
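The structure fields enumerated in claim 5 map naturally onto a record type. The field names below are illustrative, and the output-size helper assumes no padding, since the claim does not mention a padding parameter.

```python
from dataclasses import dataclass

@dataclass
class ConvLayerStructure:
    """Convolution layer structure per claim 5 (field names assumed)."""
    in_channels: int     # input channels of the input feature map
    in_width: int
    in_height: int
    num_kernels: int     # equals the number of output channels
    kernel_width: int
    kernel_height: int
    stride_w: int        # width stride of the convolution kernel
    stride_h: int        # height stride of the convolution kernel

    def output_hw(self):
        # Valid (no-padding) output size -- padding is an assumption here.
        ow = (self.in_width - self.kernel_width) // self.stride_w + 1
        oh = (self.in_height - self.kernel_height) // self.stride_h + 1
        return oh, ow

s = ConvLayerStructure(3, 224, 224, 64, 3, 3, 1, 1)
```

For a 224x224 input with 3x3 kernels and unit stride, `output_hw` gives a 222x222 output map.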
CN201811537915.0A 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator Active CN109740731B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811537915.0A CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator
PCT/CN2019/114910 WO2020119318A1 (en) 2018-12-15 2019-10-31 Self-adaptive selection and design method for convolutional-layer hardware accelerator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811537915.0A CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator

Publications (2)

Publication Number Publication Date
CN109740731A CN109740731A (en) 2019-05-10
CN109740731B true CN109740731B (en) 2023-07-18

Family

ID=66360373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811537915.0A Active CN109740731B (en) 2018-12-15 2018-12-15 Design method of self-adaptive convolution layer hardware accelerator

Country Status (2)

Country Link
CN (1) CN109740731B (en)
WO (1) WO2020119318A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator
CN110084363B (en) * 2019-05-15 2023-04-25 电科瑞达(成都)科技有限公司 Deep learning model acceleration method based on FPGA platform
CN110263923B (en) * 2019-08-12 2019-11-29 上海燧原智能科技有限公司 Tensor convolutional calculation method and system
CN110503201A (en) * 2019-08-29 2019-11-26 苏州浪潮智能科技有限公司 Neural network distributed parallel training method and device
CN110929860B (en) * 2019-11-07 2020-10-23 深圳云天励飞技术有限公司 Convolution acceleration operation method and device, storage medium and terminal equipment
CN111047010A (en) * 2019-11-25 2020-04-21 天津大学 Method and device for reducing first-layer convolution calculation delay of CNN accelerator
CN111738433B (en) * 2020-05-22 2023-09-26 华南理工大学 Reconfigurable convolution hardware accelerator
CN111931909B (en) * 2020-07-24 2022-12-20 北京航空航天大学 Lightweight convolutional neural network reconfigurable deployment method based on FPGA
CN114186677A (en) * 2020-09-15 2022-03-15 中兴通讯股份有限公司 Accelerator parameter determination method and device and computer readable medium
CN112950656A (en) * 2021-03-09 2021-06-11 北京工业大学 Block convolution method for pre-reading data according to channel based on FPGA platform
CN115145839B (en) * 2021-03-31 2024-05-14 广东高云半导体科技股份有限公司 Depth convolution accelerator and method for accelerating depth convolution
CN115146767B (en) * 2021-03-31 2024-05-28 广东高云半导体科技股份有限公司 Two-dimensional convolution accelerator and method for accelerating two-dimensional convolution

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869117B (en) * 2016-03-28 2021-04-02 上海交通大学 GPU acceleration method for deep learning super-resolution technology
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing a sparse convolutional neural network accelerator
EP3346427B1 (en) * 2017-01-04 2023-12-20 STMicroelectronics S.r.l. Configurable accelerator framework, system and method
CN207440765U (en) * 2017-01-04 2018-06-01 意法半导体股份有限公司 System on chip and mobile computing device
US20180189641A1 (en) * 2017-01-04 2018-07-05 Stmicroelectronics S.R.L. Hardware accelerator engine
CN107993186B (en) * 2017-12-14 2021-05-25 中国人民解放军国防科技大学 3D CNN acceleration method and system based on Winograd algorithm
CN108280514B (en) * 2018-01-05 2020-10-16 中国科学技术大学 FPGA-based sparse neural network acceleration system and design method
CN108133270B (en) * 2018-01-12 2020-08-04 清华大学 Convolutional neural network acceleration method and device
CN108805267B (en) * 2018-05-28 2021-09-10 重庆大学 Data processing method for hardware acceleration of convolutional neural network
CN108875915B (en) * 2018-06-12 2019-05-07 辽宁工程技术大学 Deep adversarial network optimization method for embedded applications
CN109740731B (en) * 2018-12-15 2023-07-18 华南理工大学 Design method of self-adaptive convolution layer hardware accelerator

Also Published As

Publication number Publication date
CN109740731A (en) 2019-05-10
WO2020119318A1 (en) 2020-06-18

Similar Documents

Publication Publication Date Title
CN109740731B (en) Design method of self-adaptive convolution layer hardware accelerator
US10691996B2 (en) Hardware accelerator for compressed LSTM
CN109948029B (en) Neural network self-adaptive depth Hash image searching method
CN112001294B (en) Vehicle body surface damage detection and mask generation method and storage device based on YOLACT++
CN110188866B (en) Feature extraction method based on attention mechanism
CN111861906A (en) Pavement crack image virtual augmentation model establishment and image virtual augmentation method
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN110874636A (en) Neural network model compression method and device and computer equipment
CN113222998B (en) Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
US20210182357A1 (en) System and method for model parameter optimization
CN113065653A (en) Design method of lightweight convolutional neural network for mobile terminal image classification
CN116363423A (en) Knowledge distillation method, device and storage medium for small sample learning
Lin et al. Aacp: Model compression by accurate and automatic channel pruning
CN109145738B (en) Dynamic video segmentation method based on weighted non-convex regularization and iterative re-constrained low-rank representation
Cao et al. Improving prediction accuracy in LSTM network model for aircraft testing flight data
CN116563355A (en) Target tracking method based on space-time interaction attention mechanism
CN114581789A (en) Hyperspectral image classification method and system
CN117058235A (en) Visual positioning method crossing various indoor scenes
CN111814884A (en) Target detection network model upgrading method based on deformable convolution
CN108830802B (en) Image blur kernel estimation method based on short exposure image gradient guidance
CN110647917A (en) Model multiplexing method and system
CN116152263A (en) CM-MLP network-based medical image segmentation method
CN110457155A (en) Sample class label correction method, device, and electronic equipment
Kadu et al. Human motion classification and management based on mocap data analysis
CN113033106A (en) Steel material performance prediction method based on EBSD and deep learning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant