CN112116084A - Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform - Google Patents


Info

Publication number
CN112116084A
CN112116084A (application CN202010965915.1A)
Authority
CN
China
Prior art keywords
layer
calculation
chip
strategy
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010965915.1A
Other languages
Chinese (zh)
Inventor
宫磊
王超
朱宗卫
李曦
陈香兰
周学海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202010965915.1A priority Critical patent/CN112116084A/en
Publication of CN112116084A publication Critical patent/CN112116084A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/086Learning methods using evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Physiology (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolutional neural network hardware accelerator that solidifies all network layers on a reconfigurable platform, comprising: a control module, used to coordinate and control the acceleration process, including initializing and synchronizing the other on-chip modules and initiating the interaction of different types of data between each computation core and the off-chip memory; a data transmission module, comprising a memory controller and a plurality of DMA controllers, for data interaction between each on-chip data cache and the off-chip memory; and a computation module, comprising a plurality of computation cores, each corresponding one-to-one to a different network layer of the convolutional neural network. Each computation core serves as one stage of a pipeline, and all computation cores together form a complete coarse-grained pipeline structure; each computation core internally contains a fine-grained computation pipeline. By realizing end-to-end mapping between layer-wise computation and the hardware structure, the invention improves the fit between software and hardware characteristics and the utilization efficiency of computing resources.

Description

Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
Technical Field
The invention belongs to the technical field of convolutional neural network hardware acceleration, and particularly relates to a convolutional neural network hardware accelerator that solidifies all network layers on a reconfigurable platform, and to a corresponding acceleration method.
Background
With the growth of their learning and classification capabilities, convolutional neural networks are being deployed in the cloud and at the edge on a scale that increases year by year. To solve more abstract and complex real-world classification and learning problems, convolutional neural networks keep growing in size, and their computational complexity and data volume increase sharply as well. The 2012 Google Cat network system had around 1 billion neuron connections. The VGG19 model, which appeared in 2014, has about 140 million neuron connections, and a single feedforward pass requires nearly 40 billion operations. On the other hand, because general-purpose computing platforms are difficult to optimize specifically in terms of computation and storage structure, data-stream scheduling, and so on, deploying convolutional neural networks, which are both compute-intensive and data-intensive, on CPUs and GPUs suffers from large computing-resource requirements, long computation times, and high power consumption. For example, the Go program AlphaGo requires about one thousand CPUs and two hundred GPUs computing simultaneously in the background; each move takes several minutes of inference, and a single game consumes up to three thousand dollars' worth of electricity. Therefore, how to deploy convolutional neural networks with high performance and low energy consumption has become a research hotspot for academia and commercial institutions.
Balancing performance and energy efficiency, hardware acceleration based on ASICs and FPGAs is now widely adopted across fields; it improves the deployment efficiency of convolutional neural networks by providing higher parallelism at the computation level and on-chip caching strategies that match data-locality characteristics at the memory-access level. An ASIC is an integrated circuit chip designed and developed for a specific purpose, with high performance, low power consumption, and small area. An FPGA is a typical reconfigurable device containing a large number of repeatedly configurable logic circuits; it offers good customizability and extensibility and can meet the high-performance, low-power operating requirements of specific applications, thereby achieving a higher energy-efficiency ratio.
At present, mainstream hardware accelerators generally adopt two kinds of computation structures, vector inner-product units and two-dimensional arrays, to process the matrix and vector operations in the convolutional neural network feedforward pass. For computation mapping, the vector inner-product unit mainly relies on loop unrolling and loop tiling. The two-dimensional array establishes interconnected data paths among the PEs, so that data can flow flexibly within the PE array to improve computation parallelism and data reuse.
Based on these computation structures, current mainstream ASIC and FPGA hardware accelerators generally adopt an acceleration mode that multiplexes computing components across layers: all computing components are organized into a number of homogeneous processing units and packaged into a unified single-core computing whole, and the different network layers are computed layer by layer by time-sharing this single core. This on-chip architecture and acceleration method, originally intended to make ASIC accelerators adaptable to more neural network types, mismatches the inherent computational characteristics of convolutional neural networks. When convolutional neural networks are deployed on reconfigurable computing platforms such as FPGAs, the customizability of the hardware makes this mismatch more prominent and seriously reduces hardware acceleration efficiency, as embodied in the following aspects.
Different network layers of a convolutional neural network differ greatly in their parallelism along different dimensions, which conflicts with a single, fixed hardware parallelism. Each network layer has a large amount of intra-layer computation parallelism in dimensions such as input feature maps, output feature maps, input neurons, and output neurons. However, a single, homogeneous hardware computing structure can only perform layer computation with a fixed parallelism in a fixed dimension, so the parallelism of the underlying hardware does not match that of the upper-level application, leaving part of the on-chip resources idle during computation.
The limited intra-layer computation parallelism that the single-core acceleration mode can exploit conflicts with the large amount of parallel computing resources provided by current FPGA devices. With continued progress in fabrication processes, the floating-point computing performance of modern FPGA chips can reach the order of 10 TFLOPS. In the single-core acceleration mode, on one hand, the parallelism of each dimension within each layer is limited and the parallel characteristics of different layers differ markedly across dimensions, so the effective computation parallelism the accelerator can deploy is limited. On the other hand, the single core's layer-by-layer acceleration cannot exploit the computation parallelism between network layers. For these two reasons, the overall parallelism of existing hardware accelerator structures is limited, and it is difficult to fully and efficiently utilize the various FPGA logic resources during accelerator deployment.
The memory-access characteristics of different network layers differ greatly, which conflicts with an acceleration mode in which different layers multiplex the same hardware unit. On one hand, different network layers differ markedly in data composition and data volume, so reusing the same on-chip data caching strategy across layers is rarely optimal. Taking the Caffeine deployment of VGG16 as an example, the computation parallelism of most convolutional layers in the model is exactly the same after loop tiling, so multiplexing the computation units does not by itself reduce computation efficiency. Nevertheless, the average performance of the accelerator on the convolutional layers is only 76.7% of peak, mainly because the data caching policy causes a serious off-chip memory-access bottleneck.
On the other hand, the two main layer types in convolutional neural network computation, convolutional layers and fully connected layers, are respectively compute-intensive and memory-access-intensive, and their data locality and off-chip memory bandwidth overhead during computation differ by orders of magnitude. Under a single-core hardware structure, the different layer types can only be accelerated by time-division multiplexing, so the overall off-chip memory bandwidth is under-utilized during convolutional-layer computation, while the on-chip computing capability cannot be fully exploited during fully-connected-layer computation because of the memory-access bottleneck; the utilization of off-chip memory bandwidth during acceleration is therefore unbalanced. Again taking the Caffeine deployment of VGG16 as an example, the fully connected layers, which account for less than 1% of the total computation, reduce the overall computation performance by about 20%, seriously affecting hardware acceleration efficiency.
Disclosure of Invention
In view of the above technical problems, the present invention aims to provide a convolutional neural network hardware accelerator and acceleration method that solidify all network layers on a reconfigurable platform. A heterogeneous multi-core accelerator structure and acceleration method are proposed systematically on the reconfigurable platform, effectively alleviating the software/hardware feature mismatch in convolutional neural network hardware acceleration. By realizing end-to-end mapping between layer-wise computation and the hardware structure, the fit between software and hardware characteristics is improved, the waste of large amounts of hardware resources common in traditional convolutional neural network accelerator designs is avoided, and the utilization efficiency of computing resources is improved.
The technical scheme of the invention is as follows:
a convolutional neural network hardware accelerator that solidifies all network layers on a reconfigurable platform, comprising:
the control module is used to coordinate and control the acceleration process, including initializing and synchronizing the other on-chip modules and initiating the interaction of different types of data between each computation core and the off-chip memory;
the data transmission module comprises a memory controller and a plurality of DMA controllers and is used for data interaction between each on-chip data cache and the off-chip memory;
the calculation module comprises a plurality of computation cores, each corresponding one-to-one to a different network layer of the convolutional neural network; different computation cores each have logically independent off-chip weight access paths, a unidirectional on-chip data path for transmitting input-feature-map data is provided between each pair of adjacent computation cores, and the first and last computation cores respectively have the feed-forward input and output off-chip access paths; each computation core serves as one stage of a pipeline, and all computation cores together form a complete coarse-grained pipeline structure; each computation core internally contains a fine-grained computation pipeline; the in-core pipeline of each convolutional layer is divided into four stages, fetch, compute, activate, and output, and the pipeline parallelism is independently designed and optimized according to the parallel computation pattern of the corresponding network layer.
In a preferred technical scheme, different computing cores in the computing module internally comprise respective local ping-pong cache sets, design parameters of the local ping-pong cache sets are independently adjusted according to parallelism of an in-core pipeline, and computing results of the computing cores are directly input to subsequent computing cores in a streaming mode.
In a preferred technical scheme, a direct mapping strategy is adopted to cache on chip all input feature maps required by each round of computation of each computation core; the direct mapping strategy comprises convolutional-layer loop unrolling and tiling and fully-connected-layer tiling;
the size of M, N layers of circulation is T in the unwinding and slicing of the convolution layer circulationm、TnWhile the circulation of the R layer is expanded, the circulation of the R layer is carried out to be TrThe vector inner product unit structure is obtained by expansion, and the calculation parallelism of the output characteristic diagram of the vector inner product unit structure is TmEach output characteristic diagram is internally provided with TrThe output neurons are calculated simultaneously, and the calculation parallelism of each output neuron is TnTotal computational parallelism of Tm×Tn×TrThe on-chip weight cache capacity is set to Tm×N×K2K is the corresponding convolution window size;
the full-connected layer slice is T in size in dimension of input neuron and output neuronmAnd TnIs calculated with a corresponding computation parallelism of Tm×TnThe on-chip weight cache size is set to Tm×TnAnd all input characteristic graphs of the full connection layer are cached on the chip, and the full connection layer is subjected to sparse processing when the accelerator is deployed.
In a preferred technical scheme, an interlayer fusion mapping strategy is adopted to cache on chip all input feature maps required by each round of computation of each computation core; the interlayer fusion mapping strategy comprises: determining the output-feature-map sub-regions of each layer after interlayer fusion, and applying interlayer fusion to the parallel interlayer computation pipeline by changing the processing granularity of the pipeline stages between layers.
In a preferred technical solution, the interlayer fusion mapping strategy further includes a multi-level interlayer fusion mapping manner, and the fusion rule comprises: the first layer serves as the fusion starting point; if the R, C loop sizes of a subsequent convolutional layer are greater than or equal to 1/5 of the first layer's R, C loop sizes, the layers are fused into the same level; if a convolutional layer's R, C loops are smaller than 1/5 of the first layer's, that layer becomes the fusion starting point of a new level, and so on;
when multi-level interlayer fusion is adopted, all input feature maps of the network layers serving as fusion connection points are cached on chip; after interlayer fusion, the sub-region size of the last convolutional layer in each fusion level is determined, and the corresponding sub-region sizes of the other layers in the same fusion level are determined according to their R, C loop-size relationship with that last convolutional layer; within each interlayer coarse-grained pipeline stage, each convolutional-layer computation core completes its computation using the convolutional-layer loop unrolling and tiling algorithm, with the original R, C loop sizes correspondingly replaced by the height and width of the output-feature-map sub-region;
the mapping mode of the full connection layer is the same as the direct mapping strategy, and after the interlayer fusion mapping mode is adopted, the total parallel deployment parameters of the accelerator calculate the cycle expansion size T of the core for each convolution layerm、Tn、TrThe sizes R ', C' of the subregions of the last convolutional layer in the fusion of the levels and the cyclic expansion size T of the fully-connected layerm、Tn
In a preferred technical scheme, the control module further includes a data access optimization module for optimizing the data access of the fully connected layers; the data access optimization module includes a fully-connected-layer balanced pruning module for pruning and compressing the fully connected layers, and weight redundancy and the irregularity of the pruned weight data are eliminated in cooperation with retraining, specifically including:
setting, during training, a probability value positively correlated with the number of remaining weights for each output neuron; preferentially pruning the weights of output neurons with larger probability values in each pruning round; and compensating the accuracy loss through retraining;
for unbalanced output neurons, the weight is filled with a value of 0;
compressing the sparse weights: the remaining weights after pruning are stored row by row in output-neuron order in a compressed weight matrix; the elements of a position-information matrix correspond one-to-one to the weights in the compressed weight matrix and record the index of each output neuron's remaining weights in the original weight matrix; during computation, the weight data are stored and computed in compressed form.
In a preferred technical scheme, the accelerator further comprises a fully-connected-layer computation core unit that includes a double index cache and an input neuron selector, which respectively cache the weight position information of the fully connected layer and select the corresponding input neurons from the input cache according to that position information; when the computation parallelism is Tm × Tn, the computation core reads Tm × Tn weight data and the corresponding position information through off-chip memory access, sends the weight data to the Tm vector multiply-add units, and sends the position information to the input neuron selector, which selects the corresponding Tn input neurons for each of the Tm vector multiply-add units; the vector multiply-add units then complete the inner-product computation and nonlinear transformation; the Tm × Tn weight data and corresponding position information required by the next round of computation are read into the corresponding double caches.
In a preferred technical scheme, the data access optimization module further comprises an interlayer-pipeline semi-batch processing module: each convolutional layer sequentially computes different feedforward inputs in a non-batched manner, while the fully connected layers process batch_size inputs collectively in batch mode after the convolutional layers have finished computing those batch_size inputs.
In a preferred technical scheme, the on-chip cache capacity of the target hardware platform is compared with the total input-feature-map volume of the accelerated network model; if the on-chip cache capacity of the target hardware platform is larger than the total input-feature-map data volume of all layers of the network, the optimal parallel strategy under the direct mapping strategy and the optimal parallel strategy under the interlayer fusion mapping strategy are both determined, and the final parallel strategy is selected by comparing the two; otherwise, the optimal parallel strategy under the interlayer fusion mapping strategy is determined and taken as the final parallel strategy; the method for determining the optimal parallel strategy comprises the following steps:
(1) constructing a combined optimization model for direct mapping and interlayer fusion mapping, including establishing the overall throughput and the overall compute-to-memory-access ratio corresponding to different parallel strategies;
(2) establishing constraint conditions corresponding to different parallel strategies, including the DSP resource cost of different vector inner-product unit structures and the BRAM resource cost of different cache structures;
(3) solving the multivariate combined optimization model with a genetic algorithm, comprising: encoding the parallel variables into chromosomes in the form of integer arrays, each chromosome element corresponding to one variable; randomly initializing the variable combinations of the chromosomes in the initial population; in each round of evolution, first computing the fitness of each chromosome and determining the inheritance probabilities of the chromosomes according to the fitness; the fitness differs between evolution stages: in the first half of the evolutionary generations, the fitness is defined as the product of the two objective functions, and in the second half, the fitness is defined as the product of the two objective functions divided by the ratio of the corresponding memory-access cost to the memory-access bandwidth limit of the hardware platform; during crossover and mutation, the variables in a chromosome are checked against the constraint conditions before being exchanged or updated.
The invention also discloses a convolutional neural network hardware acceleration method that solidifies all network layers on a reconfigurable platform, using the above hardware accelerator and comprising the following steps:
S01: comparing the on-chip cache capacity of the target hardware platform with the total input-feature-map volume of the accelerated network model;
S02: if the on-chip cache capacity of the target hardware platform is larger than the total input-feature-map data volume of all layers of the network, determining the optimal parallel strategy under the direct mapping strategy and the optimal parallel strategy under the interlayer fusion mapping strategy, and selecting the final parallel strategy by comparing the two;
S03: otherwise, determining the optimal parallel strategy under the interlayer fusion mapping strategy and taking it as the final parallel strategy; the method for determining the optimal parallel strategy comprises the following steps:
(1) constructing a combined optimization model for direct mapping and interlayer fusion mapping, including establishing the overall throughput and the overall compute-to-memory-access ratio corresponding to different parallel strategies;
(2) establishing constraint conditions corresponding to different parallel strategies, including the DSP resource cost of different vector inner-product unit structures and the BRAM resource cost of different cache structures;
(3) solving the multivariate combined optimization model with a genetic algorithm, comprising: encoding the parallel variables into chromosomes in the form of integer arrays, each chromosome element corresponding to one variable; randomly initializing the variable combinations of the chromosomes in the initial population; in each round of evolution, first computing the fitness of each chromosome and determining the inheritance probabilities of the chromosomes according to the fitness; the fitness differs between evolution stages: in the first half of the evolutionary generations, the fitness is defined as the product of the two objective functions, and in the second half, the fitness is defined as the product of the two objective functions divided by the ratio of the corresponding memory-access cost to the memory-access bandwidth limit of the hardware platform; during crossover and mutation, the variables in a chromosome are checked against the constraint conditions before being exchanged or updated.
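The following host-side sketch illustrates the kind of genetic search described in step (3). It is a minimal sketch under simplifying assumptions (a mutation-only variation step, a toy resource and fitness model, illustrative population size and limits), not the patented implementation; all names and numbers in it are placeholders.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct Chrom { std::vector<int> genes; double fitness = 0.0; };   // e.g. {Tm, Tn, Tr}

// Toy resource model: reject combinations whose simplified DSP/BRAM cost is too high.
bool feasible(const Chrom& c, int dsp_limit, int bram_limit) {
    const int dsp  = c.genes[0] * c.genes[1] * c.genes[2];   // ~ parallel multipliers
    const int bram = c.genes[0] * c.genes[1];                // ~ weight-buffer banks
    return dsp <= dsp_limit && bram <= bram_limit;
}
// Toy objectives standing in for overall throughput and compute-to-access ratio.
double throughput(const Chrom& c) { return 1.0 * c.genes[0] * c.genes[1] * c.genes[2]; }
double ctc_ratio(const Chrom& c)  { return 0.5 * c.genes[0]; }

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> gene(1, 64);
    const int pop_size = 32, generations = 100;
    const int dsp_limit = 2048, bram_limit = 1024;
    const double bw_limit = 12.8;                            // GB/s, illustrative

    std::vector<Chrom> pop(pop_size);
    for (auto& c : pop) {                                    // random, feasible initialization
        do { c.genes = {gene(rng), gene(rng), gene(rng)}; }
        while (!feasible(c, dsp_limit, bram_limit));
    }
    for (int g = 0; g < generations; ++g) {
        for (auto& c : pop) {
            double f = throughput(c) * ctc_ratio(c);         // first half: product of objectives
            if (g >= generations / 2)                        // second half: penalize bandwidth use
                f /= std::max(1e-9, (throughput(c) / ctc_ratio(c)) / bw_limit);
            c.fitness = f;
        }
        std::sort(pop.begin(), pop.end(),
                  [](const Chrom& a, const Chrom& b) { return a.fitness > b.fitness; });
        // replace the worse half with mutated copies of the better half (crossover omitted);
        // every mutation is checked against the constraints before being accepted
        for (int i = pop_size / 2; i < pop_size; ++i) {
            Chrom child = pop[i - pop_size / 2];
            child.genes[rng() % child.genes.size()] = gene(rng);
            if (feasible(child, dsp_limit, bram_limit)) pop[i] = child;
        }
    }
    return 0;   // pop.front() holds the best combination kept by the final sort
}
```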
Compared with the prior art, the invention has the advantages that:
1. The invention adopts a multi-core heterogeneous acceleration mode and can allocate hardware resources to different network layers according to their scales, so that network layers of different sizes each have relatively sufficient hardware resources, resolving the conflict between the parallelism of different network layers and the parallelism of homogeneous hardware.
2. The invention integrates multiple cores on the FPGA, with different cores processing different network layers. In this way, the FPGA has sufficient work at all times, alleviating the conflict between the limited intra-layer parallelism exploited by the single-core acceleration mode and the parallel computing resources provided by the FPGA.
3. In the concrete implementation, the invention adopts a three-step method: first, each computation core is deployed and optimized independently; then, the different cores are organized reasonably at the macroscopic level; finally, on-chip computation and off-chip memory access are coordinated by combining the macroscopic and local views. Through these three steps, the conflict between the memory-access characteristics of different network layers and the homogeneous acceleration mode can be effectively resolved.
Drawings
The invention is further described with reference to the following figures and examples:
FIG. 1 is an overall hardware architecture of a convolutional neural network hardware accelerator with a cured full network layer on a reconfigurable platform according to the present invention;
FIG. 2 is a schematic diagram of a vector inner product unit after the convolutional layer is circularly expanded according to the present embodiment;
FIG. 3 is a schematic diagram illustrating a calculation method of a fusion mapping policy between successive convolution layers according to this embodiment;
FIG. 4 is a schematic diagram of a multi-level inter-layer fusion mapping method according to this embodiment;
FIG. 5 is a schematic view of a pruning operation of a conventional full link layer;
FIG. 6 is a schematic diagram of balanced pruning of a fully-connected layer according to the present embodiment;
FIG. 7 is a diagram illustrating weight compression of the fully connected layer according to the present embodiment;
fig. 8 is a schematic structural diagram of a full link layer computation core supporting sparse processing according to the present embodiment;
fig. 9 is a schematic diagram of a semi-batch process in an AlexNet deployment according to this embodiment;
fig. 10 is a schematic diagram of the Roofline multi-core performance analysis model according to the present embodiment;
fig. 11 is a schematic diagram illustrating comparison between performance and energy efficiency of an accelerator, a CPU, and a GPU on the Zynq7020 platform according to this embodiment;
fig. 12 is a schematic diagram illustrating comparison between performance and energy efficiency of an accelerator, a CPU, and a GPU on the Virtex-7690T platform according to this embodiment.
Detailed Description
The above-described scheme is further illustrated below with reference to specific examples. It should be understood that these examples are for illustrative purposes and are not intended to limit the scope of the present invention. The conditions used in the examples may be further adjusted according to the conditions of the particular manufacturer, and the conditions not specified are generally the conditions in routine experiments.
Example (b):
the invention aims at the efficient hardware deployment of the convolutional neural network, combines the reconfigurable computing technology with the heterogeneous multi-core system structure, systematically provides the heterogeneous multi-core accelerator structure and the acceleration method on the reconfigurable platform, and effectively relieves the problem of software and hardware characteristic mismatch in the hardware acceleration of the convolutional neural network. The invention provides a heterogeneous multi-core accelerator deployment method for solidifying a full network layer on a chip, which improves the adaptability between software and hardware characteristics by realizing end-to-end mapping between hierarchical calculation and a hardware structure, avoids the waste of a large amount of hardware resources in the design of the traditional convolutional neural network accelerator, and improves the utilization efficiency of calculation resources.
The whole architecture of the accelerator with the on-chip curing full network layer designed by the invention is shown in figure 1, and mainly comprises a control module, a data transmission module and a calculation module.
The control module is mainly used to coordinate and control the whole acceleration process, including initializing and synchronizing the other on-chip modules and initiating the interaction of different types of data between each computation core and the off-chip memory. In actual deployment, this module may be written as a finite state machine in a hardware description language, or a soft-core or hard-core processor running software code may be used directly. Compared with the former, the software-controlled deployment has lower design complexity. Considering that the accelerator's on-chip structure contains many computation cores with complex control states, the control module is deployed in a software-controlled manner in this design.
The data transmission module mainly comprises a memory controller and a plurality of DMA controllers and is responsible for data interaction between each on-chip data cache and the off-chip memory. The memory controller acts as the data interface between the on-chip logic and the off-chip memory, and all off-chip data accesses are initiated by it. During computation, the DMA controllers serve as the communication medium between the on-chip data caches and the memory controller; they write the network's input data into the input cache of the first computation core, read the output of the last computation core, and write the weight data required by the computation into the weight cache of the corresponding computation core. The total number of DMA controllers therefore equals the number of layers of the corresponding network model plus 2.
The computation module is composed of a plurality of computation cores and is responsible for the actual computation. Each computation core corresponds one-to-one to a different network layer, so the total number of cores equals the number of layers of the accelerated network model. In terms of data interconnection, the computation cores each have logically independent off-chip weight access paths, a unidirectional on-chip data path transmitting input-feature-map data connects each pair of adjacent cores, and core 0 and core n-1 respectively have the feed-forward input and output off-chip access paths. During acceleration, the computation of different network layers is overlapped and pipelined: each computation core serves as one stage of the pipeline, and all cores together form a complete coarse-grained pipeline structure. Each computation core internally contains a fine-grained computation pipeline. Overall, the in-core pipeline of each convolutional layer is divided into four stages, fetch, compute, activate, and output, and the specific pipeline parallelism is independently designed and optimized according to the parallel computation pattern of the corresponding network layer. It should be noted that the computation module may adopt any computation structure, including vector inner products, two-dimensional systolic arrays, and so on, and may be deployed with any data type as required. Without loss of generality, the following description takes the fixed-point vector inner-product unit as an example.
Different from the unified, shared on-chip cache adopted in traditional accelerator designs, each computation core contains its own local ping-pong cache set, whose design parameters such as capacity, partitioning strategy, and number of read/write ports can be adjusted independently according to the parallelism of the in-core pipeline. No output cache is deployed in the accelerator structure; the computation result of each core is streamed directly to the subsequent core. Therefore, unlike the conventional acceleration mode in which input and output feature maps are cached in tiles, all input feature maps required by each round of computation of each layer are kept entirely on chip. In addition, because parallelism is also mined between network layers, the computation parallelism within each layer is lower than in a traditional accelerator under the same hardware resources, so each core's weight-cache capacity requirement is correspondingly reduced. Overall, the data caching strategy adopted in the invention avoids off-chip reads and writes of each layer's feature-map data and thus coordinates effectively with the interlayer pipelined computation mode.
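As an illustration of the per-core local ping-pong cache sets described above, the following is a minimal sketch assuming a simple two-bank double buffer; the type name, bank depth, and the fill/consume logic are placeholders, not the actual on-chip cache design.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Two-bank double buffer: while the in-core pipeline reads bank 'cur', the DMA path
// fills bank 'cur ^ 1'; the roles swap at each round boundary.
template <typename T>
struct PingPongBuffer {
    std::array<std::vector<T>, 2> bank;
    int cur = 0;
    explicit PingPongBuffer(std::size_t depth) {
        bank[0].resize(depth);
        bank[1].resize(depth);
    }
    std::vector<T>&       write_bank()       { return bank[cur ^ 1]; }  // filled by DMA
    const std::vector<T>& read_bank()  const { return bank[cur]; }      // read by compute
    void swap_banks() { cur ^= 1; }                                     // at round boundary
};
```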
(1) Computing mapping and parallelism policies
In order to facilitate the excavation of the interlayer parallelism in a network interlayer calculation pipelining mode, all input feature graphs required by each calculation of each calculation core need to be cached on a chip. However, for a large-scale network model, the amount of input feature map data of each layer is large, and it is difficult to cache all the input feature map data on a chip in some computing scenarios. Therefore, in order to enhance the applicability of the accelerator deployment method, two types of network mapping strategies, namely direct mapping and interlayer fusion, are designed according to the relationship between the total input feature diagram data volume of each layer of the network and the on-chip cache capacity of a target hardware platform.
(1.1) direct mapping policy for networks
The direct mapping strategy corresponds to the condition that the on-chip cache resources of the hardware platform are more abundant compared with the total amount of the input characteristic diagram data of the deployed network model. At this time, one stage of the coarse-grained pipeline corresponds to the calculation of a complete layer.
In this mapping strategy, the computation core of each convolutional layer performs the convolutional-layer loop unrolling and tiling according to Algorithm 1, where M, N are the numbers of output and input channels, R, C are the height and width of the output feature map, K, S are the convolution window size and stride, m, n, r, c, k1, k2 are the loop variables of the corresponding loops, Tr, Tc are the loop-tiling sizes, and Tm, Tn are the loop-unrolling sizes.
Algorithm 1: circulation spreading and slicing mode of convolution layer
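Because Algorithm 1 is presented as a figure in the original publication, the following is a minimal software-only sketch of a loop nest consistent with its description: the M, N, and R loops are tiled/unrolled with sizes Tm, Tn, Tr, the R, C loops are not tiled over the output feature map, and the three innermost loops correspond to the Tm × Tr × Tn multiply-accumulate operations performed in parallel in hardware. The array layouts and function signature are illustrative assumptions.

```cpp
#include <algorithm>
#include <vector>

// M, N: output/input channels; R, C: output feature-map height/width;
// K, S: convolution window size and stride; Tm, Tn, Tr: unroll/tile sizes.
// 'out' must be zero-initialized by the caller; bias and activation are omitted.
void conv_layer(const std::vector<float>& in,   // [N][(R-1)*S+K][(C-1)*S+K]
                const std::vector<float>& w,    // [M][N][K][K]
                std::vector<float>& out,        // [M][R][C]
                int M, int N, int R, int C, int K, int S,
                int Tm, int Tn, int Tr) {
    const int H = (R - 1) * S + K, W = (C - 1) * S + K;   // input height/width
    for (int m0 = 0; m0 < M; m0 += Tm)                    // tile over output channels
      for (int r0 = 0; r0 < R; r0 += Tr)                  // tile over output rows
        for (int c = 0; c < C; ++c)
          for (int n0 = 0; n0 < N; n0 += Tn)              // tile over input channels
            for (int k1 = 0; k1 < K; ++k1)
              for (int k2 = 0; k2 < K; ++k2)
                // the three innermost loops are fully unrolled in hardware:
                // Tm x Tr x Tn multiply-accumulate operations per cycle
                for (int m = m0; m < std::min(m0 + Tm, M); ++m)
                  for (int r = r0; r < std::min(r0 + Tr, R); ++r)
                    for (int n = n0; n < std::min(n0 + Tn, N); ++n)
                      out[(m * R + r) * C + c] +=
                          w[((m * N + n) * K + k1) * K + k2] *
                          in[(n * H + r * S + k1) * W + c * S + k2];
}
```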
Unlike traditional convolutional-layer loop unrolling and tiling, on one hand the R, C loops are not tiled in the output-feature-map dimensions, because all input neurons of each layer are cached on chip and the output neurons are sent directly to the subsequent computation core once computed. On the other hand, to better balance the interlayer computation pipeline stages built on loop unrolling, the M and N loops are unrolled with sizes Tm and Tn while the R loop is simultaneously unrolled with size Tr. The unrolled vector inner-product unit structure is shown in FIG. 2; under this structure, the computation parallelism over output feature maps is Tm, Tr output neurons are computed simultaneously within each output feature map, the computation parallelism of each output neuron is Tn, and the total computation parallelism is Tm × Tn × Tr. Correspondingly, when the on-chip weight cache capacity is set to Tm × N × K², the reuse degree of each weight datum can reach its maximum value R × C.
The mapping of the fully connected layer is similar to the conventional approach, as shown in Algorithm 2, where M, N are the numbers of output and input neurons:
Algorithm 2: Tiled computation of the fully connected layer
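Likewise, the following is a minimal dense sketch consistent with the tiled fully connected computation of Algorithm 2, which is also shown only as a figure in the original; the sparse variant actually used by the accelerator is described in later sections, and the signature and data layouts here are assumptions.

```cpp
#include <algorithm>
#include <vector>

// M, N: numbers of output / input neurons; Tm, Tn: tile sizes, corresponding to
// a Tm x Tn compute parallelism per cycle in hardware.  'out' must be zeroed first.
void fc_layer(const std::vector<float>& in,   // [N]
              const std::vector<float>& w,    // [M][N]
              std::vector<float>& out,        // [M]
              int M, int N, int Tm, int Tn) {
    for (int m0 = 0; m0 < M; m0 += Tm)
      for (int n0 = 0; n0 < N; n0 += Tn)
        // in hardware the two inner loops are unrolled into Tm vector
        // multiply-add units, each consuming Tn inputs per cycle
        for (int m = m0; m < std::min(m0 + Tm, M); ++m)
          for (int n = n0; n < std::min(n0 + Tn, N); ++n)
            out[m] += w[m * N + n] * in[n];
}
```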
A tiled computation of sizes Tm and Tn is performed in the output-neuron and input-neuron dimensions, with a corresponding computation parallelism of Tm × Tn. Unlike the convolutional layer, the weight-data reuse of the fully connected layer is independent of the cache size, so the on-chip weight cache size can be set directly to Tm × Tn. Considering that the input-feature-map data volume of a fully connected layer is generally small, the input feature maps of the fully connected layers are fully cached on chip under both mapping strategies. In addition, the fully connected layers are sparsified when the accelerator is deployed, and the computation cores provide corresponding sparsification support; implementation details are described in later sections.
Because each computation core corresponds one-to-one to a layer, the invocation and control of the vector inner-product units can be solidified directly in each computation core during accelerator deployment (corresponding to the outer four loops in Algorithm 1 and the outer two loops in Algorithm 2), avoiding the extra overhead of compilation and instruction control. In summary, when deploying the accelerator with direct mapping, the parallel design parameters (Tm, Tn, Tr) of each convolutional-layer computation core and the parallel parameters (Tm, Tn) of each fully-connected-layer computation core need to be determined.
(1.2) mapping strategy based on interlayer fusion
The interlayer pipelined computation mode requires each computation core (except the head and tail cores) to avoid off-chip reads and writes of input/output feature-map data. Besides caching all input feature maps on chip, convolutional-layer computation based on interlayer fusion provides another solution. The basic idea of interlayer fusion is shown in FIG. 3: during computation, each output neuron of a subsequent convolutional layer has data dependence on only part of the computation results of the preceding convolutional layer. Therefore, by determining the output-feature-map sub-regions of each layer after interlayer fusion (corresponding to the R, C loops) and changing the processing granularity of the pipeline stages between layers, interlayer fusion can be applied to the parallel interlayer computation pipeline, effectively reducing the amount of input-feature-map data that each convolutional-layer computation core must cache in each coarse-grained pipeline stage. In addition, when the fused layers have the same R, C loop sizes, each output-feature-map sub-region is processed by block convolution; the sub-region sizes of the preceding and following layers are then the same, and the sub-regions are padded by repeating edge pixels during computation.
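As a worked illustration of this data dependence, the following sketch derives, from an assumed R' × C' output sub-region of the last fused convolutional layer, the sub-region each earlier fused layer must produce (and hence buffer) per pipeline stage, using the standard sliding-window relation R_in = (R_out - 1)·S + K; the layer configurations in main() are illustrative, not taken from the patent.

```cpp
#include <cstdio>
#include <utility>
#include <vector>

struct ConvCfg { int K, S; };   // kernel size and stride of one fused convolutional layer

// Given the R' x C' output sub-region of the last fused layer, walk backwards to get
// the sub-region every earlier fused layer must compute per coarse-grained pipeline stage.
std::vector<std::pair<int, int>> fused_subregions(const std::vector<ConvCfg>& layers,
                                                  int Rp, int Cp) {
    std::vector<std::pair<int, int>> sub(layers.size() + 1);
    sub[layers.size()] = {Rp, Cp};                       // last layer's output sub-region
    for (int i = (int)layers.size() - 1; i >= 0; --i)
        sub[i] = {(sub[i + 1].first  - 1) * layers[i].S + layers[i].K,
                  (sub[i + 1].second - 1) * layers[i].S + layers[i].K};
    return sub;   // sub[0] is the input sub-region of the first fused layer
}

int main() {
    // two fused 3x3, stride-1 layers producing an 8 x 8 output tile (illustrative)
    for (auto& s : fused_subregions({{3, 1}, {3, 1}}, 8, 8))
        std::printf("%d x %d\n", s.first, s.second);     // prints 12x12, 10x10, 8x8
    return 0;
}
```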
For network models with many layers, fusing too many layers into one level still requires caching a large number of input feature maps during computation. For this situation, the invention adopts a multi-level interlayer fusion mapping. As shown in FIG. 4, convolutional layers 1 and 2, and convolutional layers 2, 3, and 4, are fused respectively. In this case, the amount of feature-map data that the front network layers must buffer is significantly reduced compared with the conventional fusion of FIG. 3. In fact, FIG. 3 can be regarded as a special case of multi-level layer fusion in which all convolutional layers are fused into the same level. For the choice of a specific multi-level fusion scheme, to reduce the fusion difficulty, the invention analyzes the structural characteristics of common large-scale networks and determines the following fusion rule:
the first layer is used as a fusion starting point, and if the R, C cycle size of the subsequent convolutional layer is larger than or equal to 1/5 of the cycle size of the first layer R, C, the subsequent convolutional layers are fused into the same stage. For R, C layers, when a convolutional layer with a smaller loop than the first layer 1/5 occurs, this layer will serve as the start of fusion for the new level, and so on.
In addition, in order to facilitate computation coordination among the inter-layer pipelines corresponding to different fusion levels, when multi-level inter-layer fusion is adopted, the input feature maps of the network layers (such as the convolutional layer 3 in fig. 4) serving as fusion connection points are all cached on the chip.
After interlayer fusion, the sub-region sizes of the last convolutional layer in each fusion level, denoted R' and C', need to be determined. Once these values are fixed, the corresponding sub-region sizes of the other layers in the same fusion level follow from their R, C loop-size relationship with that last convolutional layer. The computation performed by each convolutional-layer computation core in each interlayer coarse-grained pipeline stage can still be expressed by Algorithm 1, except that the original R, C loop sizes become the height and width of the output-feature-map sub-region. The mapping of the fully connected layers is the same as in the direct mapping strategy. With the interlayer fusion mapping, the overall parallel deployment parameters of the accelerator are the loop-unrolling sizes Tm, Tn, Tr of each convolutional-layer computation core, the sub-region sizes R', C' of the last convolutional layer in each fusion level, and the loop-unrolling sizes Tm, Tn of the fully connected layers. Although the interlayer-fusion mapping strategy can also be deployed on small-scale network models, the limited amount of input data cached on chip reduces weight-data reuse: taking the last convolutional layer of each fusion level as an example, the reuse count of each weight datum drops from R × C under direct mapping to R' × C'. Therefore, in actual accelerator deployment, when both mapping strategies are applicable, the optimal trade-off must be made by the corresponding design-space search method. In addition, since the weight data must be re-fetched from off-chip when different sub-regions are computed, the weight cache only needs to store Tm × Tn × K × K weights in each interlayer pipeline stage to satisfy the computation requirements.
(2) Overall memory access optimization
After all network layers are solidified on chip, the key to accelerator deployment is optimizing the data access of the fully connected layers to relieve the off-chip memory-access pressure caused by the increased overall computation parallelism. Specifically, on one hand, the total weight data volume is reduced by pruning and compressing the fully connected layers and designing corresponding computation cores that support sparse network-layer computation; on the other hand, a semi-batch feedforward computation mode is adopted, which increases the reuse of fully-connected-layer weight data while keeping the amount of input-feature-map data that the convolutional layers must cache unchanged.
(2.1) Balanced pruning of fully connected layers
On one hand, the weight data of the convolutional layers is comparatively small, usually accounting for less than 10% of the total weights, and its redundancy is lower than that of the fully-connected-layer weights (35% vs. 90%); on the other hand, there are many convolutional layers, and giving every convolutional-layer computation core sparse-computation capability would require high additional hardware overhead. Therefore, only the fully connected layers are pruned when the accelerator is deployed.
However, the traditional pruning strategy increases the irregularity of the weight data. As shown in FIG. 5, the numbers of remaining weights of different output neurons after pruning differ markedly, so that during computation, processing different output neurons in parallel causes load imbalance among the hardware computation units, and handling irregular weight data introduces additional hardware control-logic overhead.
To address this problem, the invention draws on balanced pruning and, in cooperation with retraining, eliminates weight redundancy and the irregularity of the pruned weight data. Specifically, a probability value positively correlated with the number of remaining weights is set for each output neuron during training, and in each pruning round the weights of output neurons with larger probability values are pruned preferentially. On this basis, the accuracy loss is compensated through retraining. The fully connected layer after balanced pruning is shown in FIG. 6.
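A minimal sketch of the balanced-pruning idea is given below, assuming the balance target is simply a fixed number k of surviving weights per output neuron; the probability-guided pruning schedule and the retraining step described above are omitted, and all names are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// w is an M x N weight matrix stored row-major, one row per output neuron.
// Keep only the k largest-magnitude weights in each row and zero the rest, so every
// output neuron ends up with the same number of surviving weights (1 <= k <= N).
void balanced_prune(std::vector<float>& w, int M, int N, int k) {
    std::vector<float> mags(N);
    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) mags[n] = std::fabs(w[m * N + n]);
        std::nth_element(mags.begin(), mags.begin() + (N - k), mags.end());
        const float thresh = mags[N - k];       // k-th largest magnitude in this row
        int kept = 0;
        for (int n = 0; n < N; ++n) {
            if (std::fabs(w[m * N + n]) >= thresh && kept < k) ++kept;   // survives
            else w[m * N + n] = 0.0f;                                    // pruned
        }
    }
}
```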
The small number of output neurons that are still unbalanced are balanced by filling in 0 values, such as the red weights in the figure. After balanced pruning is completed, the sparse weights are compressed as shown in FIG. 7. Corresponding to the pruned fully connected layer on the right of FIG. 6, the compressed weight matrix on the left of FIG. 7 stores the remaining weights row by row in output-neuron order, and the elements of the position-information matrix on the right correspond one-to-one to the weights in the compressed weight matrix, recording the index of each output neuron's remaining weights in the original weight matrix. The index data bit width used in accelerator deployment is related to the number of input neurons of the fully connected layer and typically does not exceed 14 bits. During computation, the weight data are stored and computed in compressed form, effectively reducing the weight storage capacity and the off-chip access volume.
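A minimal sketch of this compressed storage format follows, assuming balanced pruning has left at most k weights per output neuron; the struct, function name, and the 16-bit index type (standing in for the ≤14-bit indices mentioned above) are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

struct CompressedFC {
    int M = 0, k = 0;                // output neurons, surviving weights per neuron
    std::vector<float>    values;    // M * k compressed weights, row by row
    std::vector<uint16_t> indices;   // M * k input-neuron indices (position information)
};

// Pack a balanced-pruned M x N weight matrix: per output neuron, keep up to k
// non-zeros with their column indices, padding with explicit zeros if necessary.
CompressedFC compress_fc(const std::vector<float>& w, int M, int N, int k) {
    CompressedFC c;
    c.M = M; c.k = k;
    for (int m = 0; m < M; ++m) {
        int kept = 0;
        for (int n = 0; n < N && kept < k; ++n) {
            if (w[m * N + n] != 0.0f) {
                c.values.push_back(w[m * N + n]);
                c.indices.push_back(static_cast<uint16_t>(n));
                ++kept;
            }
        }
        while (kept++ < k) {                     // re-insert the 0-filled balance weights
            c.values.push_back(0.0f);
            c.indices.push_back(0);
        }
    }
    return c;
}
```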
Corresponding to balanced pruning and weight compression, the invention designs the fully-connected-layer computation core structure shown in FIG. 8. Compared with the convolutional-layer computation core, this structure adds a double index buffer and an input neuron selector, which respectively cache the weight position information of FIG. 7 and select the corresponding input neurons from the input buffer according to that position information. When the computation parallelism is Tm × Tn, the computation core first reads Tm × Tn weight data and the corresponding position information through off-chip memory access; the weight data are then sent directly to the Tm vector multiply-add units, and the position information is sent to the input neuron selector, which selects the corresponding Tn input neurons for each of the Tm vector multiply-add units; the vector multiply-add units then complete the inner-product computation and the nonlinear transformation.
At the same time, the Tm × Tn weight data and corresponding position information required by the next round of computation are read into the corresponding double buffers. Compared with the convolutional-layer computation core, the fine-grained pipeline inside each fully-connected-layer computation core comprises five stages: fetch, select, inner product, activate, and output. In this way, the computation, storage, and memory-access redundancy of the fully connected layer is effectively eliminated.
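A minimal functional sketch of the computation such a core performs is given below: the stored position indices play the role of the input neuron selector, gathering the inputs consumed by the vector multiply-add units. The Tm × Tn hardware unrolling, double buffering, and activation stage are omitted; the data layout mirrors the compression sketch above and is an assumption.

```cpp
#include <cstdint>
#include <vector>

// values / indices follow the compressed layout sketched above: M * k weights and
// their input-neuron indices, row by row.
void sparse_fc(const std::vector<float>& values,
               const std::vector<uint16_t>& indices,
               const std::vector<float>& in,          // N input neurons
               std::vector<float>& out, int M, int k) {
    out.assign(M, 0.0f);
    for (int m = 0; m < M; ++m)                       // Tm rows handled in parallel in hardware
        for (int j = 0; j < k; ++j) {                 // Tn products per vector multiply-add unit
            const int idx = indices[m * k + j];       // "input neuron selector"
            out[m] += values[m * k + j] * in[idx];    // padded zero weights contribute nothing
        }
}
```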
(2.2) Semi-batch processing in the interlayer pipeline
The conventional batch processing mode improves weight-data reuse but also increases the intermediate results of each layer, so applying batch processing to the interlayer-pipelined computation structure would require caching more input feature maps. Considering that the input-feature-map data volume of the fully connected layers is generally small and caching all of it on chip does not cause excessive cache-resource overhead, a semi-batch computation mode is adopted in accelerator deployment: each convolutional layer sequentially computes different feedforward inputs in non-batched fashion, while the fully connected layers wait until the convolutional layers have finished computing batch_size inputs and then process those batch_size inputs collectively in batch mode.
The semi-batch computation in the interlayer pipeline is illustrated here with the AlexNet deployment. As shown in FIG. 9, batch_size is 4, i.e., a batch contains 4 inputs, and there are 8 computation cores on chip corresponding to the 8 network layers of AlexNet, where cores 0 to 4 are convolutional-layer cores and cores 5 to 7 are fully-connected-layer cores. Each interlayer pipeline stage of a convolutional-layer core corresponds to a single input of a batch, and the result is sent to the subsequent core when the stage finishes. Each interlayer pipeline stage of a fully-connected-layer core corresponds to all 4 inputs of a batch, and processing of a batch starts only after the batch results of the preceding core have been received. In FIG. 9, when Stage 3 ends, core 4 has finished the 4th input of Batch 3 and passes the Batch 3 results to core 5. At the start of Stage 4, core 4 begins processing the first input of Batch 4 while core 5 batch-processes Batch 3, and so on. In the semi-batch mode, the weight-data reuse of the fully connected layers increases by a factor of batch_size. In terms of on-chip cache, the input-cache size of a convolutional-layer core is the same as in the non-batched mode, while the cache size of a fully-connected-layer core is the layer's input feature size multiplied by batch_size. Correspondingly, the value of batch_size is also a parallel design parameter of the accelerator deployment.
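A minimal sketch of the buffer-sizing and reuse effect of the semi-batch scheme: convolutional cores keep their non-batched input caches, while each fully-connected core buffers batch_size copies of its input feature vector and reuses every weight batch_size times. The layer widths below are AlexNet's fully-connected input sizes, used purely for illustration.

```cpp
#include <cstdio>

int main() {
    const int batch_size = 4;                        // as in the FIG. 9 example
    const int fc_inputs[3] = {9216, 4096, 4096};     // AlexNet FC-layer input widths
    for (int i = 0; i < 3; ++i) {
        // each FC core buffers a whole batch of its input feature vectors on chip
        const int buffer_words = fc_inputs[i] * batch_size;
        std::printf("fc core %d: input buffer = %d words, weight reuse x%d\n",
                    i, buffer_words, batch_size);
    }
    return 0;
}
```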
(3) Design space search strategy for accelerator deployment
For different computing scenarios, accelerator deployment needs to select a suitable mapping mode and determine the final parallel strategy on that basis. The process is shown in Algorithm 3:
Algorithm 3: Global design space search process
(The pseudocode of Algorithm 3 is presented as a figure in the original document.)
First, the on-chip BRAM cache capacity of the target FPGA platform is compared with the total input feature map volume of the accelerated network model. When the former is larger than the latter, both direct mapping and inter-layer fusion mapping are feasible in principle, so the optimal parallel strategy of the accelerator under each of the two mapping strategies must be determined, and the final parallel strategy is then chosen by comparing the theoretical performance of the two.
For the case where the FPGA BRAM capacity is smaller than the total input feature map volume, only the optimal parallel strategy under the inter-layer fusion mapping mode needs to be determined. The core of Algorithm 3 is to determine the optimal parallel strategy under the different mapping modes. Compared with traditional accelerators, an accelerator that solidifies the full network layer on chip has more computation cores: on the one hand, the heterogeneous cores have different computation and memory-access characteristics; on the other hand, the inter-layer pipelined computation structure requires the cores to coordinate with each other to avoid pipeline stalls. Both factors make the overall computation and memory-access behavior of the accelerator more complex. To better characterize this behavior, the present invention analyzes the quantitative relationship between on-chip computation and off-chip memory access for the different parallel strategies under the different mapping modes through a Roofline multi-core performance model, as shown in fig. 10. Generally speaking, the performance bottleneck of a given system lies mainly in computation throughput and off-chip data access; by visually relating the two for different hardware deployment strategies, the Roofline model can be used to analyze and identify the overall performance bottleneck of the system.
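Before turning to fig. 10, the mapping-mode decision of Algorithm 3 described in the two preceding paragraphs can be sketched as follows. This is an illustrative outline only; find_best_parallel_strategy is a hypothetical helper standing in for the per-mapping-mode optimization of Algorithms 4 and 5, assumed to return a (strategy, predicted_performance) pair.

```python
def search_global_design_space(bram_capacity_bytes, total_input_fmap_bytes,
                               find_best_parallel_strategy):
    """Illustrative outline of Algorithm 3: pick the mapping mode, then the
    parallel strategy, based on whether all input feature maps fit in BRAM."""
    if bram_capacity_bytes > total_input_fmap_bytes:
        # Both mapping modes are feasible: optimize each and keep the better one.
        candidates = [find_best_parallel_strategy("direct_mapping"),
                      find_best_parallel_strategy("interlayer_fusion")]
        return max(candidates, key=lambda c: c[1])  # c = (strategy, predicted_performance)
    # Feature maps do not all fit on chip: only inter-layer fusion is considered.
    return find_best_parallel_strategy("interlayer_fusion")
```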
In fig. 10, the abscissa corresponds to the computation-to-memory-access ratio and the ordinate to the computation throughput. Any coordinate point can be regarded as one accelerator deployment scheme, whose ordinate value and slope represent its computation throughput and its off-chip memory-access bandwidth demand, respectively. For a given hardware platform, the throughput ceiling and the memory bandwidth ceiling are fixed; only points falling in region A or region B are valid deployment schemes, while points in other regions correspond to schemes with performance bottlenecks. Deployment scheme 2 (the blue point) is better than deployment scheme 1 (the red point), because scheme 2 fully utilizes the computation throughput of the hardware platform while its memory bandwidth demand (the slope of the point) is lower. Similarly, when two points in regions A and B have the same ordinate value, the point with the larger abscissa corresponds to a higher degree of data reuse and is therefore the better deployment scheme. On the basis of the Roofline model, combined optimization models are constructed and solved for direct mapping and for inter-layer fusion mapping, and the optimal parallel strategy is determined from them. This process is shown in Algorithm 4.
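As a small illustration of this comparison rule (Algorithm 4 itself follows below), the sketch below ranks two candidate deployment points against the roofline; the function names and the tuple representation of a deployment are assumptions.

```python
def attainable_throughput(peak_throughput, peak_bandwidth,
                          compute_to_access_ratio, requested_throughput):
    """Roofline ceiling: a point is limited by its own demand, the platform's
    throughput ceiling, and bandwidth x compute-to-access ratio (the slanted roof)."""
    return min(requested_throughput, peak_throughput,
               peak_bandwidth * compute_to_access_ratio)

def better_deployment(a, b, peak_throughput, peak_bandwidth):
    """a, b = (compute_to_access_ratio, requested_throughput).  Higher attainable
    throughput wins; on a tie, the larger ratio (more data reuse, lower off-chip
    bandwidth demand) wins, as in the comparison of schemes 1 and 2 in fig. 10."""
    score = lambda d: (attainable_throughput(peak_throughput, peak_bandwidth, *d), d[0])
    return a if score(a) >= score(b) else b
```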
Algorithm 4: Combined optimization model oriented to the Roofline model
(The pseudocode of Algorithm 4 is presented as a figure in the original document.)
Because the combined optimization model has many variables and the variables have wide value ranges, the parallel variables are solved with a genetic algorithm in Algorithm 5 in order to accelerate the design space search.
Algorithm 5: Genetic-algorithm-based solving process for the combined optimization model
(The pseudocode of Algorithm 5 is presented as a figure in the original document.)
In Algorithm 5, the parallel variables are encoded as chromosomes in the form of integer arrays, where each element corresponds to one variable. The algorithm first completes, in lines 2-4, the random initialization of the pop_size chromosomes of the initial population, i.e. pop_size variable combinations. In each evolution round, the algorithm computes the fitness of each chromosome and determines the inheritance probability of the different chromosomes according to their fitness. Fitness is defined differently in different evolution stages: in the first gen_round/2 rounds, fitness is defined as the product of the two objective functions, so that chromosomes with high throughput and high computation-to-memory-access ratio are more likely to be retained; in the last gen_round/2 rounds, fitness is defined as the product of the two objective functions divided by the ratio of the corresponding memory-access demand to the memory bandwidth ceiling of the hardware platform, so as to accelerate the elimination of chromosomes with memory-access bottlenecks. In the crossover and mutation processes corresponding to steps 2.d and 2.e, the variables of a chromosome must also be checked against the constraints before being exchanged or updated, to ensure that the new variable combination is valid. The genetic algorithm determines the parameter design for the hardware resources; however, the hardware synthesis process also involves factors that cannot be quantified accurately, such as FPGA LUT resource overhead and clock frequency limits, so trial hardware synthesis still needs to be performed with the obtained parallel variables when the accelerator is actually deployed, and the results are manually screened.
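A minimal sketch of this genetic-algorithm search is given below. The objective and constraint callables are placeholders for the combined optimization model of Algorithm 4; roulette-wheel selection, one-point crossover and single-gene mutation are generic choices assumed here, since the text does not fix them.

```python
import random

def solve_parallel_vars(var_ranges, throughput, ratio, bandwidth, satisfies,
                        bw_peak, pop_size=50, gen_rounds=100):
    """Sketch of Algorithm 5.  var_ranges is a list of (lo, hi) integer ranges,
    one per parallel variable (e.g. Tm, Tn, Tr, batch_size of every core);
    throughput(x) and ratio(x) are the two objectives, bandwidth(x) the off-chip
    bandwidth demand of strategy x, satisfies(x) the DSP/BRAM constraint check."""

    def random_valid():
        while True:
            x = [random.randint(lo, hi) for lo, hi in var_ranges]
            if satisfies(x):
                return x

    def fitness(x, g):
        f = throughput(x) * ratio(x)
        if g >= gen_rounds // 2:               # later rounds: penalise bandwidth bottlenecks
            f /= max(bandwidth(x) / bw_peak, 1e-9)
        return f

    pop = [random_valid() for _ in range(pop_size)]              # random initialization
    for g in range(gen_rounds):
        weights = [fitness(x, g) for x in pop]
        parents = random.choices(pop, weights=weights, k=pop_size)   # selection by fitness
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, len(a))                    # one-point crossover (step 2.d)
            for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
                i = random.randrange(len(child))                 # mutation (step 2.e)
                child[i] = random.randint(*var_ranges[i])
                nxt.append(child if satisfies(child) else random_valid())
        pop = nxt
    return max(pop, key=lambda x: fitness(x, gen_rounds - 1))
```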
The following compares the experimental results of complete hardware deployments of three convolutional neural network models of different scales on two different FPGA platforms against a CPU, a GPU and other representative FPGA accelerator designs. Note that, when comparing with other FPGA accelerators, works that deploy the same network model on the same type of FPGA platform using the traditional convolution computation mode and the same data type are mainly selected as comparison objects, so as to illustrate the benefit brought by the improvements of the design method.
Experimental setup:
Xilinx ZC7020 and Digilent NetFPGA SUME are used as the FPGA target platforms. The ZC7020 is an embedded FPGA platform containing a Xilinx Zynq-7020 FPGA chip. The NetFPGA SUME is a high-performance FPGA platform containing a Xilinx Virtex-7 690T FPGA chip. The LeNet-5 and AlexNet models are deployed on the ZC7020 platform, and the AlexNet and VGG16D models are deployed on the NetFPGA SUME.
Through the design space search, the deployments of LeNet-5 and AlexNet on the ZC7020 platform adopt the direct mapping scheme and the inter-layer fusion mapping scheme, respectively. In the inter-layer fusion, all convolutional layers of the AlexNet model are fused into one level, and the height R'5 and width C'5 of the output feature map sub-region of convolutional layer 5 are both 2; the specific parameters are listed in Tables 1 and 2.
Core    Conv1   Conv2   FC3   FC4   FC5
Tm      6       6       3     1     1
Tn      1       4       1     1     1
Tr      10      5       /     /     /
batch   1       1       1     1     1
Table 1: Parallel parameters of each LeNet-5 computation core on the Xilinx Zynq7020 platform
Core    Conv1   Conv2   Conv3   Conv4   Conv5   FC6   FC7   FC8
Tm      4       9       24      18      6       1     1     1
Tn      3       8       2       2       4       2     1     1
Tr      3       1       1       1       1       /     /     /
batch   1       1       1       1       1       2     2     2
Table 2: Parallel parameters of each AlexNet computation core on the Xilinx Zynq7020 platform
On the NetFPGA SUME platform, the deployments of AlexNet and VGG16D adopt the direct mapping scheme and the inter-layer fusion mapping scheme, respectively. The VGG16D model uses two-level inter-layer fusion: convolutional layers 1 to 8 and convolutional layers 8 to 13 are fused separately, with convolutional layer 8 serving as the fusion connection point, whose input feature map is cached entirely on chip. The heights and widths of the output feature map sub-regions of the last convolutional layers of the first and second fusion levels are (6, 6) and (7, 7), respectively; the specific parallel parameters are listed in Tables 3 and 4.
Core    Conv1   Conv2   Conv3   Conv4   Conv5   Conv6   Conv7   Conv8
Tm      6       16      16      32      16      8       8       16
Tn      3       8       4       4       4       16      16      4
Tr      1       3       3       3       3       3       3       3
batch   1       1       1       1       1       1       1       1

Core    Conv9   Conv10  Conv11  Conv12  Conv13  FC14    FC15    FC16
Tm      32      16      16      8       8       1       1       1
Tn      4       8       2       4       4       3       1       1
Tr      3       3       3       3       3       /       /       /
batch   1       1       1       1       1       8       8       8
Table 3: Parallel parameters of each VGG16D computation core on the Xilinx Virtex-7 690T platform
Core    Conv1   Conv2   Conv3   Conv4   Conv5   FC6   FC7   FC8
Tm      32      14      4       3       2       4     2     2
Tn      3       12      15      15      15      5     5     2
Tr      6       7       13      13      13      /     /     /
batch   1       1       1       1       1       3     3     3
Table 4: Parallel parameters of each AlexNet computation core on the Xilinx Virtex-7 690T platform
(4) Result comparison:
First, the performance of the FPGA prototype accelerators is compared with that of a CPU platform and a GPU platform. The comparison results for the accelerators on the Zynq7020 and Virtex-7 690T platforms are shown in figs. 11 and 12, respectively, where all values are normalized to the CPU results.
For the accelerator deployments on the Zynq7020 platform, the LeNet-5 and AlexNet prototype accelerators achieve speed-ups of 8.03× and 9.46× over the CPU, respectively, and 0.118× and 0.126× relative to the GPU. This is mainly because the Zynq7020 is an embedded platform with limited on-chip hardware resources: the two accelerator deployments use 84% and 100% of the target platform's DSP resources, respectively, but the overall parallelism only reaches 185 MAC/cycle and 220 MAC/cycle, so the computational parallelism of the models is difficult to exploit fully. In terms of energy efficiency, the LeNet-5 and AlexNet prototype accelerators improve on the CPU by 297.6× and 329.6×, respectively, and on the GPU by 3.47× and 3.48×.
For the accelerator deployments on the Virtex-7 690T platform, the AlexNet and VGG16D prototype accelerators achieve speed-ups of 110.2× and 96.8× over the CPU, respectively, and 1.46× and 1.05× relative to the GPU. In terms of energy efficiency, the two prototype accelerators improve on the CPU by 324.7× and 321×, respectively, and on the GPU by 3.43× and 3.27×.
The performance of the FPGA prototype accelerators is then compared with representative previous FPGA accelerator deployments; the results are listed in Table 5. With the accelerator deployment method of the present invention, deploying LeNet-5 and AlexNet on the embedded Zynq platform yields energy-efficiency gains of at least 2.17× and 2.5×, and deploying AlexNet and VGG16 on the high-performance Virtex-7 690T platform yields speed-ups of at least 2.49× and 2.26× and energy-efficiency gains of at least 2.29× and 2.1×. Compared with the deployment of the same network model on the same FPGA platform, the VGG16 deployment on the Virtex-7 690T platform improves performance and energy efficiency by 2.31× and 2.24×, respectively, over Caffeine.
(The data of Table 5 is presented as a figure in the original document.)
Table 5: comparison of prototype accelerator with past accelerator deployment based on FPGA
The above examples are only for illustrating the technical idea and features of the present invention, and the purpose thereof is to enable those skilled in the art to understand the content of the present invention and implement the present invention, and not to limit the protection scope of the present invention. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (10)

1. A convolutional neural network hardware accelerator for curing full network layers on a reconfigurable platform, comprising:
the control module is used for coordinating and controlling the acceleration process, including initializing and synchronizing the other on-chip modules and initiating the interaction of different types of data between each computation core and the off-chip memory;
the data transmission module comprises a memory controller and a plurality of DMA controllers and is used for data interaction between each on-chip data cache and the off-chip memory;
the calculation module comprises a plurality of computation cores, each corresponding one-to-one to a different network layer of the convolutional neural network; different computation cores each have logically independent off-chip weight access paths, a unidirectional on-chip data path for transferring input feature map data is arranged between each preceding computation core and the following one, and the first and last computation cores additionally have off-chip input/output access paths for the feed-forward process; each computation core serves as one stage of the pipeline, and all computation cores together form a complete coarse-grained pipeline structure; each computation core internally contains a fine-grained computation pipeline; the in-core pipeline of each convolutional layer is divided into four stages, namely fetch, inner product, activation and output, and the pipeline parallelism is independently designed and optimized according to the parallel computation mode of the corresponding network layer.
2. The convolutional neural network hardware accelerator of the cured full-network layer on the reconfigurable platform according to claim 1, wherein different computing cores in the computing module include respective local ping-pong cache sets, design parameters of the local ping-pong cache sets are individually adjusted according to parallelism of an in-core pipeline, and computing results of the computing cores are directly input to a subsequent computing core in a streaming manner.
3. The convolutional neural network hardware accelerator of the solidified full network layer on the reconfigurable platform as claimed in claim 1, wherein the input feature maps required by each computation core for each round of computation are entirely cached on chip by adopting a direct mapping strategy; the direct mapping strategy comprises convolutional layer loop unrolling and tiling, and fully-connected layer tiling;
in the convolutional layer hardware loop unrolling, the M and N loop levels are unrolled with sizes Tm and Tn, and the R loop level is unrolled with size Tr, yielding a vector inner-product unit structure whose output feature map computation parallelism is Tm; within each output feature map, Tr output neurons are computed simultaneously, the computation parallelism of each output neuron is Tn, and the total computation parallelism is Tm×Tn×Tr; the on-chip weight cache capacity is set to Tm×N×K², where K is the corresponding convolution window size;
the full-connected layer slice is T in size in dimension of input neuron and output neuronmAnd TnIs calculated with a corresponding computation parallelism of Tm×TnThe on-chip weight cache size is set to Tm×TnAnd all input characteristic graphs of the full connection layer are cached on the chip, and the full connection layer is subjected to sparse processing when the accelerator is deployed.
4. The hardware accelerator of a convolutional neural network for curing full network layer on a reconfigurable platform according to claim 1, characterized in that an inter-layer fusion mapping strategy is adopted to cache on chip all the input feature map data required by each computation core for each round of computation; the inter-layer fusion mapping strategy comprises: determining the output feature map sub-regions of each layer after inter-layer fusion, and applying the inter-layer fusion to the inter-layer pipelined parallel computation mode by changing the processing granularity of the inter-layer pipeline stages.
5. The hardware accelerator of a convolutional neural network for curing full network layers on a reconfigurable platform according to claim 4, wherein the inter-layer fusion mapping strategy further comprises a multi-level inter-layer fusion mapping mode, and the fusion rule comprises: the first layer is taken as the fusion starting point, and if the R and C loop sizes of a subsequent convolutional layer are greater than or equal to 1/5 of the R and C loop sizes of the first layer, that layer is fused into the same level; if the R and C loop sizes of a convolutional layer are less than 1/5 of those of the first layer, that layer is taken as the fusion starting point of a new level, and so on;
when multi-level inter-layer fusion is adopted, the input feature maps of the network layers serving as fusion connection points are cached entirely on chip; after the inter-layer fusion, the sub-region size of the last convolutional layer in each fusion level is determined, and the corresponding sub-region sizes of the other layers in the same fusion level are determined from their R and C loop size relation to that last convolutional layer; each convolutional layer computation core completes its computation within each inter-layer coarse-grained pipeline stage using the convolutional layer loop unrolling and tiling algorithm, with the original R and C loop sizes correspondingly replaced by the height and width of the output feature map sub-region;
the mapping mode of the full connection layer is the same as the direct mapping strategy, and after the interlayer fusion mapping mode is adopted, the total parallel deployment parameters of the accelerator calculate the cycle expansion size T of the core for each convolution layerm、Tn、TrThe sizes R ', C' of the subregions of the last convolutional layer in the fusion of the levels and the cyclic expansion size T of the fully-connected layerm、Tn
6. The convolutional neural network hardware accelerator for curing the full network layer on the reconfigurable platform according to claim 1, wherein the control module further comprises a data access optimization module for optimizing the data access of the fully-connected layers, the data access optimization module comprises a fully-connected layer balanced pruning module for pruning and compressing the fully-connected layers, and weight redundancy and the irregularity of the pruned weight data are eliminated by means of retraining, specifically comprising:
during training, a probability value positively correlated with the number of remaining weights is set for each output neuron; in each pruning round, the weights of output neurons with larger probability values are preferentially pruned, and the accuracy loss is compensated through retraining;
for output neurons that remain unbalanced, the weights are padded with zero values;
the sparse weights are compressed: the weights remaining after pruning are stored row by row in the compressed weight matrix in output-neuron order, the elements of the position information matrix correspond one-to-one to the weights in the compressed weight matrix and record the index of each remaining weight of each output neuron in the original weight matrix, and the weight data are stored and used in this compressed form during computation.
7. The convolutional neural network hardware accelerator for curing the full network layer on the reconfigurable platform according to claim 1, further comprising a fully-connected layer computation core unit including a double-buffered index buffer and an input neuron selector, respectively used for caching the weight position information of the fully-connected layer and for selecting the corresponding input neurons from the input buffer according to that position information; when the computation parallelism is Tm×Tn, the computation core reads Tm×Tn weight data items and the corresponding position information from off-chip memory, the weight data are sent to the Tm vector multiply-add units, the position information is sent to the input neuron selector, which selects the corresponding Tn input neurons for each of the Tm vector multiply-add units, and the vector multiply-add units complete the inner-product computation and the nonlinear transformation; the Tm×Tn weight data items and the corresponding position information required for the next round of computation are read into the corresponding double buffers.
8. The convolutional neural network hardware accelerator for curing full network layers on a reconfigurable platform according to claim 6, wherein the data access optimization module further comprises an inter-layer pipeline semi-batch processing module: each convolutional layer processes different feed-forward input data sequentially in a non-batch mode, and the fully-connected layers wait until the convolutional layers have finished computing batch_size input data and then process those batch_size input data collectively in batch mode.
9. The convolutional neural network hardware accelerator for curing the full network layer on the reconfigurable platform according to claim 1, wherein the relationship between the on-chip cache capacity of the target hardware platform and the total amount of the input feature map of the accelerated network model is compared, when the on-chip cache capacity of the target hardware platform is greater than the total amount of the input feature map data of each layer of the network, the optimal parallel strategy corresponding to the direct mapping strategy and the optimal parallel strategy corresponding to the interlayer fusion mapping strategy are determined, and the final parallel strategy is selected by comparing the optimal parallel strategy corresponding to the direct mapping strategy and the optimal parallel strategy corresponding to the interlayer fusion mapping strategy; otherwise, determining an optimal parallel strategy corresponding to the interlayer fusion mapping strategy, and taking the optimal parallel strategy as a final parallel strategy; the method for determining the optimal parallel strategy comprises the following steps:
(1) constructing a combined optimization model of direct mapping and interlayer fusion mapping, wherein the combined optimization model comprises the steps of establishing overall throughput corresponding to different parallel strategies and overall calculation and memory access ratios corresponding to the different parallel strategies;
(2) establishing constraint conditions corresponding to different parallel strategies, wherein the constraint conditions comprise DSP resource expenses corresponding to different vector inner product unit structures and BRAM resource expenses corresponding to different cache structures;
(3) solving the multivariate combined optimization model based on a genetic algorithm, comprising: encoding the parallel variables into chromosomes in the form of integer arrays, where each element of a chromosome corresponds to one variable; randomly initializing the variable combinations of the chromosomes of the initial population; in each evolution round, first computing the fitness of each chromosome and determining the inheritance probability of the different chromosomes according to their fitness; fitness is defined differently in different evolution stages: in the first half of the evolution rounds, fitness is defined as the product of the two objective functions, and in the second half of the evolution rounds, fitness is defined as the product of the two objective functions divided by the ratio of the corresponding memory-access demand to the memory bandwidth ceiling of the hardware platform; during crossover and mutation, the variables of a chromosome are checked against the constraints before being exchanged or updated.
10. A convolutional neural network hardware acceleration method for curing full network layers on a reconfigurable platform, which is characterized in that a hardware accelerator of claim 1 or 2 is adopted, and the acceleration method comprises the following steps:
s01: comparing the on-chip cache capacity of the target hardware platform with the total input characteristic diagram amount of the accelerated network model;
s02: if the on-chip cache capacity of the target hardware platform is larger than the total input feature map data quantity of each layer of the network, determining an optimal parallel strategy corresponding to the direct mapping strategy and an optimal parallel strategy corresponding to the interlayer fusion mapping strategy, and selecting a final parallel strategy by comparing the optimal parallel strategy corresponding to the direct mapping strategy and the optimal parallel strategy corresponding to the interlayer fusion mapping strategy;
s03: otherwise, determining an optimal parallel strategy corresponding to the interlayer fusion mapping strategy, and taking the optimal parallel strategy as a final parallel strategy; the method for determining the optimal parallel strategy comprises the following steps:
(1) constructing a combined optimization model of direct mapping and interlayer fusion mapping, wherein the combined optimization model comprises the steps of establishing overall throughput corresponding to different parallel strategies and overall calculation and memory access ratios corresponding to the different parallel strategies;
(2) establishing constraint conditions corresponding to different parallel strategies, wherein the constraint conditions comprise DSP resource expenses corresponding to different vector inner product unit structures and BRAM resource expenses corresponding to different cache structures;
(3) solving the multivariate combined optimization model based on a genetic algorithm, comprising: encoding the parallel variables into chromosomes in the form of integer arrays, where each element of a chromosome corresponds to one variable; randomly initializing the variable combinations of the chromosomes of the initial population; in each evolution round, first computing the fitness of each chromosome and determining the inheritance probability of the different chromosomes according to their fitness; fitness is defined differently in different evolution stages: in the first half of the evolution rounds, fitness is defined as the product of the two objective functions, and in the second half of the evolution rounds, fitness is defined as the product of the two objective functions divided by the ratio of the corresponding memory-access demand to the memory bandwidth ceiling of the hardware platform; during crossover and mutation, the variables of a chromosome are checked against the constraints before being exchanged or updated.
CN202010965915.1A 2020-09-15 2020-09-15 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform Pending CN112116084A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010965915.1A CN112116084A (en) 2020-09-15 2020-09-15 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010965915.1A CN112116084A (en) 2020-09-15 2020-09-15 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Publications (1)

Publication Number Publication Date
CN112116084A true CN112116084A (en) 2020-12-22

Family

ID=73803108

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010965915.1A Pending CN112116084A (en) 2020-09-15 2020-09-15 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Country Status (1)

Country Link
CN (1) CN112116084A (en)

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884123A (en) * 2021-02-23 2021-06-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113052258A (en) * 2021-04-13 2021-06-29 南京大学 Convolution method, model and computer equipment based on middle layer characteristic diagram compression
CN113128682A (en) * 2021-04-14 2021-07-16 北京航空航天大学 Automatic neural network model adaptation method and device
CN113159276A (en) * 2021-03-09 2021-07-23 北京大学 Model optimization deployment method, system, equipment and storage medium
CN113191491A (en) * 2021-03-16 2021-07-30 杭州慧芯达科技有限公司 Multi-dimensional parallel artificial intelligence processor architecture
CN113240101A (en) * 2021-05-13 2021-08-10 湖南大学 Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113254206A (en) * 2021-05-25 2021-08-13 北京一流科技有限公司 Data processing system and method thereof
CN113313243A (en) * 2021-06-11 2021-08-27 海宁奕斯伟集成电路设计有限公司 Method, device and equipment for determining neural network accelerator and storage medium
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113592088A (en) * 2021-07-30 2021-11-02 中科亿海微电子科技(苏州)有限公司 Parallelism determination method and system based on fine-grained convolution calculation structure
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
CN113673704A (en) * 2021-07-05 2021-11-19 中国电子科技集团公司第十五研究所 Relational network reasoning optimization method based on software and hardware cooperative acceleration
CN113821981A (en) * 2021-10-08 2021-12-21 上海交通大学 Method and device for constructing convolutional neural network data flow design space analysis tool
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN114202071A (en) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
WO2022134465A1 (en) * 2020-12-24 2022-06-30 北京清微智能科技有限公司 Sparse data processing method for accelerating operation of re-configurable processor, and device
CN114723033A (en) * 2022-06-10 2022-07-08 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN114995822A (en) * 2022-06-07 2022-09-02 重庆大学 Deep learning compiler optimization method special for CNN accelerator
CN115081628A (en) * 2022-08-15 2022-09-20 浙江大华技术股份有限公司 Method and device for determining adaptation degree of deep learning model
CN115906948A (en) * 2023-03-09 2023-04-04 浙江芯昇电子技术有限公司 Full-connection-layer hardware acceleration device and method
WO2023082285A1 (en) * 2021-11-15 2023-05-19 Shanghaitech University Multicore system for neural rendering
CN116451757A (en) * 2023-06-19 2023-07-18 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116980423A (en) * 2023-09-21 2023-10-31 浪潮电子信息产业股份有限公司 Model scheduling method, device, computing system, equipment and readable storage medium
CN117474061A (en) * 2023-12-26 2024-01-30 华中师范大学 Anti-radiation low-delay neural network reasoning acceleration chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宫磊: "可重构平台上面向卷积神经网络的异构多核加速方法研究", 《知网》, vol. 2019, no. 08, pages 1 - 119 *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022134465A1 (en) * 2020-12-24 2022-06-30 北京清微智能科技有限公司 Sparse data processing method for accelerating operation of re-configurable processor, and device
CN112884123B (en) * 2021-02-23 2024-03-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN112884123A (en) * 2021-02-23 2021-06-01 杭州海康威视数字技术股份有限公司 Neural network optimization method and device, electronic equipment and readable storage medium
CN113159276B (en) * 2021-03-09 2024-04-16 北京大学 Model optimization deployment method, system, equipment and storage medium
CN113159276A (en) * 2021-03-09 2021-07-23 北京大学 Model optimization deployment method, system, equipment and storage medium
CN113191491A (en) * 2021-03-16 2021-07-30 杭州慧芯达科技有限公司 Multi-dimensional parallel artificial intelligence processor architecture
CN113191491B (en) * 2021-03-16 2022-08-09 杭州慧芯达科技有限公司 Multi-dimensional parallel artificial intelligence processor architecture
CN113052258B (en) * 2021-04-13 2024-05-31 南京大学 Convolution method, model and computer equipment based on middle layer feature map compression
CN113052258A (en) * 2021-04-13 2021-06-29 南京大学 Convolution method, model and computer equipment based on middle layer characteristic diagram compression
CN113128682B (en) * 2021-04-14 2022-10-21 北京航空航天大学 Automatic neural network model adaptation method and device
CN113128682A (en) * 2021-04-14 2021-07-16 北京航空航天大学 Automatic neural network model adaptation method and device
CN113393376A (en) * 2021-05-08 2021-09-14 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113393376B (en) * 2021-05-08 2024-07-12 杭州电子科技大学 Lightweight super-resolution image reconstruction method based on deep learning
CN113240101A (en) * 2021-05-13 2021-08-10 湖南大学 Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113240101B (en) * 2021-05-13 2022-07-05 湖南大学 Method for realizing heterogeneous SoC (system on chip) by cooperative acceleration of software and hardware of convolutional neural network
CN113254206A (en) * 2021-05-25 2021-08-13 北京一流科技有限公司 Data processing system and method thereof
CN113313243A (en) * 2021-06-11 2021-08-27 海宁奕斯伟集成电路设计有限公司 Method, device and equipment for determining neural network accelerator and storage medium
CN113673704A (en) * 2021-07-05 2021-11-19 中国电子科技集团公司第十五研究所 Relational network reasoning optimization method based on software and hardware cooperative acceleration
CN113592088A (en) * 2021-07-30 2021-11-02 中科亿海微电子科技(苏州)有限公司 Parallelism determination method and system based on fine-grained convolution calculation structure
CN113592088B (en) * 2021-07-30 2024-05-28 中科亿海微电子科技(苏州)有限公司 Parallelism determination method and system based on fine-granularity convolution computing structure
CN113642724B (en) * 2021-08-11 2023-08-01 西安微电子技术研究所 CNN accelerator for high bandwidth storage
CN113642724A (en) * 2021-08-11 2021-11-12 西安微电子技术研究所 CNN accelerator with high bandwidth storage
CN113902099A (en) * 2021-10-08 2022-01-07 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
CN113821981A (en) * 2021-10-08 2021-12-21 上海交通大学 Method and device for constructing convolutional neural network data flow design space analysis tool
CN113902099B (en) * 2021-10-08 2023-06-02 电子科技大学 Neural network design and optimization method based on software and hardware joint learning
WO2023082285A1 (en) * 2021-11-15 2023-05-19 Shanghaitech University Multicore system for neural rendering
CN114202071A (en) * 2022-02-17 2022-03-18 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114202071B (en) * 2022-02-17 2022-05-27 浙江光珀智能科技有限公司 Deep convolutional neural network reasoning acceleration method based on data stream mode
CN114995822A (en) * 2022-06-07 2022-09-02 重庆大学 Deep learning compiler optimization method special for CNN accelerator
CN114723033A (en) * 2022-06-10 2022-07-08 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN114723033B (en) * 2022-06-10 2022-08-19 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN115081628A (en) * 2022-08-15 2022-09-20 浙江大华技术股份有限公司 Method and device for determining adaptation degree of deep learning model
CN115906948A (en) * 2023-03-09 2023-04-04 浙江芯昇电子技术有限公司 Full-connection-layer hardware acceleration device and method
CN116451757B (en) * 2023-06-19 2023-09-08 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116451757A (en) * 2023-06-19 2023-07-18 山东浪潮科学研究院有限公司 Heterogeneous acceleration method, heterogeneous acceleration device, heterogeneous acceleration equipment and heterogeneous acceleration medium for neural network model
CN116980423A (en) * 2023-09-21 2023-10-31 浪潮电子信息产业股份有限公司 Model scheduling method, device, computing system, equipment and readable storage medium
CN116980423B (en) * 2023-09-21 2024-02-09 浪潮电子信息产业股份有限公司 Model scheduling method, device, computing system, equipment and readable storage medium
CN117474061A (en) * 2023-12-26 2024-01-30 华中师范大学 Anti-radiation low-delay neural network reasoning acceleration chip
CN117474061B (en) * 2023-12-26 2024-03-19 华中师范大学 Anti-radiation low-delay neural network reasoning acceleration chip

Similar Documents

Publication Publication Date Title
CN112116084A (en) Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform
CN111178519B (en) Convolutional neural network acceleration engine, convolutional neural network acceleration system and method
CN107862374B (en) Neural network processing system and processing method based on assembly line
CN107169560B (en) Self-adaptive reconfigurable deep convolutional neural network computing method and device
CN107844826B (en) Neural network processing unit and processing system comprising same
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
CN109102065B (en) Convolutional neural network accelerator based on PSoC
CN107578095B (en) Neural computing device and processor comprising the computing device
US11462003B2 (en) Flexible accelerator for sparse tensors in convolutional neural networks
CN110991632B (en) Heterogeneous neural network calculation accelerator design method based on FPGA
CN110738316B (en) Operation method and device based on neural network and electronic equipment
CN111199275B (en) System on chip for neural network
WO2021026225A1 (en) System and method of accelerating execution of a neural network
CN112101525A (en) Method, device and system for designing neural network through NAS
US20210303976A1 (en) Flexible accelerator for sparse tensors in convolutional neural networks
CN113220630A (en) Reconfigurable array optimization method and automatic tuning method of hardware accelerator
CN110334803A (en) Convolutional calculation method and convolutional neural networks accelerator based on rarefaction Winograd algorithm
CN113986816B (en) Reconfigurable computing chip
JP7332722B2 (en) Data processing method, device, storage medium and electronic equipment
CN108491924B (en) Neural network data serial flow processing device for artificial intelligence calculation
CN111831355A (en) Weight precision configuration method, device, equipment and storage medium
CN113065643A (en) Apparatus and method for performing multi-task convolutional neural network prediction
CN112734020A (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
Cecconi et al. Optimal tiling strategy for memory bandwidth reduction for cnns
US20230376733A1 (en) Convolutional neural network accelerator hardware

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination