CN113821981A - Method and device for constructing convolutional neural network data flow design space analysis tool - Google Patents

Method and device for constructing convolutional neural network data flow design space analysis tool

Info

Publication number
CN113821981A
Authority
CN
China
Prior art keywords: design space, neural network, parameters, data stream, analysis tool
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111171756.9A
Other languages
Chinese (zh)
Inventor
景乃锋
胡令矿
霍洋洋
张子涵
蒋剑飞
王琴
绳伟光
毛志刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shanghai Jiaotong University
Priority to CN202111171756.9A
Publication of CN113821981A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 - Arrangements for software engineering
    • G06F8/30 - Creation or generation of source code
    • G06F8/37 - Compiler construction; Parser generation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for constructing a convolutional neural network data flow design space analysis tool. By combining hardware characteristics with the limits of computing and storage resources, a design space exploration method for convolutional network data flows on array processing structures is constructed, providing guidance for mapping convolutional neural network algorithms onto spatial array processing structures.

Description

Method and device for constructing convolutional neural network data flow design space analysis tool
Technical Field
The invention relates to the fields of array processing structure design and neural network data scheduling, and in particular to a method and a device for constructing a convolutional neural network data flow design space analysis tool.
Background
Convolutional Neural Networks (CNNs) are a deep learning model widely used in fields such as image processing and medicine. As research on convolutional neural networks continues to deepen, network structures have grown more complex and convolution sizes keep expanding. Various convolutional neural network accelerator designs have been proposed that accelerate the convolution process with arrayed architectures built from large numbers of processing units. However, because the on-chip computing resources and storage capacity of a hardware platform are limited and the architectural hardware parameters are diverse, optimally deploying a convolutional neural network on a platform is difficult, and a comprehensive and effective solution for deploying neural networks on different platforms is currently lacking. Design space exploration for convolutional neural networks on general hardware is therefore of practical significance and necessity.
It is noted that the information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention aims to provide a method and a device for constructing a convolutional neural network data flow design space analysis tool, so as to solve the prior-art problem that a comprehensive and effective solution for deploying neural networks on different platforms is lacking.
In order to solve the above technical problem, the present invention provides a method for constructing a convolutional neural network data flow design space analysis tool, the method comprising:
performing hardware modeling of an array structure accelerator to obtain hardware parameters of the array structure accelerator;
analyzing data flow scheduling modes of the seven-level convolution loop nest, and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order; and
quantitatively calculating the convolution delay based on the hardware parameters and the optimization strategy, obtaining the optimization parameters with the lowest convolution delay in the design space, and constructing the design space analysis tool.
Optionally, the construction method further comprises:
connecting the Relay front end of the TVM compiler to support neural network structures based on different deep learning frameworks.
Optionally, the hardware parameters include computing resource parameters and storage resource parameters.
Optionally, the computing resource parameters include the number of independent physical slices, the Slice connectivity granularity, the number of multiply-add units available for core computation on a single Slice, and the number of LSEs on a single Slice.
Optionally, the storage resource parameters include the DRAM bandwidth, the total SRAM capacity of each independent Slice, the SRAM bandwidth, and the SRAM partition granularity.
Optionally, analyzing the data flow scheduling modes of the seven-level convolution loop nest and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order includes:
obtaining the loop interchange optimization strategy by establishing a mapping relation between data flows and loop order, on the basis of a tiling parameter describing loop tiling and an unrolling parameter describing loop unrolling.
Optionally, the hardware model includes array slices, a storage module, and a control module.
Optionally, obtaining the optimization parameters with the lowest convolution delay in the design space includes:
calculating all delays by traversing the Slice connectivity, the inner- and outer-level data flows, the loop unrolling parameters and the loop tiling parameters, and taking the parameter combination with the lowest delay as the optimization parameters.
Based on the same inventive concept, the invention provides a device for constructing a convolutional neural network data flow design space analysis tool, the device comprising:
a hardware parameter acquisition module configured to perform hardware modeling of an array structure accelerator and obtain hardware parameters of the array structure accelerator;
an optimization strategy acquisition module configured to analyze data flow scheduling modes of the seven-level convolution loop nest and obtain an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order; and
an analysis tool construction module configured to quantitatively calculate the convolution delay based on the hardware parameters and the optimization strategy, obtain the optimization parameters with the lowest convolution delay in the design space, and construct the design space analysis tool.
Based on the same inventive concept, the invention further provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for constructing the convolutional neural network data flow design space analysis tool described above.
Compared with the prior art, the invention has the following beneficial effects:
1. by combining hardware characteristics with the limits of computing and storage resources, a design space exploration method for convolutional network data flows on array processing structures is constructed, providing guidance for mapping convolutional neural network algorithms onto spatial array processing structures;
2. an automated analysis tool is implemented on top of the established exploration algorithm, and by means of the TVM compiler's Relay front end it reads neural networks from different deep learning frameworks, giving the tool a degree of generality;
3. design space exploration is performed on the data flow scheduling modes of convolutional neural networks, the requirements on processor computing power, storage capacity and data transmission bandwidth are analyzed, and an architectural design direction is provided for array processing structures.
Drawings
FIG. 1 is a schematic diagram of a convolution calculation process and parameter representation;
FIG. 2 is a schematic diagram of spatial and temporal multiplexing and their hardware implementations;
FIG. 3 is a schematic diagram of input multiplexing (input stationary);
FIG. 4 is a schematic diagram of weight multiplexing (weight stationary);
FIG. 5 is a schematic diagram of output multiplexing (output stationary);
FIG. 6 is a schematic diagram of a generic array accelerator hardware architecture;
FIG. 7 is a schematic diagram of a logical slice and a physical slice;
FIG. 8 is a schematic diagram of a convolution pseudo-code incorporating a hardware structure;
FIG. 9 is a schematic flow diagram of an automated analysis tool implementation;
FIG. 10 is a diagram illustrating a comparison of delay times for different data flows;
FIG. 11 is a schematic diagram of delay analysis under different unrolling and tiling modes;
FIG. 12 is a schematic diagram of loop tiling in the convolution calculation;
FIG. 13 is a flowchart of a method for constructing a convolutional neural network data flow design space analysis tool according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in more detail with reference to the schematic drawings. The advantages and features of the present invention will become more apparent from the following description. It is to be noted that the drawings are in a very simplified form and not to precise scale; their only purpose is to facilitate a convenient and clear description of the embodiments of the present invention.
In the description of the present invention, it is to be understood that the terms "center", "upper", "lower", "left", "right", and the like indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
As application demands grow, a variety of neural network accelerator designs that accelerate convolution operations with arrayed structures built from large numbers of processing units have been proposed in succession. However, because the on-chip computing resources and storage capacity of a hardware platform are limited and the architectural hardware parameters are diverse, optimally deploying a convolutional neural network on a platform is difficult, and a comprehensive and effective solution for deploying neural networks on different platforms is currently lacking. The invention provides a method for constructing a convolutional neural network data flow design space analysis tool. Hardware modeling is performed for the array-structure accelerator: the hardware characteristics of the array accelerator are analyzed, and the key hardware parameters that influence convolution performance, such as the number of processing units and the storage bandwidth and capacity, are extracted. At the same time, the data flow scheduling modes of the seven-level convolution loop nest are analyzed, and the loop interchange optimization strategy is described by establishing a mapping relation between data flows and loop order. Quantitative calculation of the convolution delay is performed based on the hardware parameters and the optimization strategy, the optimization parameters with the lowest convolution delay are searched for in the design space, and an automated design space analysis tool is constructed. Finally, the front-end graph compiler Relay of TVM (Tensor Virtual Machine) is connected to support neural network structures from different deep learning frameworks. Based on the layer-by-layer analysis of the exploration algorithm, the tool can provide a mapping optimization scheme for each convolutional layer on the array processing structure, and exploration of the design space can also provide reference for the optimized design of hardware.
Referring to FIG. 13, an embodiment of the present invention provides a method for constructing a convolutional neural network data flow design space analysis tool, the method comprising:
S1: performing hardware modeling of an array structure accelerator to obtain hardware parameters of the array structure accelerator;
S2: analyzing data flow scheduling modes of the seven-level convolution loop nest, and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order;
S3: quantitatively calculating the convolution delay based on the hardware parameters and the optimization strategy, obtaining the optimization parameters with the lowest convolution delay in the design space, and constructing the design space analysis tool.
In contrast to the prior art, this embodiment provides a method for constructing a convolutional neural network data flow design space analysis tool. Firstly, the invention performs hardware modeling of the array-structure accelerator, analyzes the hardware characteristics of the array accelerator, and extracts the key hardware parameters that influence convolution performance. These parameters, such as the number of processing units and the storage bandwidth and capacity, describe the general hardware architecture of the accelerator and also have a critical impact on neural network performance.
Secondly, the invention explores the design space of convolution data flows on the basis of the seven-level convolution loop nest and the extracted hardware parameters. For the convolution loops, a tiling parameter T (Tile) is introduced to describe the loop tiling optimization strategy, an unrolling parameter P (Pipeline) is introduced to describe the loop unrolling optimization strategy, and a mapping relation between data flows and loop order is established to describe the loop interchange optimization strategy. Quantitative calculation of the convolution delay is performed based on the hardware parameters and these optimization parameters, the optimization parameters with the lowest convolution delay are searched for in the design space, and a design space search algorithm is constructed.
Finally, according to the established design space exploration algorithm, the invention implements an automated analysis tool. By connecting the front-end graph compiler Relay of TVM (Tensor Virtual Machine), it reads and parses neural networks from different deep learning frameworks, and based on the layer-by-layer analysis of the exploration algorithm the tool can give the optimization strategy of each convolutional layer. Experiments with the analysis tool quantify the performance improvement of the exploration algorithm on specific network structures. The method and the tool can determine the optimization scheme for convolutional neural network data flows, and design space exploration can provide reference for the optimized design of hardware.
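To make the search space concrete, the sketch below shows one way a single design point could be represented. This is a minimal Python sketch of our own; all class and field names are illustrative assumptions, since the patent only fixes the concepts (tiling parameter T, unrolling parameter P, loop order, data flow, slice connectivity):

```python
from dataclasses import dataclass
from typing import Dict

# One candidate point in the convolution data-flow design space.
# T and P are indexed by the convolution loop they apply to; the data-flow
# fields select the loop order (IS / WS / OS) at the outer and inner level.
@dataclass(frozen=True)
class DesignPoint:
    tile: Dict[str, int]     # T: tiling factor per loop, e.g. {"Tox": 14, "Tic": 32, ...}
    unroll: Dict[str, int]   # P: spatial unrolling factor per loop, e.g. {"Pox": 4, ...}
    outer_dataflow: str      # data flow between tiles: "IS", "WS" or "OS"
    inner_dataflow: str      # data flow within a tile: "IS", "WS" or "OS"
    nshare: int              # slice connectivity granularity Nshare
```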
Optionally, the construction method further comprises:
connecting the Relay front end of the TVM compiler to support neural network structures based on different deep learning frameworks.
Optionally, the hardware parameters include computing resource parameters and storage resource parameters.
Optionally, the computing resource parameters include the number of independent physical slices, the Slice connectivity granularity, the number of multiply-add units available for core computation on a single Slice, and the number of LSEs on a single Slice.
Optionally, the storage resource parameters include the DRAM bandwidth, the total SRAM capacity of each independent Slice, the SRAM bandwidth, and the SRAM partition granularity.
Optionally, analyzing the data flow scheduling modes of the seven-level convolution loop nest and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order includes:
obtaining the loop interchange optimization strategy by establishing a mapping relation between data flows and loop order, on the basis of a tiling parameter describing loop tiling and an unrolling parameter describing loop unrolling.
Optionally, the hardware model includes array slices, a storage module, and a control module.
Optionally, obtaining the optimization parameters with the lowest convolution delay in the design space includes:
calculating all delays by traversing the Slice connectivity, the inner- and outer-level data flows, the loop unrolling parameters and the loop tiling parameters, and taking the parameter combination with the lowest delay as the optimization parameters.
Most array processing structures for the neural network field are designed for specific domains, and most ASIC accelerators implement a fixed computing pattern; yet the computing demands of CNNs are increasingly diverse, and no single solution can satisfy them all. Common storage systems likewise adopt fixed designs and targeted optimizations, so their access patterns and data flows cannot be adjusted to the characteristics of the different layers of a network; since the computation parameters vary from layer to layer, a single data tiling scheme and storage-resource allocation strategy cannot achieve the optimal acceleration for every layer at once. To adapt to the diversity of CNN structures, more and more accelerators possess some reconfigurability, including reconfiguration of the interconnect, of the PE operator functions, and of the computing mode. But how to exploit this flexibility becomes a new challenge when choosing the optimal data flow for a particular CNN accelerator. It is therefore essential to provide a given CNN workload with the computation scheme it requires.
Convolutional layers are the core of CNNs and are used to extract deep features from images. Convolution accounts for more than 90% of the total CNN computation; formula (1) defines two-dimensional convolution:
y(i, j) = \sum_{u=0}^{m-1} \sum_{v=0}^{n-1} x(i+u,\, j+v) \cdot w(u, v) \qquad (1)
where x and w denote elements of the input matrix and the weight matrix (the convolution kernel) respectively, the convolution kernel is of size m × n, and y is the convolution output. In convolutional layers the input image is typically a three-dimensional tensor whose third dimension is called the channel; the corresponding convolution kernels are likewise three-dimensional tensors with the same channel depth as the input feature map. Fig. 1 shows the computation model of a convolutional layer: Noc three-dimensional convolution kernels of size Nkx × Nky × Nic extract features from a feature map of size Nix × Niy × Nic to produce an output feature map of size Nox × Noy × Noc. The meaning of each symbol is given in Table 1.
Table 1 Symbols and their meanings in the convolution calculation
Nix, Niy: width and height of the input feature map
Nic: number of input channels
Nkx, Nky: width and height of the convolution kernel
Nox, Noy: width and height of the output feature map
Noc: number of output channels (number of convolution kernels)
In order to avoid losing pixel information at the image boundary during convolution, the input image usually needs to be padded. With convolution stride S and P pixels of padding on each boundary, the size of the output feature map is given by the following formula.
N_{ox} = \lfloor (N_{ix} - N_{kx} + 2P) / S \rfloor + 1, \qquad N_{oy} = \lfloor (N_{iy} - N_{ky} + 2P) / S \rfloor + 1 \qquad (2)
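For reference, the seven-level convolution loop nest that this scheduling analysis operates on can be written as the following naive Python sketch (our own illustration; dimension names follow Table 1, with the batch loop as the seventh level, stride S, and no padding):

```python
import numpy as np

def conv_seven_loop(x, w, S=1):
    """Naive seven-level convolution loop nest.
    x: input [N, Nic, Niy, Nix]; w: weights [Noc, Nic, Nky, Nkx]."""
    N, Nic, Niy, Nix = x.shape
    Noc, _, Nky, Nkx = w.shape
    Nox, Noy = (Nix - Nkx) // S + 1, (Niy - Nky) // S + 1
    y = np.zeros((N, Noc, Noy, Nox), dtype=x.dtype)
    for n in range(N):                     # loop 7: batch
        for oc in range(Noc):              # loop 6: output channel
            for oy in range(Noy):          # loop 5: output row
                for ox in range(Nox):      # loop 4: output column
                    for ic in range(Nic):  # loop 3: input channel
                        for ky in range(Nky):      # loop 2: kernel row
                            for kx in range(Nkx):  # loop 1: kernel column
                                y[n, oc, oy, ox] += (
                                    x[n, ic, oy * S + ky, ox * S + kx]
                                    * w[oc, ic, ky, kx])
    return y
```

The tiling, unrolling, and interchange strategies discussed below are all transformations of this nest.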
The heavy computation of CNNs presents abundant parallelism with which hardware computing resources can be fully utilized, but it also implies high data transmission cost. Equation (1) shows that the input data, the weight data, and the output data of the convolution are correlated, which means the reuse of the same data across computations can be mined to effectively reduce the number of memory accesses. Data multiplexing stems from both the temporal and the spatial behavior of a CNN accelerator, i.e., multicast of incoming data and reduction of intermediate computation results, as shown in Fig. 2.
By combining the spatial- and temporal-domain data multiplexing methods with the different parallel computing modes of the convolutional layer, and multiplexing the input feature map data, the output feature map data and the convolution kernel data, three multiplexing data flows for convolutional neural network accelerators are obtained: Input Stationary (IS), Weight Stationary (WS), and Output Stationary (OS). Fig. 3 describes the hardware structure of the input-stationary data flow: during calculation, input feature map pixels are read from the global buffer and held in a register inside each PE unit; weight data and partial-sum data are then repeatedly read and sent to the PE units for multiply-add operations; the output partial sums are passed between PE units over the interconnect; and the input pixels stay in the PE registers until every partial-sum output that uses them has been computed. Fig. 4 describes the hardware structure of the weight-stationary data flow: convolution kernel weights are read from the global cache into the PE unit registers; input feature map pixels and output partial sums are then repeatedly read and sent to the PE units; the output partial sums are passed between PE units over the interconnect; and the weights stay in the registers until all input pixels have been read and all partial-sum outputs computed. Fig. 5 describes the hardware structure of the output-stationary data flow: the output partial sums are first read from the global cache into the PE unit registers; input pixels and weights are then repeatedly read, with the input pixels passed between PE units over the interconnect; and each PE unit accumulates its partial sums internally until it has produced a complete partial-sum output.
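The three stationary schemes differ only in which operand is pinned in a PE-local register while the other two stream past it. As a loose illustration of our own (not the patent's hardware description), an output-stationary schedule corresponds to placing the reduction loops innermost:

```python
# Output stationary (OS), sketched as a loop order: each (oc, oy, ox) output
# accumulates fully in a local register before being written back once.
def os_schedule(x, w, y, Noc, Noy, Nox, Nic, Nky, Nkx, S=1):
    for oc in range(Noc):
        for oy in range(Noy):
            for ox in range(Nox):
                acc = 0                        # partial sum held in the PE register
                for ic in range(Nic):          # reduction loops innermost:
                    for ky in range(Nky):      # inputs and weights stream past,
                        for kx in range(Nkx):  # the output stays stationary
                            acc += x[ic, oy * S + ky, ox * S + kx] * w[oc, ic, ky, kx]
                y[oc, oy, ox] = acc            # single write-back per output
```

IS and WS correspond to analogous reorderings in which the loops that reuse the input pixels or the weights, respectively, are pushed innermost.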
The application performs hardware modeling for a general CNN-oriented array processing structure, analyzes the common characteristics of general array-structure accelerators, and carries out model abstraction and key-parameter extraction. The general hardware model established here consists of array slices, a storage module and a control module. Through different task configuration words, each convolutional layer of the network can adopt a different computation optimization strategy, meeting different data flow requirements as well as the requirements of the array-structure accelerator, and providing a reference for the subsequent design space exploration algorithm. Fig. 6 details the hardware architecture of a general array-structure accelerator. Hierarchically, the accelerator framework can be divided into a slice controller, a storage controller, and the array slices; the array slices work independently of one another, and their number is defined as Nslice. The optimized configuration scheme of the whole neural network, including information such as data storage addresses and data flows, is written into the slice controller over the bus. According to the computation task of each array slice (Slice), the slice controller writes the corresponding configuration file into the array controller on that slice, and in the execution stage the storage controller writes the corresponding input pixels, weight data and partial-sum data into each slice's on-chip cache according to the tiling requirements. The array controller on the slice enables the PE array; the load-store elements (Load Store Element, LSE) of the PE array generate data addresses in sequence along the time axis according to the configuration words written by the array controller and read data from the on-chip cache, and after computation completes the results are written back to the on-chip cache through the LSEs. This hardware structure can adopt different parallel processing modes within a slice and across slices, meeting the convolutional layer's requirements for parallel acceleration and data multiplexing.
The array slice connectivity granularity refers to the capability of data interaction between array slices. When different array slices are interconnected through First-In First-Out (FIFO) units, the slices can share on-chip cache capacity and share PE arrays. After the connectivity granularity parameter Nshare is introduced, slices can further be divided into physical slices and logical slices: as shown in Fig. 7, when the connectivity Nshare is 2 and there are 8 physical slices in total, computation actually proceeds on 4 larger logical slices, with data shared between the array slices through the FIFOs. Setting the slice connectivity granularity allows convolutional layers of different sizes to be handled more flexibly. When a convolutional layer is small, the on-chip cache capacity of each physical slice suffices for its data, and each slice can complete its computation task independently without exchanging data. When a convolutional layer is large, several physical slices are merged into one logical slice, which has the larger on-chip cache capacity and the additional computing resources needed to meet the layer's computational requirements.
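A small numeric illustration of the connectivity parameter, using the example of Fig. 7 (the helper function and its return keys are our own naming; the 128 KB and 36-PE values are taken from the test configuration described later):

```python
def logical_slices(n_slice: int, n_share: int, c_sram: int, n_pe: int) -> dict:
    """Group n_share physical slices into one logical slice; SRAM capacity
    and PE count pool across the FIFO-connected slices."""
    assert n_slice % n_share == 0, "connectivity must divide the slice count"
    return {
        "num_logical": n_slice // n_share,
        "sram_per_logical": n_share * c_sram,
        "pe_per_logical": n_share * n_pe,
    }

# Fig. 7: 8 physical slices with Nshare = 2 compute as 4 larger logical slices.
print(logical_slices(n_slice=8, n_share=2, c_sram=128 * 1024, n_pe=36))
```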
Under an arrayed hardware architecture, the bandwidth of external storage (DRAM) is limited, and the more slices there are, the greater the DRAM bandwidth consumption. As the slice count grows, DRAM bandwidth limits the computing efficiency of the array to a great extent and becomes the main factor limiting convolution accelerator performance, so feasible data multiplexing between external storage and the slices' on-chip caches urgently needs to be found. Table 2 lists the storage model parameters extracted from the general array-structure convolution accelerator; since the storage resources cannot operate at peak bandwidth at all times, an actual-bandwidth correction factor is introduced to facilitate calculating and analyzing access time.
Table 2 Summary of storage model parameters
BWDRAM: DRAM (external storage) bandwidth
CSRAM/Slice: total on-chip SRAM capacity of each independent slice
BWSRAM/Slice: on-chip SRAM bandwidth of each slice
ΔCSRAM: SRAM partition granularity
actual bandwidth correction factor: derates the peak bandwidths when calculating access time
From the computational perspective, the arrayed hardware architecture established in this application can be divided into four levels: inter-slice (Inter Slice), inter-tile (Inter Tile), intra-tile (Intra Tile), and single parallel computing task (Single Computing Task). Fig. 8 shows the convolution loop nest annotated with the loop tiling, loop unrolling and loop interchange optimizations. The application analyzes each level in detail, completes its mathematical modeling, and assembles the result into an automated analysis tool whose execution flow is shown in Fig. 9. The tool's execution logic has three parts: user input, convolutional layer parameter extraction, and design space exploration. First, the user configures the hardware parameters, comprising computing resource parameters and storage resource parameters: the computing resources are the number of independent physical slices Nslice, the slice connectivity granularity Nshare, the number of multiply-add units available for core computation on a single slice Npe/Slice, and the number of LSEs on a single slice NLSE/Slice; the storage resources are the DRAM bandwidth BWDRAM, the total SRAM capacity of each independent slice CSRAM/Slice, the SRAM bandwidth BWSRAM/Slice, and the SRAM partition granularity ΔCSRAM. Next, the user supplies a convolutional neural network model described in TensorFlow or PyTorch as input, and the parameters of all convolutional layers in the network are extracted by the framework-generic convolutional layer parameter extraction module. Finally, the optimization scheme is explored in the design space: the delays of all feasible cases are calculated by traversing the slice connectivity, the inner- and outer-level data flows, the loop unrolling parameters and the loop tiling parameters, and the lowest-delay case is output as the optimization scheme.
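A minimal sketch of this exploration loop is shown below. It is an exhaustive traversal with pluggable feasibility checks and delay model; every function name here is our own assumption (the feasibility checks correspond to equations (4) to (7) later in the text, and the delay model to equation (9)):

```python
import itertools

DATAFLOWS = ("IS", "WS", "OS")

def explore_layer(layer, hw, tile_space, unroll_space, nshare_options,
                  fits_on_chip, fits_pe_array, delay):
    """Traverse slice connectivity, outer/inner data flows, loop unrolling and
    loop tiling parameters for one convolutional layer; keep the lowest delay."""
    best = None
    for nshare, outer_df, inner_df, T, P in itertools.product(
            nshare_options, DATAFLOWS, DATAFLOWS, tile_space, unroll_space):
        if not fits_on_chip(T, hw, nshare):    # SRAM capacity check, eqs. (4)-(6)
            continue
        if not fits_pe_array(P, hw, nshare):   # computing-resource check, eq. (7)
            continue
        t = delay(layer, hw, nshare, outer_df, inner_df, T, P)  # eq. (9)
        if best is None or t < best[0]:
            best = (t, nshare, outer_df, inner_df, T, P)
    return best

def explore_network(layers, hw, *args):
    # Layer-by-layer analysis: each convolutional layer gets its own optimum.
    return [explore_layer(layer, hw, *args) for layer in layers]
```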
To make the automated analysis tool general, so that it can recognize neural networks described in different deep learning frameworks and complete the convolutional layer parameter extraction, we connect to the TVM tool. The graph compiler Relay in TVM can compile neural network structures from different deep learning frameworks into a computational graph of the same format, with an intermediate representation in text form. Parsing networks from different frameworks can therefore be implemented by means of Relay, obtaining an Intermediate Representation (IR) description of every network structure. On this basis, the convolutional layer parameters are extracted from the textual intermediate representation: the text description of the network is read line by line, operators are extracted with regular expressions, the effect of the different operators on image sizes is analyzed, the convolutional layers are identified among the operators, and each layer's parameters, such as input image size, input channels, convolution kernel size, output channels and stride, are extracted. The output image size of each convolutional layer is calculated from formula (3) below, and finally the parameters of all convolutional layers in the network are obtained.
N_{ox} = \lfloor (N_{ix} - N_{kx} + 2P) / S \rfloor + 1, \qquad N_{oy} = \lfloor (N_{iy} - N_{ky} + 2P) / S \rfloor + 1 \qquad (3)
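As a rough sketch of the extraction step just described, under current TVM APIs and shown for a PyTorch input (the exact textual formatting of Relay's conv2d attributes varies across TVM versions, so the regular expression below is illustrative rather than definitive):

```python
import re
import torch
from tvm import relay

def extract_conv_layers(model: torch.nn.Module, example: torch.Tensor):
    """Import a model through Relay, then scan the textual IR line by line with
    a regular expression to pull out conv2d parameters, as described above."""
    scripted = torch.jit.trace(model.eval(), example)
    mod, _ = relay.frontend.from_pytorch(scripted, [("input0", tuple(example.shape))])
    layers = []
    # Illustrative attribute pattern; Relay prints conv2d calls roughly as
    #   nn.conv2d(%x, %w, strides=[1, 1], padding=[1, 1, 1, 1],
    #             channels=64, kernel_size=[3, 3])
    pat = re.compile(r"nn\.conv2d\(.*?strides=\[(\d+).*?padding=\[(\d+)"
                     r".*?channels=(\d+).*?kernel_size=\[(\d+)")
    for line in mod.astext(show_meta_data=False).splitlines():
        m = pat.search(line)
        if m:
            S, P, Noc, Nk = map(int, m.groups())
            # Output size then follows formula (3): Nox = floor((Nix - Nkx + 2P)/S) + 1
            layers.append({"stride": S, "padding": P, "Noc": Noc,
                           "Nkx": Nk, "Nky": Nk})
    return layers
```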
The newly proposed convolutional neural network data flow design space exploration tool for array processing structures mainly contributes the following three points:
1. by combining hardware characteristics with the limits of computing and storage resources, a design space exploration method for convolutional network data flows on array processing structures is constructed, providing guidance for mapping convolutional neural network algorithms onto spatial array processing structures;
2. an automated analysis tool is implemented on top of the established exploration algorithm, and by means of TVM's graph compiler Relay it reads neural networks from different deep learning frameworks, giving the tool a degree of generality;
3. design space exploration is performed on the data flow scheduling modes of convolutional neural networks, the requirements on processor computing power, storage capacity and data transmission bandwidth are analyzed, and an architectural design direction is provided for array processing structures.
The invention provides an automated analysis tool that materializes the design space exploration scheme for convolutional neural networks on a general array structure, and supports neural network structures from different deep learning frameworks by connecting to the TVM compiler. Based on the layer-by-layer analysis of the exploration algorithm, the tool can provide a mapping optimization scheme for each convolutional layer on the array processing structure, and exploration of the design space can also provide reference for the optimized design of hardware. Different networks were tested: the effects of different tiling and unrolling parameters and of different data flows on convolutional network performance were compared, and the performance improvement brought by the design space exploration algorithm was evaluated. The general array-structure convolution accelerator model used in the tests has 16 array slices in total with 36 PEs per slice; the inter-slice connectivity is set to 1, i.e., slices share neither on-chip cache nor computing units; and the accelerator's operating frequency is set to 1 GHz. For the storage configuration, DDR4-3200 with a bandwidth of 25.6 GB/s serves as external storage; each slice's on-chip cache capacity is 128 KB, organized as 16 banks with a 32-bit word length; and the SRAM bandwidth is 512 bits/cycle.
YOLOv3 has 75 convolutional layers. Tool testing yields the tiling, unrolling, and data flow information for each of the 75 convolutional layers. Statistically, as shown in Table 5, for the YOLOv3 network, 82.7% of the convolutional layers use the input-stationary data flow between outer-level tiles, and only the first convolutional layer uses the weight-stationary data flow. This is because the input feature map of the network's first convolutional layer is very large while its channel count and kernel size are small, so a weight-stationary data flow is used there. The middle convolutional layers have many input and output channels, so their convolution kernel data are large, making them unsuited to the weight-stationary data flow and suited to the input- and output-stationary flows. The data flow within inner-level tiles is directly influenced by the corresponding tiling parameters. By observation and analysis, the input-channel tiling parameter given by this scheme is often several times the output-channel tiling parameter, so the input-image tile is much larger than the output-image tile, and therefore the input-stationary data flow does not appear at the inner level.
TABLE 5 data flow statistics for optimization strategies
(Table 5 is reproduced as an image in the original publication; its statistics are summarized in the surrounding text.)
Fig. 10 compares the computed delay of YOLOv3 under the variable data flow proposed here and under fixed data flows. The delay of the input-stationary flow is very close to that of the variable data flow, at 1.09 times, because most layers of the variable flow already use IS; when IS is used for the whole network, only a few convolutional layers lose performance. The delay of the weight-stationary flow is 2.88 times that of the variable flow, a large performance loss, because the network's middle layers often have hundreds of input and output channels and large convolution kernels; with a weight-stationary flow, frequent data updates are still required and the time spent on data accesses grows. The delay of the output-stationary flow is 1.45 times that of the variable flow, between IS and WS.
Fig. 11 compares computation delays under different unrolling and tiling strategies. The fixed tiling and unrolling parameters of the latter three configurations are taken from the optimization scheme that occurs most frequently in the first, fully variable configuration. The performance differences due to the tiling and unrolling parameters are smaller than those due to the data flow: relative to fixed unrolling, the optimized delay falls to 0.86 times; relative to fixed tiling, to 0.81 times; and relative to fully fixed, to 0.77 times. Optimizing the convolution layer by layer thus shortens the inference delay of the whole network.
The convolutional neural network data flow design space exploration tool for array processing structures is mainly characterized in that: a design space exploration method for convolutional network data flows on array processing structures is constructed by combining hardware characteristics with the limits of computing and storage resources, providing guidance for mapping convolutional neural network algorithms onto spatial array processing structures; an automated analysis tool supporting general CNN networks is implemented on top of the established exploration algorithm; and design space exploration is performed on the data flow scheduling modes of convolutional neural networks, the requirements on processor computing power, storage capacity and data transmission bandwidth are analyzed, and an architectural design direction is provided for array processing structures.
For the established arrayed hardware architecture, each level is analyzed in detail in terms of loop tiling, loop unrolling and loop interchange, and its mathematical modeling is completed. First, the on-chip cache capacity of each slice is usually insufficient to hold all the data of a deep convolutional neural network, so the convolutional layer must be cut according to the slice's cache capacity, which is loop tiling. The invention partitions the data required for the convolutional layer computation according to the on-chip cache capacity of each slice, completing the loop tiling shown in the diagram of Fig. 12. Table 6 lists all tiling parameters of the seven-level convolution loop nest under the loop tiling strategy. From the tiling parameters, the cache capacity occupied by pixel data, weight data and partial-sum data in a single slice's on-chip cache can be calculated; the parameters Csram_weight, Csram_pixel and Csram_psum are defined:
Csram_weight = Tkx × Tky × Tic × Toc × weight_datawidth    (4)
Csram_pixel = Tix × Tiy × pixel_datawidth    (5)
Csram_psum = Tox × Toy × pixel_datawidth    (6)
Table 6 Summary of loop tiling parameters
Tix, Tiy: tile width and height of the input feature map
Tic: input-channel tile size
Tkx, Tky: tile width and height of the convolution kernel
Tox, Toy: tile width and height of the output feature map
Toc: output-channel tile size
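Formulas (4) to (6) translate directly into the per-slice capacity check used during exploration. A minimal sketch (the function name is ours, the datawidth defaults are assumptions, and the tiling dictionary uses the Table 6 names):

```python
def fits_on_chip(T: dict, c_sram_bits: int,
                 weight_dw: int = 16, pixel_dw: int = 16) -> bool:
    """Check eqs. (4)-(6): the tiled weight, pixel and partial-sum buffers
    must together fit into one slice's on-chip SRAM capacity."""
    c_weight = T["Tkx"] * T["Tky"] * T["Tic"] * T["Toc"] * weight_dw  # eq. (4)
    c_pixel = T["Tix"] * T["Tiy"] * pixel_dw                          # eq. (5)
    c_psum = T["Tox"] * T["Toy"] * pixel_dw                           # eq. (6)
    return c_weight + c_pixel + c_psum <= c_sram_bits
```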
Secondly, in order to fully utilize the on-chip computing units and transmission bandwidth for large-scale data parallelism, the tool analyzes the computing-resource and transmission-bandwidth requirements brought by different loop unrolling strategies and by the mapping modes under each strategy, as shown in equation (7) and in Tables 7 and 8.
Compute = Pkx × Pky × Pox × Poy × Pic × Poc    (7)
Table 7 Memory accesses to DRAM under loop unrolling
(reproduced as an image in the original publication)
Table 8 Memory accesses to SRAM under loop unrolling
(reproduced as an image in the original publication)
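Equation (7) gives the matching computing-resource check: the product of the unrolling parameters is the number of multiply-add operations issued per cycle, which must not exceed the multiply-add units of the logical slice executing the tile. A sketch under the same naming assumptions as above:

```python
def fits_pe_array(P: dict, n_pe_per_slice: int, n_share: int = 1) -> bool:
    """Eq. (7): spatially unrolled multiply-adds per cycle must fit the PE
    array of one logical slice (n_share physical slices pooled together)."""
    compute = (P["Pkx"] * P["Pky"] * P["Pox"]
               * P["Poy"] * P["Pic"] * P["Poc"])
    return compute <= n_pe_per_slice * n_share
```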
Finally, performance evaluation of the array accelerator primarily targets its computation latency, i.e., the time taken to complete the whole convolutional network computation. The total delay tdelay is an important index for describing a convolution acceleration optimization scheme, and its calculation is analyzed for the input-stationary, output-stationary and weight-stationary cases. To analyze the timing of the different data flows, the times for accessing input, output and weight data from DRAM are calculated first: the parameters tload_input, tload_weight, tload_psum and twrite_psum denote, respectively, the time to load inputs, the time to load weights, the time to load partial sums, and the time to write partial sums back to external storage; the parameters Nloop_icT, Nloop_ocT and Nloop_oxyT denote the numbers of executions of the inter-tile ic, oc and oxy loops respectively; and the parameter tcompute denotes the time the PE array takes to compute one tile. Equation (9), which is reproduced as an image in the original publication, combines these quantities into the total delay based on the above analysis.
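Purely as an illustration of the shape such a model takes (this is our assumption, not the patent's equation (9), which additionally distinguishes the IS, WS and OS cases), assuming tile-level double buffering so that DRAM transfers overlap the previous tile's computation, one plausible form is:

t_{delay} \approx t_{load\_input} + t_{load\_weight} + N_{loop\_icT}\, N_{loop\_ocT}\, N_{loop\_oxyT} \cdot \max\big(t_{compute},\ t_{load\_input},\ t_{load\_weight},\ t_{load\_psum} + t_{write\_psum}\big) + t_{write\_psum}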
Based on the same inventive concept, the invention provides a device for constructing a convolutional neural network data flow design space analysis tool, the device comprising:
a hardware parameter acquisition module configured to perform hardware modeling of an array structure accelerator and obtain hardware parameters of the array structure accelerator;
an optimization strategy acquisition module configured to analyze data flow scheduling modes of the seven-level convolution loop nest and obtain an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order; and
an analysis tool construction module configured to quantitatively calculate the convolution delay based on the hardware parameters and the optimization strategy, obtain the optimization parameters with the lowest convolution delay in the design space, and construct the design space analysis tool.
It is understood that the hardware parameter acquisition module, the optimization strategy acquisition module and the analysis tool construction module may be combined and implemented in one module, or any one of them may be split into a plurality of sub-modules, or at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one functional module. According to an embodiment of the present invention, at least one of the hardware parameter acquisition module, the optimization strategy acquisition module and the analysis tool construction module may be at least partially implemented as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on chip, a system on substrate, a system on package, or an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or in a suitable combination of the three implementation approaches of software, hardware and firmware. Alternatively, at least one of the hardware parameter acquisition module, the optimization strategy acquisition module and the analysis tool construction module may be at least partially implemented as a computer program module which, when executed by a computer, can perform the functions of the corresponding module.
Based on the same inventive concept, the invention further provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method for constructing the convolutional neural network data flow design space analysis tool described above.
The readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device, such as, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device such as punch cards or raised in-groove structures having instructions stored thereon, and any suitable combination of the foregoing. The computer programs described herein may be downloaded from a readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives the computer program from the network and forwards it for storage in a readable storage medium in the respective computing/processing device. Computer programs for carrying out operations of the present invention may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can execute computer-readable program instructions to implement various aspects of the present invention by utilizing state information of a computer program to personalize the electronic circuitry.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, systems and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer programs. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the programs, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a computer program may also be stored in a readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the readable storage medium storing the computer program comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the computer program which executes on the computer, other programmable apparatus or other devices implements the functions/acts specified in the flowchart and/or block diagram block or blocks.
Compared with the prior art, the invention has the following beneficial effects:
1. by combining hardware characteristics with the limits of computing and storage resources, a design space exploration method for convolutional network data flows on array processing structures is constructed, providing guidance for mapping convolutional neural network algorithms onto spatial array processing structures;
2. an automated analysis tool is implemented on top of the established exploration algorithm, and by means of the TVM compiler's Relay front end it reads neural networks from different deep learning frameworks, giving the tool a degree of generality;
3. design space exploration is performed on the data flow scheduling modes of convolutional neural networks, the requirements on processor computing power, storage capacity and data transmission bandwidth are analyzed, and an architectural design direction is provided for array processing structures.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example" or "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. And the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments. Furthermore, various embodiments or examples described in this specification can be combined and combined by those skilled in the art.
The above description is only a preferred embodiment of the present invention, and does not limit the present invention in any way. It will be understood by those skilled in the art that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for constructing a convolutional neural network data flow design space analysis tool, characterized in that the method comprises:
performing hardware modeling of an array structure accelerator to obtain hardware parameters of the array structure accelerator;
analyzing data flow scheduling modes of the seven-level convolution loop nest, and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data flows and loop order; and
quantitatively calculating the convolution delay based on the hardware parameters and the optimization strategy, obtaining the optimization parameters with the lowest convolution delay in the design space, and constructing the design space analysis tool.
2. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 1, wherein said method further comprises:
accessing the Relay front end of the TVM compiler to support neural network structures from different deep learning frameworks.
3. The method of constructing a convolutional neural network data stream design space analysis tool as claimed in claim 1, wherein said hardware parameters include computational resource parameters and storage resource parameters.
4. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 3, wherein said computational resource parameters include the number of physical Slices that can operate independently, the Slice connectivity granularity, the number of multiply-add units used for core computation on a single Slice, and the number of LSEs on a single Slice.
5. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 3, wherein said storage resource parameters include the DRAM bandwidth, the total SRAM capacity of each individual Slice, the SRAM bandwidth, and the SRAM partition granularity.
6. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 1, wherein said analyzing data stream scheduling modes of the seven-level convolution loop nest and obtaining an optimization strategy describing loop interchange by establishing a mapping relation between data streams and the loop order comprises:
based on a tiling optimization strategy in which tiling parameters describe the loops and an unrolling optimization strategy in which unrolling parameters describe the loops, obtaining the loop interchange optimization strategy by establishing the mapping relation between data streams and the loop order.
7. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 1, wherein the hardware model built by said hardware modeling comprises an array Slice, a storage module, and a control module.
8. The method for constructing a convolutional neural network data stream design space analysis tool as claimed in claim 1, wherein said obtaining the optimization parameters with the lowest convolution delay in the design space comprises:
calculating all delays by traversing the Slice communication volume, the inner-layer and outer-layer data streams, the loop unrolling parameters, and the loop tiling parameters, and taking the parameter combination with the lowest delay as the optimization parameters.
9. A construction apparatus for a convolutional neural network data stream design space analysis tool, the construction apparatus comprising:
a hardware parameter obtaining module configured to perform hardware modeling for an array structure accelerator, and obtain hardware parameters of the array structure accelerator;
an optimization strategy acquisition module configured to analyze data stream scheduling modes of the seven-level convolution loop nest and obtain an optimization strategy describing loop interchange by establishing a mapping relation between data streams and the loop order;
and an analysis tool construction module configured to quantitatively calculate the convolution delay based on the hardware parameters and the optimization strategy, obtain the optimization parameters with the lowest convolution delay in the design space, and construct the design space analysis tool.
10. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, is capable of implementing the method of constructing a convolutional neural network data stream design space analysis tool as claimed in any one of claims 1 to 8.
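To make the seven-level loop structure referenced in claims 1 and 6 concrete, the sketch below writes out direct convolution as seven nested loops (stride 1, no padding). It is an illustrative reference implementation only; the loop names n, oc, ic, oh, ow, kh, kw are the conventional ones and are not notation taken from the patent.

import numpy as np

def conv7(inp, wgt):
    """Direct convolution as seven nested loops (stride 1, no padding)."""
    N, IC, IH, IW = inp.shape
    OC, _, KH, KW = wgt.shape
    OH, OW = IH - KH + 1, IW - KW + 1
    out = np.zeros((N, OC, OH, OW), dtype=inp.dtype)
    for n in range(N):                            # level 1: batch
        for oc in range(OC):                      # level 2: output channels
            for ic in range(IC):                  # level 3: input channels
                for oh in range(OH):              # level 4: output rows
                    for ow in range(OW):          # level 5: output columns
                        for kh in range(KH):      # level 6: kernel rows
                            for kw in range(KW):  # level 7: kernel columns
                                out[n, oc, oh, ow] += (
                                    inp[n, ic, oh + kh, ow + kw] * wgt[oc, ic, kh, kw]
                                )
    return out

x = np.random.rand(1, 3, 8, 8).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)
assert conv7(x, w).shape == (1, 4, 6, 6)

Tiling splits one of these levels into an outer loop over tiles and an inner loop within a tile, loop interchange reorders the levels, and unrolling maps an inner level across parallel multiply-add units; each legal combination of tiling, interchange, and unrolling parameters is one point in the design space that the traversal of claim 8 evaluates.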
CN202111171756.9A 2021-10-08 2021-10-08 Method and device for constructing convolutional neural network data flow design space analysis tool Pending CN113821981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111171756.9A CN113821981A (en) 2021-10-08 2021-10-08 Method and device for constructing convolutional neural network data flow design space analysis tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111171756.9A CN113821981A (en) 2021-10-08 2021-10-08 Method and device for constructing convolutional neural network data flow design space analysis tool

Publications (1)

Publication Number Publication Date
CN113821981A true CN113821981A (en) 2021-12-21

Family

ID=78916217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111171756.9A Pending CN113821981A (en) 2021-10-08 2021-10-08 Method and device for constructing convolutional neural network data flow design space analysis tool

Country Status (1)

Country Link
CN (1) CN113821981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564168A (en) * 2018-04-03 2018-09-21 中国科学院计算技术研究所 A kind of design method to supporting more precision convolutional neural networks processors
CN112116084A (en) * 2020-09-15 2020-12-22 中国科学技术大学 Convolution neural network hardware accelerator capable of solidifying full network layer on reconfigurable platform

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ANGSHUMAN PARASHAR et al.: "Timeloop: A Systematic Approach to DNN Accelerator Evaluation", 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130672A (en) * 2022-06-08 2022-09-30 武汉大学 Method and device for calculating convolution neural network by software and hardware collaborative optimization
CN115130672B (en) * 2022-06-08 2024-03-08 武汉大学 Software and hardware collaborative optimization convolutional neural network calculation method and device

Similar Documents

Publication Publication Date Title
US9672065B2 (en) Parallel simulation using multiple co-simulators
US20020162097A1 (en) Compiling method, synthesizing system and recording medium
WO2021034587A1 (en) Multiple output fusion for operations performed in a multi-dimensional array of processing units
Nechma et al. Parallel sparse matrix solution for circuit simulation on FPGAs
Smith et al. An architecture design and assessment system for software/hardware codesign
Gurumani et al. High-level synthesis of multiple dependent CUDA kernels on FPGA
Zeni et al. Optimized implementation of the hpcg benchmark on reconfigurable hardware
CN113821981A (en) Method and device for constructing convolutional neural network data flow design space analysis tool
Shen et al. Combining static and dynamic load balance in parallel routing for FPGAs
Bonfietti et al. Maximum-throughput mapping of SDFGs on multi-core SoC platforms
CN111984833A (en) GPU-based high-performance graph mining method and system
Kurowski et al. Parallel and GPU based strategies for selected CFD and climate modeling models
Janjic et al. Lapedo: hybrid skeletons for programming heterogeneous multicore machines in Erlang
Xu et al. Support for software performance tuning on network processors
Yuan et al. Automatic enhanced CDFG generation based on runtime instrumentation
Cano-Cano et al. Speeding up exascale interconnection network simulations with the VEF3 trace framework
Leidel et al. Toward a microarchitecture for efficient execution of irregular applications
Kashi Asynchronous fine-grain parallel iterative solvers for computational fluid dynamics
Madougou et al. Using colored petri nets for GPGPU performance modeling
CN110928705A (en) Communication characteristic model method and system for high-performance computing application
Malazgirt et al. Exploring embedded symmetric multiprocessing with various on-chip architectures
He et al. Efficient communication support in predictable heterogeneous mpsoc designs for streaming applications
Jansen A performance-based recommender system for distributed dnn training
Yasudo et al. Analytical performance estimation for large-scale reconfigurable dataflow platforms
Mahapatra et al. Ex-Drive: An execution driven functional verification flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination