CN115130672B - Software and hardware collaborative optimization convolutional neural network calculation method and device - Google Patents

Software and hardware collaborative optimization convolutional neural network calculation method and device

Info

Publication number
CN115130672B
Authority
CN
China
Prior art keywords
module
neural network
hardware
convolutional neural
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210642934.XA
Other languages
Chinese (zh)
Other versions
CN115130672A (en)
Inventor
何炎祥
刘芳
胡刚
彭敏
范子蒙
黄飞虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210642934.XA priority Critical patent/CN115130672B/en
Publication of CN115130672A publication Critical patent/CN115130672A/en
Application granted granted Critical
Publication of CN115130672B publication Critical patent/CN115130672B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/10 Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/94 Hardware or software architectures specially adapted for image or video understanding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/96 Management of image or video recognition tasks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Neurology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a device for software and hardware collaborative optimization of convolutional neural network computation. Module analysis and collaborative design are carried out on the ShuffleNetV2 model to realize quantization and an improved calculation unit, and the model is further optimized for the characteristics of a reconfigurable computing device. The method realizes 8-bit quantization and redesigns the depthwise separable convolution operation so that the module operates in a hardware-friendly manner; the experimental work of software and hardware collaborative optimization is completed with HLS (High Level Synthesis) on the Xilinx Zynq XC7Z045 FPGA platform. The invention significantly improves the resource utilization rate and reduces the time delay of the optimized CNN calculation model.

Description

Software and hardware collaborative optimization convolutional neural network calculation method and device
Technical Field
The invention relates to the technical field of deep learning and neural network acceleration, in particular to a method and a device for computing a software and hardware collaborative optimization convolutional neural network.
Background
With the development of deep learning technology, convolutional neural networks (CNNs) have expanded from images to text, video and speech owing to their high inference accuracy and strong adaptivity. CNNs continue to be developed and optimized, and lightweight models such as SqueezeNet, MobileNet, ShuffleNet and Xception have emerged that use special structures or units to reduce the amount of computation. However, in the convolution calculation of conventional CNN variant networks, the core matrix operations demand high computing power, and computational efficiency drops further as network depth increases. How to build an efficient CNN-oriented computing system is a problem to be solved. Existing solutions rely mainly on network-level optimization of the CNN or on designing a hardware accelerator around the characteristics of the CNN.
In the process of implementing the present invention, the present inventors have found that the method of the prior art has at least the following technical problems:
in the current method, a novel deep neural network model needs to be matched with a hardware architecture, so that model characteristics cannot be fully utilized under a general hardware architecture; the neural network model accelerator design does not fully consider the characteristics of the model itself, and therefore more efficient calculation cannot be realized.
Disclosure of Invention
A convolutional neural network (CNN, Convolutional Neural Network) is a highly efficient recognition algorithm applied in fields such as pattern recognition and image processing. The core of the convolution kernel and pooling operations in a CNN is matrix computation, which requires high computing power, and convolution and pooling follow a streaming operation mode. Building efficient CNN-oriented computing systems has traditionally relied on software optimization or hardware accelerators alone, whereas in practice the software algorithms and hardware architectures of neural networks interact with each other.
In view of the above, the present invention aims to provide a method and a device for software and hardware collaborative optimization of convolutional neural network computation, which solve or at least partially solve the technical problems of low computing resource utilization and high computing delay in the prior art.
In order to solve the technical problem, a first aspect of the present invention provides a method for optimizing convolutional neural network computation by software and hardware cooperation, including:
s1: analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model;
s2: compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain compressed weight parameters;
s3: dividing an original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolutional module and a pooling module according to functions, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency module of each module is obtained according to the module multiplexing rate and the module parallel number, and the parameter scale is obtained after dividing the compressed weight parameters into each module;
s4: constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module, and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
s5: according to the divided modules, the constructed hardware architecture and the set calling sequence, simulation experiments are carried out, modules with calculation efficiency lower than a preset value and modules with calling frequency exceeding a threshold value in the hardware implementation process are found, the found modules are optimized, the calculation flow of a convolution module is designed, and then iteration is carried out according to hardware resource consumption and time delay of an original convolution neural network model so as to achieve the optimal operation effect, so that an optimized software implementation scheme is obtained;
s6: based on the optimized software implementation scheme, the calculation sequence of the module influencing the calculation efficiency and the module with the calling frequency exceeding the threshold value in the hardware architecture implementation process is redesigned, and the whole hardware architecture scheme is designed according to the hardware resource consumption and the time delay of the original convolutional neural network model.
In one embodiment, the original convolutional neural network model is a ShuffleNetV2 model.
In one embodiment, step S5 of optimizing the found modules includes:
the hardware resources are tiled for the modules whose calculation efficiency is lower than the preset value so as to increase parallelism;
and performing function splitting on the modules with calling frequencies exceeding the threshold value to reduce the calling times of the single module.
In one embodiment, the convolution module includes a channel dividing unit and a calculating unit, and the calculating process of designing the convolution module in step S5 includes:
the feature image is forwarded to the convolution module, and an identifier Flag is set to distinguish the category of the convolution operation: when Flag = PW, the operation is channel (pointwise) convolution; when Flag = DW, the operation is spatial (depthwise) convolution; when Flag = CON, the operation is ordinary convolution;
the channel segmentation unit checks whether the input feature image has already been divided into two parts along the channel dimension; if not, it splits the input feature image and passes the split feature images to the calculation unit, wherein the calculation unit comprises a plurality of PEs (processing elements) and an addition tree;
and the calculation unit multiplies the split feature images by the weights in parallel across the PEs, then completes the convolution operation through a group of addition trees, and finally adds the bias to each output result to obtain the convolved three-dimensional feature matrix.
In one embodiment, step S6 includes:
compressing the data through the parameter quantization operation so that the convolutional neural network model can be stored in block RAM (BRAM), wherein data reading is realized over an AXI4 bus, and ARM core commands are transmitted over the AXI4 bus;
when the convolutional neural network model completes its calculation, an interrupt signal is generated and returned to the ARM core through interrupt control, marking the completion of the inference computation; the ARM core then fetches the inference result from the BRAM, completing the whole inference process;
and using the ARM core on the FPGA to perform overall control of the system, invoking the convolutional neural network model for inference, and recording information and parameter configuration.
In one embodiment, the method further comprises:
and predicting the original data by using the deployed original neural network model to obtain a prediction result.
Based on the same inventive concept, a second aspect of the invention provides a device for software and hardware collaborative optimization of convolutional neural network computation, which comprises:
the analysis module is used for analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model;
the compression module is used for compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain the compressed weight parameters;
the software design module is used for dividing the original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolution module and a pooling module according to function, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency of each module is obtained from the module multiplexing rate and the number of parallel module instances, and the parameter scale is obtained by apportioning the compressed weight parameters to each module;
the hardware architecture construction module is used for constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
the module optimizing module is used for carrying out simulation experiments according to the divided modules, the constructed hardware architecture and the set calling sequence, finding out modules with calculation efficiency lower than a preset value and modules with calling frequency exceeding a threshold value in the hardware implementation process, optimizing the found modules, designing the calculation flow of the convolution module, and then carrying out iteration according to hardware resource consumption and time delay of the original convolution neural network model to achieve the optimal operation effect, so as to obtain an optimized software implementation scheme;
the calculation optimization module is used for redesigning the calculation sequence of the module influencing the calculation efficiency and the module with the calling frequency exceeding the threshold value in the hardware architecture implementation process based on the optimized software implementation scheme, and designing the whole hardware architecture scheme according to the hardware resource consumption and the time delay of the original convolutional neural network model.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, implements the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor implementing the method according to the first aspect when executing said program.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
the invention provides a software and hardware collaborative optimization convolutional neural network calculation method, which comprises the steps of firstly analyzing an original convolutional neural network model, then compressing weight parameters of the original convolutional neural network model by adopting quantization operation, dividing the model into functional modules and constructing a hardware architecture comprising the functional modules, optimizing the model by simulation experiments, redesigning calculation sequences of the model in the hardware architecture implementation process based on an optimized software implementation scheme, and designing an overall hardware architecture scheme according to hardware resource consumption and time delay. The invention accelerates the computation of CNN in the form of cooperation of software and hardware, designs the hardware architecture supporting the model from the hardware perspective while reasonably designing the neural network model architecture and the parameter quantity, and ensures the accuracy and the high efficiency of the whole computing system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a software and hardware collaborative optimization convolutional neural network calculation method in an embodiment of the invention;
FIG. 2 is a diagram of the overall hardware architecture of a system designed in a reconfigurable computing platform in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a calculation flow and implementation framework of a convolution module in a hardware architecture according to an embodiment of the present invention;
FIG. 4 is a block diagram of a computer-readable storage medium according to an embodiment of the present invention;
fig. 5 is a block diagram of a computer device in an embodiment of the invention.
Detailed Description
The inventor of the present application found through a great deal of research and practice that:
the existing method mainly depends on network optimization of CNN or design of hardware accelerator according to the characteristics of CNN. In the current research, a novel deep neural network model needs to be matched with a hardware architecture, so that model characteristics can not be fully utilized under a general hardware architecture; the neural network model accelerator design does not fully consider the characteristics of the model itself, and therefore more efficient calculation cannot be realized. In practice, an important guiding idea for constructing the efficient AI system is that the software and hardware are cooperatively designed, and the algorithm and the hardware architecture of the neural network are mutually affected in the calculation process, so that the efficient AI system has quite high coupling degree. Therefore, it is necessary to combine software optimization with hardware optimization to perform optimization in a form of cooperation of software and hardware.
Based on the above consideration, the invention provides a method and a device for optimizing convolutional neural network calculation by cooperation of software and hardware, which accelerate CNN calculation in a form of cooperation of software and hardware, design a hardware architecture supporting the model from the hardware perspective while reasonably designing the neural network model architecture and the parameter quantity, and ensure the accuracy and the high efficiency of the whole calculation system.
In order to achieve the above object, the present invention is mainly conceived as follows:
(1) Starting from the variant convolutions of the ShuffleNetV2 model, carrying out collaborative design, realizing quantization, improving the convolution calculation unit and optimizing the data arrangement;
(2) Realizing the deployment of the ShuffleNetV2 model on an FPGA development board and carrying out module analysis;
(3) Verifying the method of the invention on the FPGA, where experimental data show that the method improves the resource utilization rate and reduces the delay, achieving a certain degree of optimization for ShuffleNetV2 through the software-hardware collaboration method.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The embodiment of the invention provides a method for computing a software and hardware collaborative optimization convolutional neural network, which comprises the following steps:
s1: analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model;
s2: compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain compressed weight parameters;
s3: dividing an original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolutional module and a pooling module according to functions, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency module of each module is obtained according to the module multiplexing rate and the module parallel number, and the parameter scale is obtained after dividing the compressed weight parameters into each module;
s4: constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module, and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
s5: according to the divided modules, the constructed hardware architecture and the set calling sequence, simulation experiments are carried out, modules with calculation efficiency lower than a preset value and modules with calling frequency exceeding a threshold value in the hardware implementation process are found, the found modules are optimized, the calculation flow of a convolution module is designed, and then iteration is carried out according to hardware resource consumption and time delay of an original convolution neural network model so as to achieve the optimal operation effect, so that an optimized software implementation scheme is obtained;
s6: based on the optimized software implementation scheme, the calculation sequence of the module influencing the calculation efficiency and the module with the calling frequency exceeding the threshold value in the hardware architecture implementation process is redesigned, and the whole hardware architecture scheme is designed according to the hardware resource consumption and the time delay of the original convolutional neural network model.
In prior-art methods, a novel deep neural network model needs a matched hardware architecture, so model characteristics cannot be fully utilized on a general hardware architecture; meanwhile, neural network accelerator designs do not fully consider the characteristics of the model itself, so more efficient calculation cannot be realized. In practice, the software algorithm and the hardware architecture of a neural network affect each other during computation and are highly coupled. The invention therefore accelerates CNN computation through software-hardware collaboration: while reasonably designing the neural network model architecture and parameter quantity, it designs, from the hardware perspective, a hardware architecture that supports the model, ensuring the accuracy and efficiency of the whole computing system.
In the AI system design flow, a deep learning algorithm is first designed for the target application scenario, and the neural network model is then optimized to realize hardware acceleration. The optimization generally includes model compression and fixed-point quantization to reduce the workload and improve the peak performance of the AI accelerator; the hardware architecture design is then adjusted according to the algorithm optimization strategy, and these three steps are iterated repeatedly until the index requirements of the target application scenario are met. After the hardware architecture design is completed, a customized software compiler converts the neural network model into an instruction sequence executed at runtime, and the compiler automatically completes scheduling and operator optimization to further improve the calculation efficiency of the hardware. The software and hardware collaborative design takes reconfigurable computing technology as its core, fully exploits the performance of the programmable device, and better optimizes the performance of the neural network model.
Referring to fig. 1, a flowchart of an implementation method of a software and hardware collaborative optimization convolutional neural network in an embodiment of the invention is shown.
Specifically, step S1 evaluates the hardware resource consumption and latency of the model, and the effect of the hardware resource consumption and latency obtained by the evaluation is to balance the resource consumption and the operation efficiency. Since different hardware platforms have different amounts and proportions of resources, dynamic adjustments are required to choose to increase resource consumption to increase parallelism or decrease resource consumption to increase multiplexing rate. The original convolutional neural network model may be a ShuffleNetv2 model or other model based on conventional convolutional calculations.
The specific roles of the hardware resource consumption and time delay obtained by the evaluation are as follows: providing an acceptable range for the parameter quantity in the quantization operation of step S2, providing the lowest multiplexing rate of each module in the division operation of step S3, providing a feasible range of time delay and resource consumption in the simulation operation of step S5, and providing the index requirements of each level in the design operation of step S6, wherein the resource consumption indexes include DSP, BRAM, FF and LUT, and the operating efficiency indexes include FPS, GOPS, latency and clock frequency.
The compressed weight parameters obtained in step S2 reduce the storage pressure and reduce or eliminate the time delay of accessing external memory.
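To make the quantization step concrete, the following is a minimal C++ sketch of compressing float weights to 8 bits. The patent only states that weights are compressed to 8 bits by a quantization operation; the symmetric per-tensor min-max scheme, the function name and the clamping range below are illustrative assumptions.

```cpp
#include <cstdint>
#include <cmath>
#include <algorithm>

// Quantize a float weight tensor to int8 with one per-tensor scale.
// Returns the scale so results can be dequantized after inference.
float quantize_weights(const float* w, int8_t* q, int n) {
    float max_abs = 0.0f;
    for (int i = 0; i < n; ++i)
        max_abs = std::max(max_abs, std::fabs(w[i]));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (int i = 0; i < n; ++i) {
        long v = std::lround(w[i] / scale);       // round to nearest
        v = std::min(127L, std::max(-127L, v));   // clamp to int8 range
        q[i] = static_cast<int8_t>(v);
    }
    return scale;
}
```

The returned scale would be kept on the host (ARM) side so that inference outputs can be mapped back to real values; this host-side bookkeeping is also an assumption, since the patent does not specify where dequantization happens.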
In step S3, the calling frequency equals the module multiplexing rate multiplied by the number of parallel module instances, and the calling frequency provides an optimization range for step S5; the parameter scale is obtained by apportioning the parameters obtained in step S2 to each module.
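As a small worked example of this relation (with hypothetical names and numbers, not values from the patent): a module reused 16 times per inference and instantiated 4 times in parallel has a calling frequency of 64.

```cpp
// Calling frequency = module multiplexing rate x number of parallel
// module instances, as stated above. Struct and field names are
// illustrative only.
struct ModuleStats {
    int multiplex_rate;   // times the module is reused per inference
    int parallel_num;     // parallel hardware instances of the module
    int calling_frequency() const { return multiplex_rate * parallel_num; }
};

// Example: ModuleStats{16, 4}.calling_frequency() == 64
```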
In step S4, each module is implemented on a different hardware resource area, and a module may be instantiated multiple times to realize tiling and increase parallelism. After all modules and calculation flows in step S4 are determined, step S5 performs a simulation test of the whole inference process. During implementation, the modules that seriously affect calculation efficiency and the modules with excessive calling frequency in the hardware implementation can be found from the simulation run report generated by Vitis HLS (High Level Synthesis).
Step S6, based on the optimized implementation scheme of step S5, redesigns the calculation order in the hardware architecture of the modules identified in step S5, implements pipelining, adjusts parallelism, and designs the overall hardware architecture scheme according to the time delay and resource consumption requirements.
In general, the invention discloses a method and a device for software and hardware collaborative optimization of convolutional neural network computation based on the idea of software-hardware co-optimization: module analysis and collaborative design are carried out on ShuffleNetv2, a lightweight CNN model, and the co-optimized network is realized by means of quantization, an improved calculation unit and an optimized shuffling module. HLS (High Level Synthesis) is used to verify the experimental work of software and hardware collaborative acceleration of ShuffleNetV2 computation on the Xilinx Zynq XC7Z045 FPGA platform. The invention significantly improves the resource utilization rate and reduces the time delay of the optimized lightweight CNN model ShuffleNetv2.
In one embodiment, the original convolutional neural network model is a ShuffleNetV2 model.
In one embodiment, step S5 of optimizing the found modules includes:
the hardware resources are tiled for the modules whose calculation efficiency is lower than the preset value so as to increase parallelism;
and performing function splitting on the modules with calling frequencies exceeding the threshold value to reduce the calling times of the single module.
Specifically, hardware resource tiling is performed on modules with low computing efficiency (calculation efficiency below the preset value) to increase parallelism, which reduces their impact on the overall operating efficiency.
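One plausible way such tiling looks in Vitis HLS (the tool named above) is sketched here: the multiply loop is unrolled so several processing lanes work in parallel, and the arrays are partitioned so the replicated lanes can all access data in the same cycle. The array size, unroll factor and function name are assumptions for illustration.

```cpp
#include <cstdint>

const int N = 64;

// Tile a low-efficiency multiply stage across 4 parallel lanes.
void pe_tile(const int8_t in[N], const int8_t w[N], int32_t out[N]) {
#pragma HLS ARRAY_PARTITION variable=in  cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=w   cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=out cyclic factor=4
    for (int i = 0; i < N; ++i) {
#pragma HLS UNROLL factor=4   // four multiplications per clock cycle
        out[i] = static_cast<int32_t>(in[i]) * w[i];
    }
}
```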
In one embodiment, the convolution module includes a channel dividing unit and a calculating unit, and the calculating process of designing the convolution module in step S5 includes:
the feature image is forwarded to the convolution module, and an identifier Flag is set to distinguish the category of the convolution operation: when Flag = PW, the operation is channel (pointwise) convolution; when Flag = DW, the operation is spatial (depthwise) convolution; when Flag = CON, the operation is ordinary convolution;
the channel segmentation unit checks whether the input feature image has already been divided into two parts along the channel dimension; if not, it splits the input feature image and passes the split feature images to the calculation unit, wherein the calculation unit comprises a plurality of PEs (processing elements) and an addition tree;
and the calculation unit multiplies the split feature images by the weights in parallel across the PEs, then completes the convolution operation through a group of addition trees, and finally adds the bias to each output result to obtain the convolved three-dimensional feature matrix.
Referring to fig. 3, a calculation flow and implementation framework diagram of the convolution module in the hardware architecture in an embodiment of the present invention is shown.
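To make the flow of fig. 3 concrete, below is a minimal HLS-style C++ sketch of the compute unit: the Flag selects the convolution category, the PEs multiply inputs by weights in parallel, an addition tree (inferred by HLS from the fully unrolled reduction) sums the products, and the bias is added last. The tap counts, data widths and names are assumptions consistent with the 8-bit quantization described above, not the patent's exact implementation.

```cpp
#include <cstdint>

enum ConvFlag { PW, DW, CON };  // channel (1x1), spatial (depthwise), ordinary

// PE array plus addition tree for one output value.
template <int TAPS>
int32_t pe_adder_tree(const int8_t x[TAPS], const int8_t w[TAPS], int32_t bias) {
    int32_t prod[TAPS];
#pragma HLS ARRAY_PARTITION variable=prod complete
    for (int i = 0; i < TAPS; ++i) {
#pragma HLS UNROLL              // one PE (multiplier) per tap
        prod[i] = static_cast<int32_t>(x[i]) * w[i];
    }
    int32_t acc = 0;
    for (int i = 0; i < TAPS; ++i) {
#pragma HLS UNROLL              // HLS balances this into an adder tree
        acc += prod[i];
    }
    return acc + bias;          // bias added after the tree
}

// Control skeleton: the Flag picks the kernel shape feeding the PE array.
// The channel split (halving the feature map along the channel dimension)
// is assumed to have run before this call.
int32_t conv_dispatch(ConvFlag flag,
                      const int8_t* x, const int8_t* w, int32_t bias) {
    switch (flag) {
        case PW:  return pe_adder_tree<4>(x, w, bias);  // 1x1 over 4 channels (assumed)
        case DW:  return pe_adder_tree<9>(x, w, bias);  // one 3x3 kernel per channel
        default:  return pe_adder_tree<9>(x, w, bias);  // ordinary 3x3 convolution
    }
}
```

Keeping the reduction fully unrolled lets the synthesizer balance the additions into a log2-depth tree rather than a serial accumulation chain, which is what makes the adder-tree formulation hardware friendly.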
in one embodiment, step S6 includes:
compressing the data through the parameter quantization operation so that the convolutional neural network model can be stored in block RAM (BRAM), wherein data reading is realized over an AXI4 bus, and ARM core commands are transmitted over the AXI4 bus;
when the convolutional neural network model completes its calculation, an interrupt signal is generated and returned to the ARM core through interrupt control, marking the completion of the inference computation; the ARM core then fetches the inference result from the BRAM, completing the whole inference process;
and using the ARM core on the FPGA to perform overall control of the system, invoking the convolutional neural network model for inference, and recording information and parameter configuration. Here the system refers to the model inference system, which comprises all the modules it implements (the data loading module, the data storage module, the full connection module, and the like) together with the internal memory and external memory.
Referring to fig. 2, a system overall hardware architecture designed in a reconfigurable computing platform according to an embodiment of the invention is shown.
Specifically, step S6 designs the overall hardware architecture scheme based on a reconfigurable computing platform: the quantized weights are uploaded to and stored in BRAM (block RAM), the whole system is deployed on the FPGA, and the system flow is controlled through the ARM core, realizing the overall architecture of the system.
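A minimal Vitis HLS top-level sketch consistent with this architecture is shown below: quantized weights stay in on-chip BRAM, feature data moves over an AXI4 master port, and control sits on an AXI4-Lite slave port whose completion signal can drive the interrupt back to the ARM core. Port names, buffer sizes and the placeholder loop body are illustrative assumptions.

```cpp
#include <cstdint>

const int WEIGHT_WORDS = 4096;   // assumed on-chip weight budget

void cnn_top(const int8_t* ifm, int8_t* ofm, int len) {
#pragma HLS INTERFACE m_axi     port=ifm offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi     port=ofm offset=slave bundle=gmem
#pragma HLS INTERFACE s_axilite port=len
#pragma HLS INTERFACE s_axilite port=return   // ap_done can raise the interrupt

    static int8_t weights[WEIGHT_WORDS];       // quantized weights kept on chip
#pragma HLS BIND_STORAGE variable=weights type=ram_2p impl=bram

    for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
        // Placeholder compute touching the BRAM weights; the real design
        // would run the full convolution/pooling/full-connection pipeline.
        ofm[i] = ifm[i] + weights[i % WEIGHT_WORDS];
    }
}
```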
In one embodiment, the method further comprises:
and predicting the original data by using the deployed original neural network model to obtain a prediction result.
Specifically, after step S6 there is also S7: the feasibility of the hardware structure is evaluated as a whole, and a hardware architecture scheme that balances resource consumption and time delay is obtained. The deployed original neural network model (such as ShuffleNetv2) is then used to predict on the original data and obtain the prediction result.
To illustrate the computational optimization effect of the present invention on convolutional neural networks, see the experimental results in Tables 1 and 2. Table 1 compares the run time and accuracy of the present invention against a ShuffleNetv2 model with the same number of parameters running on a CPU and a GPU. In terms of accuracy, the invention drops by only 0.813% relative to the original model running on the CPU, which shows that the method does not excessively damage model accuracy. Meanwhile, the parameter counts are kept consistent except for the data cache used to release data dependences. On this basis, the invention still runs 9.8x faster than the single-core CPU and 1.5x faster than the GPU.
TABLE 1 Comparison of run time and accuracy between the invention and a same-parameter ShuffleNetv2 model running on CPU and GPU

Index/method      The invention (FPGA XC7Z045)   GPU       CPU
Run time (ms)     28.78                          43.4      283.8
Top-1 accuracy    75.027%                        75.84%    75.1%
TABLE 2 Comparison of the invention with other methods accelerating the same-parameter ShuffleNetv2 model on different acceleration indexes

The data comparing the experiments of the present invention with the baseline experiment and other acceleration methods are shown in Table 2. The indexes include DSP (number of digital signal processing slices), BRAM (number of block RAMs), FF (number of flip-flops), LUT (number of look-up tables), FPS (frames transmitted per second), GOPS (giga operations per second, a universal index for evaluating neural network computing power), Latency, and Clock (clock frequency). The baseline model is the pre-optimization data, and Methods 1-3 are the baseline methods against which the invention is compared. DSP, BRAM, FF and LUT reflect hardware resource usage: the baseline model uses as many as 1276 BRAMs but only 95 DSPs, giving low resource utilization and high memory consumption. The optimized model greatly increases DSP usage and reduces BRAM usage; meanwhile, the optimized latency is only 26.3% of the original. Method 1 uses the same dataset as the present invention, and with approximately the same hardware resource consumption the latency of the invention is only 65% to 72% of it. Method 2 uses the same model as the present invention, but it should be noted that Method 2 was tested on the ImageNet dataset, so its reported FPS figures are open to doubt. In addition, the device used in Method 2 is a ZU3EG with a clock frequency of 250 MHz, whereas the device used here runs at only 100 MHz; nevertheless, the results show that the GOPS of the invention improves by 1.59x, and the latency also drops from 104.3 ms to 28.78 ms. Method 3, which is similar to the method proposed by the invention, accelerates by changing the data rearrangement method, yet the time delay of the present method is still only 32% of it. These comparisons demonstrate that the method of the invention improves resource utilization and reduces time delay.
Example two
Based on the same inventive concept, the embodiment provides a device for computing a software and hardware collaborative optimization convolutional neural network, which comprises:
the analysis module is used for analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model;
the compression module is used for compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain the compressed weight parameters;
the software design module is used for dividing the original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolution module and a pooling module according to function, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency of each module is obtained from the module multiplexing rate and the number of parallel module instances, and the parameter scale is obtained by apportioning the compressed weight parameters to each module;
the hardware architecture construction module is used for constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
the module optimizing module is used for carrying out simulation experiments according to the divided modules, the constructed hardware architecture and the set calling sequence, finding out modules with calculation efficiency lower than a preset value and modules with calling frequency exceeding a threshold value in the hardware implementation process, optimizing the found modules, designing the calculation flow of the convolution module, and then carrying out iteration according to hardware resource consumption and time delay of the original convolution neural network model to achieve the optimal operation effect, so as to obtain an optimized software implementation scheme;
the calculation optimization module is used for redesigning the calculation sequence of the module influencing the calculation efficiency and the module with the calling frequency exceeding the threshold value in the hardware architecture implementation process based on the optimized software implementation scheme, and designing the whole hardware architecture scheme according to the hardware resource consumption and the time delay of the original convolutional neural network model.
Because the device described in the second embodiment of the present invention is the device used to implement the method for software and hardware collaborative optimization of convolutional neural network computation in the first embodiment, based on the method described in the first embodiment, a person skilled in the art can understand the specific structure and variations of the device, which are therefore not detailed here. All devices used in the method of the first embodiment of the present invention fall within the intended scope of protection.
Example III
As shown in fig. 4, based on the same inventive concept, the present invention also provides a computer-readable storage medium 300, on which a computer program 311 is stored, which program when executed implements the method as described in embodiment one.
Because the computer readable storage medium described in the third embodiment of the present invention is a computer readable storage medium used for implementing the method for optimizing the calculation of the convolutional neural network by combining the software and hardware according to the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the modification of the computer readable storage medium, and therefore, the details are not repeated here. All computer readable storage media used in the method according to the first embodiment of the present invention are included in the scope of protection.
Example IV
Based on the same inventive concept, the present application also provides a computer device, as shown in fig. 5, including a memory 401, a processor 402, and a computer program 403 stored in the memory and capable of running on the processor, where the processor 402 implements the method in the first embodiment when executing the program.
Because the computer device described in the fourth embodiment of the present invention is a computer device used for implementing the method for optimizing the calculation of the convolutional neural network by the cooperation of software and hardware in the first embodiment of the present invention, based on the method described in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and the deformation of the computer device, and therefore, the details are not repeated here. All computer devices used in the method of the first embodiment of the present invention are within the scope of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims and the equivalents thereof, the present invention is also intended to include such modifications and variations.

Claims (8)

1. The method for computing the software and hardware collaborative optimization convolutional neural network is characterized by comprising the following steps of:
s1: analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model, wherein indexes related to the hardware resource consumption and the time delay comprise the number of digital signal processing elements, the number of block random access memories, the number of triggers, the number of lookup tables, the number of transmission frames per second, the time for 10 hundred million times of operations per second, the delay rate and the clock frequency;
s2: compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain compressed weight parameters;
s3: dividing an original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolutional module and a pooling module according to functions, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency module of each module is obtained according to the module multiplexing rate and the module parallel number, and the parameter scale is obtained after dividing the compressed weight parameters into each module;
s4: constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module, and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
s5: according to the divided modules, the constructed hardware architecture and the set calling sequence, simulation experiments are carried out, modules with calculation efficiency lower than a preset value and modules with calling frequency exceeding a threshold value in the hardware implementation process are found, the found modules are optimized, the calculation flow of a convolution module is designed, and then iteration is carried out according to hardware resource consumption and time delay of an original convolution neural network model so as to achieve the optimal operation effect, so that an optimized software implementation scheme is obtained;
s6: based on the optimized software implementation scheme, redesigning the calculation sequence of the module influencing the calculation efficiency and the module with the calling frequency exceeding the threshold value in the hardware architecture implementation process, and designing the whole hardware architecture scheme according to the hardware resource consumption and the time delay of the original convolutional neural network model;
wherein, the optimizing of the found modules in step S5 includes:
the hardware resources are tiled for the modules whose calculation efficiency is lower than the preset value so as to increase parallelism;
and performing function splitting on the modules with calling frequencies exceeding the threshold value to reduce the calling times of the single module.
2. The method for computing a software and hardware collaborative optimization convolutional neural network of claim 1, wherein the original convolutional neural network model is a ShuffleNetV2 model.
3. The method for computing the software and hardware collaborative optimization convolutional neural network according to claim 1, wherein the convolutional module comprises a channel segmentation unit and a computing unit, and the computing process of designing the convolutional module in step S5 comprises the following steps:
the feature image is forwarded to the convolution module, and an identifier Flag is set to distinguish the category of the convolution operation: when Flag = PW, the operation is channel (pointwise) convolution; when Flag = DW, the operation is spatial (depthwise) convolution; when Flag = CON, the operation is ordinary convolution;
the channel segmentation unit checks whether the input feature image has already been divided into two parts along the channel dimension; if not, it splits the input feature image and passes the split feature images to the calculation unit, wherein the calculation unit comprises a plurality of PEs (processing elements) and an addition tree;
and the calculation unit multiplies the split feature images by the weights in parallel across the PEs, then completes the convolution operation through a group of addition trees, and finally adds the bias to each output result to obtain the convolved three-dimensional feature matrix.
4. The method for computing a software-hardware co-optimized convolutional neural network of claim 1, wherein step S6 comprises:
compressing the data through the parameter quantization operation so that the convolutional neural network model can be stored in block RAM (BRAM), wherein data reading is realized over an AXI4 bus, and ARM core commands are transmitted over the AXI4 bus;
when the convolutional neural network model completes its calculation, an interrupt signal is generated and returned to the ARM core through interrupt control, marking the completion of the inference computation; the ARM core then fetches the inference result from the BRAM, completing the whole inference process;
and using the ARM core on the FPGA to perform overall control of the system, invoking the convolutional neural network model for inference, and recording information and parameter configuration.
5. The method for computing a software and hardware co-optimized convolutional neural network of claim 1, further comprising:
and predicting the original data by using the deployed original neural network model to obtain a prediction result.
6. A device for software and hardware collaborative optimization of convolutional neural network computation, characterized by comprising:
the analysis module is used for analyzing the original convolutional neural network model to obtain hardware resource consumption and time delay of the original neural network model, wherein the indexes related to the hardware resource consumption and the time delay comprise the number of digital signal processing elements, the number of block random access memories, the number of flip-flops, the number of lookup tables, the number of frames transmitted per second, the number of giga operations per second, the latency and the clock frequency;
the compression module is used for compressing the weight parameters of the original convolutional neural network model by adopting quantization operation to obtain the compressed weight parameters;
the software design module is used for dividing the original convolutional neural network model into a data loading module, a data storage module, a full connection module, a convolution module and a pooling module according to function, and counting the calling frequency and the parameter scale of each module, wherein the calling frequency of each module is obtained from the module multiplexing rate and the number of parallel module instances, and the parameter scale is obtained by apportioning the compressed weight parameters to each module;
the hardware architecture construction module is used for constructing a hardware architecture sequentially comprising a data loading module, a data storage module, a full connection module, a convolution module and a pooling module and setting the calling sequence of each module, wherein the data loading module is used for reading data in batches according to calculation requirements; the data storage module is used for temporarily storing the intermediate results calculated by all other modules in the internal storage; the full connection module, the convolution module and the pooling module are respectively used for realizing full connection calculation, convolution calculation and pooling operation, and after multiple iterations of convolution, pooling and activation functions, the results are classified through a full connection layer;
the module optimization module is used for carrying out simulation experiments on the divided modules, the constructed hardware architecture and the set calling sequence, finding the modules whose calculation efficiency is lower than a preset value and the modules whose calling frequency exceeds a threshold during hardware implementation, optimizing the found modules, designing the calculation flow of the convolution module, and then iterating against the hardware resource consumption and latency of the original convolutional neural network model to achieve the optimal operation effect, thereby obtaining an optimized software implementation scheme;
the computing optimization module is used for redesigning, on the basis of the optimized software implementation scheme, the calculation order of the modules that limit calculation efficiency and of the modules whose calling frequency exceeds the threshold during hardware architecture implementation, and for designing the overall hardware architecture scheme according to the hardware resource consumption and latency of the original convolutional neural network model;
wherein optimizing the found modules in the module optimization module comprises:
tiling the hardware resources of the modules whose calculation efficiency is lower than the preset value, so as to increase parallelism;
and splitting the functions of the modules whose calling frequency exceeds the threshold, so as to reduce the number of calls to any single module.
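The two optimizations named in claim 6 can be pictured with the C++ sketch below, under assumed sizes: tiling replicates the datapath so that TILE output lanes are computed in parallel, and function splitting replaces many calls to a single-element module with one call that fills a whole tile. All names and sizes here are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t TILE = 4;  // assumed number of parallel lanes per tile

// Before splitting: a module called once per output element,
// so its call frequency grows with the output size.
float mac_one(const float* x, const float* w, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i] * w[i];
    return acc;
}

// After tiling + splitting: one wide call fills TILE outputs at once.
// On the FPGA the lane loop maps to TILE parallel copies of the datapath,
// and the call count of the module drops by a factor of TILE.
void mac_tile(const float* x, const float* w_rows, std::size_t n, float* out) {
    for (std::size_t lane = 0; lane < TILE; ++lane) {  // parallel lanes in hardware
        out[lane] = mac_one(x, w_rows + lane * n, n);
    }
}
```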
7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method according to any one of claims 1 to 5.
8. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when the program is executed.
CN202210642934.XA 2022-06-08 2022-06-08 Software and hardware collaborative optimization convolutional neural network calculation method and device Active CN115130672B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210642934.XA CN115130672B (en) 2022-06-08 2022-06-08 Software and hardware collaborative optimization convolutional neural network calculation method and device

Publications (2)

Publication Number Publication Date
CN115130672A (en) 2022-09-30
CN115130672B (en) 2024-03-08

Family

ID=83377605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210642934.XA Active CN115130672B (en) 2022-06-08 2022-06-08 Software and hardware collaborative optimization convolutional neural network calculation method and device

Country Status (1)

Country Link
CN (1) CN115130672B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115906917B (en) * 2022-11-09 2024-01-30 武汉大学 Neural network model deployment method and device based on model algorithm division

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239824A (en) * 2016-12-05 2017-10-10 北京深鉴智能科技有限公司 Apparatus and method for realizing sparse convolution neutral net accelerator

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107392308A (en) * 2017-06-20 2017-11-24 中国科学院计算技术研究所 A kind of convolutional neural networks accelerated method and system based on programming device
CN111178518A (en) * 2019-12-24 2020-05-19 杭州电子科技大学 Software and hardware cooperative acceleration method based on FPGA
CN111198817A (en) * 2019-12-30 2020-05-26 武汉大学 SaaS software fault diagnosis method and device based on convolutional neural network
CN111931918A (en) * 2020-09-24 2020-11-13 深圳佑驾创新科技有限公司 Neural network accelerator
CN113570036A (en) * 2021-07-08 2021-10-29 清华大学 Hardware accelerator architecture supporting dynamic neural network sparse model
CN113792621A (en) * 2021-08-27 2021-12-14 杭州电子科技大学 Target detection accelerator design method based on FPGA
CN113821981A (en) * 2021-10-08 2021-12-21 上海交通大学 Method and device for constructing convolutional neural network data flow design space analysis tool

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Hardware and Algorithm Co-Optimization for Pointwise Convolution and Channel Shuffle in ShuffleNet V2; Zimeng Fan et al.; 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC); pp. 3212-3217 *
Software-Hardware Co-Optimization for CNNs Based on Reconfigurable Devices; Fang Liu et al.; 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom); pp. 1279-1284 *
Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?; Carballo-Hernandez, W. et al.; arXiv; 2021-04-16; pp. 1-6 *
Research on the Application of Kaldi-based AI Speech Recognition in Embedded Systems; Peng Yanzi; Bai Jie; Cao Bingyao; Song Yingxiong; Industrial Control Computer; 2020-09-25 (09); pp. 67-70 *
An Optimization Method for FPGA Convolutional Neural Network Accelerators Based on Improved Dynamic Configuration; Chen Peng; Chen Qingqing; Wang Haixia; Zhang Yilong; Liu Yipeng; Liang Ronghua; High Technology Letters (03); pp. 32-39 *

Also Published As

Publication number Publication date
CN115130672A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
CN108280514B (en) FPGA-based sparse neural network acceleration system and design method
JP6726246B2 (en) Method and apparatus for performing operations in a convolutional neural network and non-transitory storage medium
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
Guo et al. A survey of FPGA-based neural network accelerator
Lu et al. SpWA: An efficient sparse winograd convolutional neural networks accelerator on FPGAs
CN108205701B (en) System and method for executing convolution calculation
CN111898733B (en) Deep separable convolutional neural network accelerator architecture
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
US20200210843A1 (en) Training and application method of a multi-layer neural network model, apparatus and storage medium
CN113469350B (en) Deep convolutional neural network acceleration method and system suitable for NPU
CN112200300B (en) Convolutional neural network operation method and device
CN112734020B (en) Convolution multiplication accumulation hardware acceleration device, system and method of convolution neural network
CN115130672B (en) Software and hardware collaborative optimization convolutional neural network calculation method and device
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN111240746A (en) Floating point data inverse quantization and quantization method and equipment
CN112771546A (en) Operation accelerator and compression method
CN114005458A (en) Voice noise reduction method and system based on pipeline architecture and storage medium
Zhang et al. Hardware-software codesign of weight reshaping and systolic array multiplexing for efficient CNNs
CN112200310A (en) Intelligent processor, data processing method and storage medium
CN109740733B (en) Deep learning network model optimization method and device and related equipment
Huang et al. Pushing the envelope of dynamic spatial gating technologies
CN118043821A (en) Hybrid sparse compression
TWI798591B (en) Convolutional neural network operation method and device
WO2021120036A1 (en) Data processing apparatus and data processing method
Wan et al. ADS-CNN: Adaptive Dataflow Scheduling for lightweight CNN accelerator on FPGAs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant