CN111191772A - Intelligent computing general acceleration system facing embedded environment and construction method thereof

Info

Publication number
CN111191772A
Authority
CN
China
Prior art keywords
acceleration
reconfigurable
model
operator
processing unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010003133.XA
Other languages
Chinese (zh)
Other versions
CN111191772B (en)
Inventor
李欣瑶
刘飞阳
高泽
白林亭
文鹏程
李亚晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Aeronautics Computing Technique Research Institute of AVIC
Original Assignee
Xian Aeronautics Computing Technique Research Institute of AVIC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Aeronautics Computing Technique Research Institute of AVIC
Priority to CN202010003133.XA
Publication of CN111191772A
Application granted
Publication of CN111191772B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 - Physical realisation using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Advance Control (AREA)
  • Stored Programmes (AREA)

Abstract

The invention belongs to the field of intelligent computing and discloses a general-purpose intelligent computing acceleration system for embedded environments together with its construction method. The construction method comprises: a preprocessing stage, in which the deep neural network is pre-trained, its model structure is analyzed, and the model's structural features are extracted; a reconfigurable stage, in which the model's structure-analysis data are parsed, the computation of each network layer is accelerated on programmable logic, and dynamic storage and dynamic configuration of the network model are realized on reconfigurable hardware resources; and a post-processing stage, in which the acceleration effect is verified, the result is output, and the programmable computation acceleration unit can be extended. The method accelerates complex deep neural networks markedly, offers good flexibility and applicability, effectively reduces the storage and computation cost of intelligent computing, and provides technical support for simultaneously deploying deep neural networks that meet different application requirements in resource-constrained embedded environments.

Description

Intelligent computing general acceleration system facing embedded environment and construction method thereof
Technical Field
The invention belongs to the field of intelligent computing and provides an intelligent computing acceleration method for embedded environments.
Background
In recent years, deep neural networks, represented by convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have shown great advantages in handling complex intelligence problems such as computer vision and speech recognition, and can serve a variety of applications: (1) decision support based on integrated situation information; (2) real-time target detection, identification, and tracking; and the like. However, the performance of today's deep neural networks rests mainly on enormous parameter counts and the computing power of multi-GPU clusters; the resulting storage and computation cost makes it almost impossible to deploy a deep neural network directly on embedded devices with limited hardware resources. It is therefore necessary to solve the problem of deploying and optimizing diverse deep neural networks in resource-constrained embedded environments.
At present, research on intelligent computing acceleration for embedded environments follows two main directions. The first is based on customizable application-specific integrated circuits (ASICs), which can be tailored and optimized for a specific algorithm and offer low power consumption and high computational efficiency; however, ASICs lack a unified software and hardware development environment, have long development cycles and poor flexibility and generality, can accelerate only one specific deep neural network, and are difficult to bring into embedded environments in the short term. The second direction is based on programmable, semi-custom FPGAs, which offer excellent iteration speed and flexibility, keep the technology independently controllable, and suit the inference stage of deep neural networks; the obstacle is the considerable difficulty of FPGA sequential-logic design.
Existing acceleration methods therefore target neural networks of a single structure and can hardly satisfy the multiple application requirements of an embedded environment at the same time.
Disclosure of Invention
For a set of diverse deep neural networks, the invention provides a reconfigurable intelligent computing acceleration method for embedded environments that is general-purpose yet able to accelerate each deep neural network in a targeted way.
The technical solution of the invention is as follows:
The general-purpose intelligent computing acceleration system for embedded environments targets a set of multiple deep neural networks (the deep neural networks to be accelerated) and comprises a reconfigurable control unit, a reconfigurable storage unit, and a programmable computation acceleration unit.
The reconfigurable storage unit is built from reconfigurable hardware resources and dynamically configures an optimal storage architecture, layer by layer, for the loaded network model, reducing data dependence on external memory.
The programmable computation acceleration unit encapsulates a common processing unit library and a special processing unit library. The common processing unit library accelerates the parts shared across deep neural networks and provides two working modes, CNN and RNN; the special processing unit library accelerates the network-specific parts.
The reconfigurable control unit parses the structure-analysis data of the deep neural network model to be accelerated, loads the network model into the reconfigurable storage unit layer by layer, switches the common processing unit library between the CNN and RNN working modes, and, following the structure-analysis data, wakes or sleeps the required operators in the special processing unit library so as to accelerate each network layer in turn.
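For concreteness, the control flow just described can be sketched in software. The following is a minimal sketch, assuming hypothetical names throughout (LayerInfo, ReconfigurableController, switch_mode, wake, sleep); the patent specifies the behaviour, layer-by-layer loading with mode switching and operator wake/sleep, not an API.

```python
# Illustrative sketch of the reconfigurable control unit's dispatch loop.
# All identifiers are hypothetical stand-ins for hardware control logic.
from dataclasses import dataclass, field

@dataclass
class LayerInfo:
    kind: str                                        # e.g. "conv", "lstm", "fc"
    special_ops: list = field(default_factory=list)  # special-library operators needed

class ReconfigurableController:
    RNN_KINDS = {"lstm", "gru"}

    def __init__(self, common_lib, special_lib, storage):
        self.common_lib = common_lib    # dual-mode common processing unit library
        self.special_lib = special_lib  # gate / multiply-add special library
        self.storage = storage          # reconfigurable storage unit

    def accelerate(self, layers):
        for layer in layers:                       # load the model layer by layer
            self.storage.configure_for(layer)      # dynamic on-chip storage setup
            mode = "RNN" if layer.kind in self.RNN_KINDS else "CNN"
            self.common_lib.switch_mode(mode)      # CNN <-> RNN working mode
            for op in self.special_lib.operators:  # wake needed operators, sleep the rest
                if op.name in layer.special_ops:
                    op.wake()
                else:
                    op.sleep()
            self.common_lib.run(layer)             # accelerate the current layer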
Optionally, the common processing unit library contains a convolution operator, Sigmoid, Tanh, and ReLU activation-function operators, an LSTM operator, a GRU operator, and a fully connected operator.
Optionally, the special processing unit library includes a floating-point multiply-add operator, an update gate operator, a forget gate operator, an input gate operator, and an output gate operator.
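For orientation, the sketch below writes out one LSTM time step in numpy and annotates which library might serve each line: the multiply-add and the forget, input, and output gates from the special library, Sigmoid and Tanh from the common library. The arithmetic is the standard LSTM formulation; the per-line mapping to hardware operators is an illustrative assumption, not a statement of the patent's implementation.

```python
# Textbook LSTM step, annotated with the operators named above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # common library: Sigmoid operator

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W, U, b stack the forget/input/output/candidate
    parameters, so W is (4H, D), U is (4H, H), b is (4H,)."""
    z = W @ x + U @ h + b                  # special library: floating-point multiply-add
    f, i, o, g = np.split(z, 4)
    f = sigmoid(f)                         # special library: forget gate operator
    i = sigmoid(i)                         # special library: input gate operator
    o = sigmoid(o)                         # special library: output gate operator
    c_new = f * c + i * np.tanh(g)         # common library: Tanh operator
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```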
Optionally, the reconfigurable hardware resources include the look-up tables (LUTs) and flip-flops of the FPGA.
Correspondingly, the invention also provides a method for constructing the general-purpose intelligent computing acceleration system for embedded environments, comprising the following steps (a control-flow sketch follows them):
A preprocessing stage: pre-train the deep neural network to be accelerated, analyze the structure of the deep neural network model, and extract the model's structural features.
A reconfigurable stage: parse the model's structure-analysis data, accelerate the computation of each network layer on programmable logic, and realize dynamic storage and dynamic configuration of the network model on reconfigurable hardware resources.
A post-processing stage: verify the acceleration effect of the reconfigurable stage. If the effect meets expectations, output the result in the required format; if not, analyze the model structure further, update and supplement the common and special processing unit libraries according to the model's structural features, package new operator modules (adding them to the corresponding library), return to the reconfigurable stage, and accelerate the model again.
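Viewed as control flow, the three stages form a loop that repeats until verification passes. The sketch below is a minimal rendering of that loop under the assumption that each stage is supplied as a callable; none of the function names come from the patent.

```python
# Minimal control-flow sketch of the three-stage construction method.
# The stage implementations are injected; all names are placeholders.
from typing import Any, Callable

def build_acceleration_system(
    model: Any,
    preprocess: Callable[[Any], Any],         # pre-train, lighten, analyze structure
    accelerate: Callable[[Any], Any],         # reconfigurable stage on programmable logic
    verify: Callable[[Any], bool],            # post-processing acceleration check
    extend_libraries: Callable[[Any], None],  # package and add new operator modules
    max_rounds: int = 3,
) -> Any:
    structure_data = preprocess(model)        # preprocessing stage
    for _ in range(max_rounds):
        result = accelerate(structure_data)   # reconfigurable stage
        if verify(result):                    # effect meets expectations:
            return result                     # output in the required format
        extend_libraries(result)              # otherwise update the operator libraries
    raise RuntimeError("acceleration effect still below expectations")
```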
The preprocessing stage may specifically comprise: obtaining the weight parameters through pre-training, preliminarily reducing the network scale with lightweight techniques, analyzing the structure of the network model, and extracting its structural features.
Applying lightweight techniques to preliminarily reduce the network scale may specifically comprise pruning, sparsification, data quantization, Huffman coding, and binarization/ternarization, which preliminarily reduce the parameter count of the network model.
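As one concrete instantiation of two of the listed techniques, the sketch below applies magnitude pruning followed by symmetric 8-bit quantization with numpy. The patent names the techniques without fixing their parameters, so the sparsity target and bit width here are illustrative choices.

```python
# Illustrative lightweighting pass: magnitude pruning + symmetric int8
# quantization. Sparsity and bit width are example values, not the patent's.
import numpy as np

def prune_by_magnitude(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero the smallest-magnitude weights until the target sparsity is reached."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric linear quantization to int8, returning values and a scale."""
    scale = float(np.abs(w).max()) / 127.0 or 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(64, 64).astype(np.float32)
q, scale = quantize_int8(prune_by_magnitude(w))
print(f"nonzero weights: {np.count_nonzero(q)} / {q.size}, scale = {scale:.5f}")
```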
The invention has the following beneficial effects:
the reconfigurable intelligent computation acceleration method provided by the invention has universality and can pointedly accelerate the deep neural network (accelerate the acceleration operators in deep neural network switching processing unit libraries with different structures), and the defect that the existing network acceleration method can only accelerate a single network is overcome. Particularly, three processing units in a reconfigurable stage are matched efficiently and tightly, a reconfigurable control unit analyzes the model structure analysis data, a reconfigurable storage unit dynamically configures a storage space, on-chip resources are utilized to the maximum extent, a programmable computation acceleration unit fully exerts the advantage of FPGA dynamic reconfiguration, two extensible processing unit libraries are packaged, and the flexibility of network model hardware acceleration is effectively improved.
The method provided by the invention accelerates complex deep neural networks markedly, offers good flexibility and applicability, effectively reduces the storage and computation cost required by intelligent computing, and provides technical support for deploying deep neural networks that meet different application requirements in resource-constrained embedded environments.
Drawings
FIG. 1 is a schematic flow diagram of a reconfigurable intelligent computing acceleration method oriented to an embedded environment;
FIG. 2 is a schematic diagram of a reconfigurable intelligent computing acceleration architecture.
Detailed Description
The present invention is described in further detail below through an embodiment, with reference to the accompanying drawings.
To meet multiple application requirements of embedded environments, such as situation awareness (relying mainly on RNNs) and target identification (relying mainly on CNNs), the embodiment provides a reconfigurable intelligent computing acceleration method for a set of multiple deep neural networks, builds a general-purpose intelligent computing acceleration architecture, and verifies that architecture on a programmable FPGA hardware platform.
The reconfigurable intelligent computing acceleration method for embedded environments comprises three stages: a preprocessing stage, a reconfigurable stage, and a post-processing stage. As shown in FIG. 1, the method proceeds as follows.
First, the preprocessing stage: the deep neural network model to be accelerated is preprocessed, and its weight parameters are obtained through pre-training. Lightweight techniques such as pruning, sparsification, data quantization, Huffman coding, and binarization/ternarization preliminarily reduce the parameter count of the network model, and the model structure is analyzed, laying the foundation for intelligent computing acceleration in the reconfigurable stage.
Second, the reconfigurable stage: convolution in a CNN consumes large amounts of computing and storage resources, while the data flow in an RNN is complex. To satisfy the generality requirement of reconfigurable intelligent computing acceleration while improving network model performance in a targeted, maximal way, a general-purpose intelligent computing acceleration architecture is designed and implemented, as shown in FIG. 2. The architecture consists of a reconfigurable control unit, a programmable computation acceleration unit, and a reconfigurable storage unit.
1. The reconfigurable control unit parses the model structure-analysis data obtained in the preprocessing stage, loads the network model into on-chip data storage layer by layer, switches the common processing unit library in the programmable computation acceleration unit between the CNN and RNN working modes, and wakes or sleeps operators in the special processing unit library according to the structure-analysis data to accelerate the current layer. This process repeats until the accelerated computation of every layer of the network model is complete.
2. Following a modular design, a common processing unit library and a special processing unit library are encapsulated in the programmable computation acceleration unit:
A. The common processing unit library accelerates the parts shared across deep neural networks and provides the CNN and RNN working modes; it includes a convolution operator, Sigmoid, Tanh, and ReLU activation-function operators, an LSTM operator, a GRU operator, a fully connected operator, and other operators.
B. The special processing unit library accelerates the network-specific parts and comprises a floating-point multiply-add operator, an update gate operator, a forget gate operator, an input gate operator, and an output gate operator.
3. The reconfigurable storage unit adopts a reconfigurable hardware structure. Using reconfigurable hardware resources such as the FPGA's look-up tables (LUTs) and flip-flops, it dynamically configures an optimal storage architecture layer by layer for the network model loaded by the reconfigurable control unit, reducing data dependence on external memory and, with it, the latency and power consumption caused by data transfers.
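The storage decision this unit makes can be pictured with a simple working-set estimate: if a layer's weights and input feature map fit on chip, keep them there, otherwise fall back to external memory. The sketch below is a software analogy of that policy; the capacity, data width, and fit rule are illustrative assumptions, not values from the patent.

```python
# Software analogy of layer-by-layer storage configuration: estimate each
# layer's working set and place it on chip when it fits.
from dataclasses import dataclass

@dataclass
class ConvLayer:
    in_ch: int
    out_ch: int
    k: int        # kernel is k x k
    in_h: int
    in_w: int

    def working_set_bytes(self, bytes_per_elem: int = 1) -> int:
        """Weights plus input feature map, e.g. int8 after quantization."""
        weights = self.in_ch * self.out_ch * self.k * self.k
        ifmap = self.in_ch * self.in_h * self.in_w
        return (weights + ifmap) * bytes_per_elem

def plan_storage(layers, on_chip_capacity: int):
    """Assign each layer's working set to on-chip or external memory."""
    return [("on_chip" if layer.working_set_bytes() <= on_chip_capacity
             else "external", layer) for layer in layers]

layers = [ConvLayer(3, 16, 3, 224, 224), ConvLayer(16, 32, 3, 112, 112)]
print([place for place, _ in plan_storage(layers, on_chip_capacity=512 * 1024)])
```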
Because the common processing unit library is designed for the parts shared across network models, the architecture has a degree of generality; because the special processing unit library is designed for each network model's specific structures, the architecture remains flexible and efficient. The reconfigurable hardware structure dynamically configures the on-chip storage space, reducing the high latency and power consumption caused by data communication. The architecture also extends well: updating the processing unit libraries lets it keep pace with rapidly evolving deep neural network models. It thus fully exploits the dynamic reconfigurability of the FPGA, effectively raises the inference speed of deep neural networks, and suits embedded environments where hardware resources are scarce.
Third, the post-processing stage: a hardware verification environment for the reconfigurable intelligent computing acceleration architecture is built on the programmable FPGA and used to verify the acceleration effect of the reconfigurable stage. If the effect meets expectations, the result is output in the required format. If it does not, the computation units the network model needs are missing from the current programmable computation acceleration unit: the network model structure is analyzed further, the common (or special) processing unit library is updated and supplemented according to the model's features, new operator modules are packaged and added to that library, and the flow returns to the reconfigurable stage for a second round of acceleration.
The embodiment fully accounts for the demands, both shared and specific, that neural network models of different depths place on hardware resources. Taking CNNs and RNNs as examples: apart from its input and output layers, a CNN consists of multiple convolutional, pooling, and fully connected layers, whose acceleration operators are all contained in the common (or special) processing unit library; an RNN consists of multiple gated structures and, in this example, contains two structures specific to it.
CNN network acceleration: in the preprocessing stage, the CNN model is pre-trained to obtain its weight parameters, the network scale is reduced with lightweight techniques, the model is analyzed layer by layer, and the structure-analysis data and the network model are passed layer by layer to the reconfigurable intelligent computing acceleration architecture. The reconfigurable stage then accelerates the network model: the reconfigurable control unit loads the network model into the reconfigurable storage unit, which dynamically configures a suitable on-chip storage space for it. Meanwhile, the control unit matches the structure-analysis data against the processing unit libraries in the programmable computation acceleration unit, switches the common processing unit library to the CNN working mode, and wakes the corresponding operators in the special processing unit library. After acceleration completes, the post-processing stage reports the acceleration effect and outputs the result.
RNN network acceleration: the preprocessing stage mirrors that of the CNN (pre-training and model lightweighting), and the structure-analysis data and model parameters are passed together to the reconfigurable intelligent computing acceleration architecture. In the reconfigurable stage, the reconfigurable control unit loads the network model into the reconfigurable storage unit, which dynamically configures a suitable on-chip storage space for it. Meanwhile, the control unit matches the structure-analysis data against the processing unit libraries, switches the common processing unit library to the RNN working mode, and wakes the corresponding operators in the special processing unit library. Because this network contains two structures specific to it, verification in the post-processing stage finds the acceleration effect below expectations; the model is therefore analyzed further, the operators in the common (or special) processing unit library are updated for the specific structures, and if any structure is still not served, new operator modules are packaged and added to the library, after which the flow returns to the reconfigurable stage and the model is accelerated a second time. A sketch of one plausible form of this acceleration-effect check follows.
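The patent does not pin down how the acceleration effect is measured; one plausible form of the check compares the accelerated path against a software reference on both accuracy and latency. In the sketch below, accel_fn and ref_fn are hypothetical stand-ins for the FPGA path and the software reference, and the tolerance and speedup threshold are illustrative.

```python
# Sketch of the post-processing check: the accelerated path must match a
# software reference within tolerance and beat it by a target speedup.
import time
import numpy as np

def verify_acceleration(accel_fn, ref_fn, inputs,
                        atol: float = 1e-2, min_speedup: float = 2.0) -> bool:
    t0 = time.perf_counter()
    ref = [ref_fn(x) for x in inputs]       # software reference pass
    t1 = time.perf_counter()
    out = [accel_fn(x) for x in inputs]     # accelerated pass
    t2 = time.perf_counter()
    accurate = all(np.allclose(r, o, atol=atol) for r, o in zip(ref, out))
    speedup = (t1 - t0) / max(t2 - t1, 1e-9)
    return accurate and speedup >= min_speedup
```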

Claims (7)

1. An intelligent computing general acceleration system for embedded environments, directed to a set of multiple deep neural networks, characterized in that: the system comprises a reconfigurable control unit, a reconfigurable storage unit, and a programmable computation acceleration unit;
the reconfigurable storage unit adopts reconfigurable hardware resources and dynamically configures an optimal storage architecture, layer by layer, for the loaded network model, so as to reduce data dependence on external memory;
the programmable computation acceleration unit encapsulates a common processing unit library and a special processing unit library; the common processing unit library accelerates the parts shared across deep neural networks and provides two working modes, CNN and RNN; the special processing unit library accelerates the network-specific parts;
the reconfigurable control unit parses the structure-analysis data of the deep neural network model to be accelerated, loads the network model into the reconfigurable storage unit layer by layer, switches the common processing unit library in the programmable computation acceleration unit between the CNN and RNN working modes, and, following the model structure-analysis data, wakes or sleeps the required operators in the special processing unit library so as to accelerate each layer of the network in turn.
2. The intelligent computing general acceleration system for embedded environments of claim 1, characterized in that: the common processing unit library comprises a convolution operator, a Sigmoid activation-function operator, a Tanh activation-function operator, a ReLU activation-function operator, an LSTM operator, a GRU operator, and a fully connected operator.
3. The intelligent computing general acceleration system for embedded environments of claim 1, characterized in that: the special processing unit library comprises a floating-point multiply-add operator, an update gate operator, a forget gate operator, an input gate operator, and an output gate operator.
4. The intelligent computing general acceleration system for embedded environments of claim 1, characterized in that: the reconfigurable hardware resources comprise the look-up tables (LUTs) and flip-flops of the FPGA.
5. A method of constructing the intelligent computing general acceleration system for embedded environments of claim 1, comprising:
a preprocessing stage: pre-training the deep neural network to be accelerated, analyzing the structure of the deep neural network model, and extracting the model's structural features;
a reconfigurable stage: parsing the model's structure-analysis data, accelerating the computation of each layer of the network on programmable logic, and realizing dynamic storage and dynamic configuration of the network model on reconfigurable hardware resources; and
a post-processing stage: verifying the acceleration effect of the reconfigurable stage; if the acceleration effect meets expectations, outputting the result in the required format; if it does not, analyzing the model structure further, updating and supplementing the common processing unit library and the special processing unit library according to the model's structural features, packaging new operator modules, returning to the reconfigurable stage, and accelerating the model again.
6. The method of constructing the intelligent computing general acceleration system for embedded environments of claim 5, characterized in that: in the preprocessing stage, weight parameters are obtained through pre-training, the network scale is preliminarily reduced by applying lightweight techniques, the structure of the network model is analyzed, and its structural features are extracted.
7. The method of constructing the intelligent computing general acceleration system for embedded environments of claim 6, characterized in that: applying the lightweight techniques to preliminarily reduce the network scale specifically comprises adopting pruning, sparsification, data quantization, Huffman coding, and binarization/ternarization to preliminarily reduce the parameter count of the network model.
CN202010003133.XA 2020-01-02 2020-01-02 Intelligent computing general acceleration system facing embedded environment and construction method thereof Active CN111191772B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010003133.XA CN111191772B (en) 2020-01-02 2020-01-02 Intelligent computing general acceleration system facing embedded environment and construction method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010003133.XA CN111191772B (en) 2020-01-02 2020-01-02 Intelligent computing general acceleration system facing embedded environment and construction method thereof

Publications (2)

Publication Number Publication Date
CN111191772A (en) 2020-05-22
CN111191772B (en) 2022-12-06

Family

ID=70708099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010003133.XA Active CN111191772B (en) 2020-01-02 2020-01-02 Intelligent computing general acceleration system facing embedded environment and construction method thereof

Country Status (1)

Country Link
CN (1) CN111191772B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108171321A * 2017-12-07 2018-06-15 中国航空工业集团公司西安航空计算技术研究所 Deep neural network embedded design method based on an SoC chip
CN209231976U * 2018-12-29 2019-08-09 南京宁麒智能计算芯片研究院有限公司 Accelerator for reconfigurable neural network algorithms
CN110135572A * 2019-05-17 2019-08-16 南京航空航天大学 SoC-based trainable flexible CNN design method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111967572A (en) * 2020-07-10 2020-11-20 逢亿科技(上海)有限公司 FPGA-based YOLO V3 and YOLO V3 Tiny network switching method
CN112215071A (en) * 2020-09-10 2021-01-12 华蓝设计(集团)有限公司 Vehicle-mounted multi-target coupling identification and tracking method for automatic driving under heterogeneous traffic flow
CN112596718A (en) * 2020-12-24 2021-04-02 中国航空工业集团公司西安航空计算技术研究所 Hardware code generation and performance evaluation method
CN112906887A (en) * 2021-02-20 2021-06-04 上海大学 Sparse GRU neural network acceleration realization method and device
CN113780542A (en) * 2021-09-08 2021-12-10 北京航空航天大学杭州创新研究院 FPGA-oriented multi-target network structure construction method
CN113780542B (en) * 2021-09-08 2023-09-12 北京航空航天大学杭州创新研究院 Method for constructing multi-target network structure facing FPGA

Also Published As

Publication number Publication date
CN111191772B (en) 2022-12-06

Similar Documents

Publication Publication Date Title
CN111191772B (en) Intelligent computing general acceleration system facing embedded environment and construction method thereof
US20230297846A1 (en) Neural network compression method, apparatus and device, and storage medium
CN105159148A (en) Robot instruction processing method and device
Chen et al. OCEAN: An on-chip incremental-learning enhanced processor with gated recurrent neural network accelerators
Sun et al. A high-performance accelerator for large-scale convolutional neural networks
CN112633477A (en) Quantitative neural network acceleration method based on field programmable array
CN108345934A (en) A kind of activation device and method for neural network processor
Xiao et al. FPGA implementation of CNN for handwritten digit recognition
Vasquez et al. Activation density based mixed-precision quantization for energy efficient neural networks
Xiyuan et al. A Review of FPGA‐Based Custom Computing Architecture for Convolutional Neural Network Inference
Vu et al. Efficient optimization and hardware acceleration of cnns towards the design of a scalable neuro inspired architecture in hardware
Kouris et al. A throughput-latency co-optimised cascade of convolutional neural network classifiers
Li et al. High-performance convolutional neural network accelerator based on systolic arrays and quantization
CN108734270B (en) Compatible neural network accelerator and data processing method
Zong-ling et al. The design of lightweight and multi parallel CNN accelerator based on FPGA
Liu et al. A 1D-CRNN inspired reconfigurable processor for noise-robust low-power keywords recognition
CN113780542A (en) FPGA-oriented multi-target network structure construction method
CN113553031A (en) Software definition variable structure computing framework and left-right brain integrated resource joint distribution method realized by using same
CN117521752A (en) Neural network acceleration method and system based on FPGA
US11769036B2 (en) Optimizing performance of recurrent neural networks
Mazouz et al. Automated offline design-space exploration and online design reconfiguration for CNNs
CN116822600A (en) Neural network search chip based on RISC-V architecture
US20220284260A1 (en) Variable quantization for neural networks
Xia et al. PAI-FCNN: FPGA based inference system for complex CNN models
Zhao et al. Research on machine learning optimization algorithm of CNN for FPGA architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant