CN116051356A - Rapid style migration method based on image and FPGA system - Google Patents

Rapid style migration method based on image and FPGA system

Info

Publication number
CN116051356A
CN116051356A
Authority
CN
China
Prior art keywords
module
data
attention
convolution
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310049200.5A
Other languages
Chinese (zh)
Inventor
陈盼盼
孙莉
张国和
郑培清
秦玉
侯俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Siyuan Integrated Circuit And Intelligent Technology Research Institute Co ltd
Original Assignee
Jiangsu Siyuan Integrated Circuit And Intelligent Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Siyuan Integrated Circuit And Intelligent Technology Research Institute Co ltd filed Critical Jiangsu Siyuan Integrated Circuit And Intelligent Technology Research Institute Co ltd
Priority to CN202310049200.5A
Publication of CN116051356A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/60 Memory management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Complex Calculations (AREA)

Abstract

The invention relates to the technical field of image processing, and in particular to a rapid image style migration method and an FPGA system. The method comprises optimizing a style migration network through an improved residual network structure and a dual attention mechanism, and realizing the hardware implementation of the improved style migration network through software processing on an ARM CPU combined with programmable hardware logic processing on an FPGA. The improved residual network significantly reduces the parameters of the rapid image style network; the added dual attention structure improves the clarity of the content structure and the expressive power of the style features, thereby improving the quality of the processed image. In addition, because the network is deployed on FPGA hardware and, unlike traditional online image processing, does not depend on a network link, the inference speed of the model is not affected by transmission signal quality, and the privacy of user data can be guaranteed.

Description

Rapid style migration method based on image and FPGA system
Technical Field
The invention relates to the technical field of image processing, in particular to a rapid style migration method based on images and an FPGA system.
Background
The image style migration technique is an image processing method that renders semantic content in different styles: it converts the artistic style of an image while preserving the structure of the original content, so that the generated output combines the content of one image with the texture and aesthetic characteristics of a style image. However, existing image style migration techniques suffer from unclear content structure and inconsistencies in the color, texture, and shape of the processed images.
Neural networks are an artificial-intelligence machine learning technique. Deep convolutional neural networks in particular have received a great deal of attention and have achieved remarkable results in speech recognition, natural language processing, and intelligent image processing, especially image recognition. However, a typical network model reaches on the order of one billion operations and hundreds of megabytes of parameters; for resource-constrained, power-sensitive embedded devices, this enormous computation and parameter volume places stringent demands on any implementation of a convolutional neural network.
There are currently three main hardware accelerator platforms: Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field-Programmable Gate Arrays (FPGAs). GPUs are widely used for neural networks but have high power consumption. ASICs and FPGAs both offer high performance at low power; an ASIC can provide a dedicated architecture tailored to a specific neural network, but at high development and manufacturing cost. The FPGA, as a programmable logic array, is flexible and inexpensive to develop, and serves as a functional simulation and verification platform before an AI chip is taped out, making it well suited to accelerator design research.
An FPGA (Field Programmable Gate Array) is a configurable logic circuit with configurable logic blocks and user input/output interface components; developers can build processing architectures realizing different functions by configuring the relevant switch states inside the FPGA. Fully exploiting the FPGA's advantages of high-performance parallel computation, ultra-low power consumption, and low cost to research high-performance architectures for convolutional neural networks is one of the inevitable trends in the field of artificial intelligence.
Disclosure of Invention
Aiming at the shortcomings of existing algorithms, the invention improves the residual network to significantly reduce the parameters of the rapid image style network; adds a dual attention network structure that improves the clarity of the content structure and the expressive power of the style features, thereby improving the quality of the processed image; and deploys the network on FPGA hardware, so that, unlike traditional online image processing that depends on a network link, the inference speed of the model is not affected by transmission signal quality and the privacy of user data can be guaranteed.
The technical scheme adopted by the invention is as follows: a rapid style migration method based on images comprises the following steps:
step one, optimizing the style migration network through an improved residual network structure and a dual attention mechanism;
further, improving the residual network reduces 256 dimensions to 64 dimensions under one 1x1 convolution layer first by the middle 3x3 convolution layer, and then increases the dimensions to 256 by the 1x1 convolution layer, thereby improving the residual network and reducing the calculation amount while maintaining the accuracy.
Further, the improved dual attention mechanism comprises a position attention module and a channel attention module. The position attention module selectively aggregates the features at each position by computing a weighted sum of the fused features over the whole space, so that similar features are related to each other regardless of distance; the channel attention module selectively emphasizes mutually related channel maps by integrating the related features among all fused channel maps. Finally, the outputs of the two attention modules are added.
Further, the feature map formula of the position attention module is:
K_s = \theta_1 \sum_i A_{ji}^{s} f_i^{3} + f_{cs}

where K_s is the position attention feature map; \theta_1 is a weight coefficient; f_i^{3} is the i-th row feature of the position attention; A_{ji}^{s} is the position attention mask; f_{cs} is the preliminary fusion feature map.
Further, the feature map formula of the channel attention module is:

K_c = \theta_2 \sum_i A_{ji}^{c} f_i^{3t} + f_{cs}

where \theta_2 is a weight coefficient; A_{ji}^{c} is the channel attention mask; f_i^{3t} is the i-th row feature of the channel attention; f_{cs} is the preliminary fusion feature map.
Further, the outputs of the two attention modules are added to obtain the fusion feature map:

F_{cs} = K_s + K_c

where K_s is the feature map of the position attention module and K_c is the feature map of the channel attention module.
And step two, realizing the hardware implementation of the improved style migration network through software processing on an ARM CPU and programmable hardware logic processing on an FPGA.
Further, the FPGA system for rapid image style migration comprises a register module, a data interface module, a convolution module, a pooling module, and a control module. The registers hold memory addresses and network layer parameters; the data interface module performs data-width conversion between the accelerator and the external interface; the convolution module performs the accelerated operation of the convolution and fully connected layers; the pooling module performs the pooling operation; and the control module schedules the other modules.
Further, the convolution module comprises a DMA module, an on-chip cache module, a logic control module, a convolution calculation module, and a read-write address module. The DMA module is responsible for efficient data transfer between the acceleration circuit and external memory; the on-chip cache module stores feature and weight data, and the read-write address module calculates the feature and weight addresses in BRAM; the convolution calculation module is responsible for the multiply-accumulate computation; the logic control module schedules the other modules. The AXI_lite bus carries signalling between the PS and PL ends; the register data corresponding to each convolution layer are written into the hardware accelerator's on-chip buffer, and the PL-end hardware is scheduled to compute the convolution layers.
Further, the control method of the convolution module comprises the following steps:
a multiply-accumulate tree structure is adopted; data are computed in blocks and then stored in the buffer unit;
N input channels and N output channels are computed simultaneously with a convolution kernel of size NxN; in each clock cycle the corresponding feature data are read and copied N times, the weight data for the N input channels are read from the on-chip buffer, and the calculation result is obtained after the multiply-accumulate tree;
and the feature and weight addresses are written into the read-write address module in real time and handed to the DMA, which controls the DDR; the DMA stores the features and weights it reads into BRAM (block RAM); the BRAM read-out address is calculated at the same time, padding (zero filling) is applied, and the data are then passed to the convolution calculation module.
Furthermore, the pooling module uses the AXI bus to transfer input and output data; the pooling parameters are configured in registers over the AXI_lite bus, data are read from the DDR into the pooling module, and the results are then written back to the DDR.
The invention has the beneficial effects that:
1. The problems of blurred generated-image structure and unclear edges caused by insufficient feature fusion are solved, and the expressive power of important image features is improved;
2. The total parameter count of the unit structure is reduced from 1,179,648 to 69,632, about 94%, reducing the amount of computation while maintaining accuracy;
3. The FPGA implementation has extremely small area, high speed, and low power consumption, making it suitable for resource-limited devices, with good application prospects. Realized in FPGA hardware, the method removes the traditional online style migration network's dependence on a network connection, can complete style migration in real time offline, and protects user privacy and information security in certain application scenarios.
Drawings
FIG. 1 is an image style migration network of the present invention;
FIG. 2 is a diagram of a residual network structure before and after modification;
FIG. 3 is a hardware system architecture of the present invention;
FIG. 4 is a flow chart of the data flow control of the present invention;
FIG. 5 is a block diagram of a convolution module of the present invention;
FIG. 6 is a block diagram of a fully connected module of the present invention;
FIG. 7 is a dual attention model of the present invention;
FIG. 8 is a diagram of an image rapid style migration network incorporating dual attention mechanisms of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic illustrations showing only the basic structure of the invention and thus showing only those constructions that are relevant to the invention.
The weight parameters of the pre-trained image style migration algorithm are transferred to the weight data buffer module; the convolution module performs the layer-by-layer computation of the whole convolutional neural network from the buffered weight, bias, and image data, finally obtaining the content loss between the generated image and the content image and the style loss between the generated image and the style image, and the computation continues so as to reduce these losses.
The losses are back-propagated to the image generation network, which is optimized until a qualified image style conversion model is obtained.
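A minimal sketch of this training step in PyTorch follows (gen_net is assumed to be the image generation network and loss_net a fixed, pre-trained VGG wrapper returning a list of feature maps; the choice of content layer and the style weight are assumptions, while the content/style loss structure follows the text):

```python
import torch
import torch.nn.functional as F

def gram(f):
    """Gram matrix: style statistics of a (N, C, H, W) feature map."""
    n, c, h, w = f.shape
    f = f.flatten(2)
    return (f @ f.transpose(1, 2)) / (c * h * w)

def train_step(gen_net, loss_net, content, style, opt, style_w=1e5):
    generated = gen_net(content)
    g_feats = loss_net(generated)  # feature maps from the fixed loss network
    c_feats = loss_net(content)
    s_feats = loss_net(style)
    content_loss = F.mse_loss(g_feats[2], c_feats[2])  # one content layer (assumed)
    style_loss = sum(F.mse_loss(gram(g), gram(s))
                     for g, s in zip(g_feats, s_feats))
    loss = content_loss + style_w * style_loss
    opt.zero_grad()
    loss.backward()  # back-propagate the loss to the image generation network
    opt.step()
    return loss.item()
```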
(1) Image style migration model compression:
As shown in FIG. 1, the style migration network mainly comprises an image generation network and a loss network; the loss network is a convolutional neural network whose parameters are obtained by pre-training and remain fixed throughout the conversion process.
The image generation network mainly comprises downsampling convolution layers, a residual network, and upsampling convolution layers. The image generation network performs the forward and backward style conversion of the picture, while the loss network constrains the feature data of the image during training. First, part of a pre-trained VGG network performs the downsampling operation, shrinking the feature map while deepening the network and reducing the amount of computation. The residual network normalizes the content and style images and matches their feature-map statistics to realize image normalization; adaptive feature-space mapping then generates the target picture, which is restored to an image by upsampling. The normalization layer parameters γ and β are continuously updated until the network converges, so that the adaptive normalization layer realizes conversion to an arbitrary style.
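The statistic-matching normalization step can be sketched as follows (an AdaIN-style formulation is assumed here; the patent describes feature-map statistic matching but does not spell out the exact variant):

```python
import torch

def adaptive_instance_norm(content, style, eps=1e-5):
    """Match the per-channel mean/std of content features to style features.
    content, style: (N, C, H, W) feature maps from the encoder."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    # Whiten the content statistics, then re-color with the style statistics
    return s_std * (content - c_mean) / c_std + s_mean
```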
A model trained by the image style migration algorithm performs well, but a single style model is about 20 MB, with high storage and computation costs that pose great challenges for the storage capacity and computing power of a mobile terminal and make the model difficult to deploy to a hardware platform. The parameters of the whole network model are concentrated mainly in the residual layers: a single residual layer accounts for about 17.6% of the parameters of the whole network model structure, and the five residual layers together account for 88%. The network of the rapid style migration model is therefore compressed by improving the computation of the residual network in the image generation network, replacing the two 3x3 convolution layers with 1x1 and 3x3 convolution layers; the improved network is shown on the right side of FIG. 2, and the existing residual network structure on the left side of FIG. 2. The parameter formula is:
P=N*C*3*3 (1)
in formula (1), P is the total number of parameters, N is the number of convolution kernels, i.e. the number of output channels, and C is the number of input channels. If the number of input channels is 256, the convolution kernel size is 3x3, and the number of output channels is 256, the total number of parameters of the whole unit structure is 3x3x256x256x2 = 1,179,648. The whole structure is optimized with the improved residual block, which reduces the number of input channels feeding the 3x3 convolution kernel: the new structure first reduces 256 dimensions to 64 with a 1x1 convolution layer, then restores them to 256 dimensions with another 1x1 convolution layer, reducing the amount of computation while maintaining accuracy. The number of parameters becomes 1x1x256x64 + 3x3x64x64 + 1x1x64x256 = 69,632, a reduction of about 94% per unit.
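The parameter counts above can be verified with a few lines of arithmetic (a sketch; bias terms are ignored, as in formula (1)):

```python
# Original unit: two 3x3 convolutions, 256 -> 256 channels each
original = 3 * 3 * 256 * 256 * 2  # = 1,179,648

# Improved unit: 1x1 (256->64), 3x3 (64->64), 1x1 (64->256)
improved = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256  # = 69,632

print(original, improved, 1 - improved / original)  # ~0.941, i.e. ~94% fewer
```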
(2) A rapid style migration network based on dual attention mechanisms;
After the content and style features are fused by adaptive normalization, the picture is input into the dual attention module, which effectively combines the picture's spatial and channel information. As shown in FIG. 7, the position attention module selectively aggregates the features at each position by computing a weighted sum of the fused features over the whole space, so that similar features are related to each other regardless of distance; the channel attention module selectively emphasizes mutually related channel maps by integrating the related features among all fused channel maps. Finally the outputs of the two attention modules are added, further improving the picture quality.
Adaptive normalization yields the preliminary fusion feature f_{cs}. First, f_{cs} is input into convolution layers with kernel size 1x1 for compression, giving the matrices f^1, f^2, and f^3. Then f^1 is transposed and matrix-multiplied with f^2 to obtain the association-strength matrix between any two point features, which is normalized with a softmax operation to obtain the attention map A_{ji}^{s} of each position to every other position, where i and j index image pixel positions.
The position attention feature map K_s is calculated as in formula (2): A_{ji}^{s} is first transposed and matrix-multiplied with f_i^{3}, and the result is then combined point-wise with the original features at the corresponding positions:

K_s = \theta_1 \sum_i A_{ji}^{s} f_i^{3} + f_{cs}    (2)

where K_s is the position attention feature map; \theta_1 is a weight coefficient, initialized to 0; f_i^{3} is the i-th row feature of the position attention; A_{ji}^{s} is the position attention mask; f_{cs} is the preliminary fusion feature map.
The channel attention module is similar to the position attention module: to obtain its feature attention, it applies a dimension transformation and matrix multiplication to any two channel features, obtaining the association strength between any two channels; the channel attention feature map K_c is calculated as in formula (3):

K_c = \theta_2 \sum_i A_{ji}^{c} f_i^{3t} + f_{cs}    (3)

where \theta_2 is a weight coefficient, initialized to 0 and gradually learning a larger weight during training; A_{ji}^{c} is the channel attention mask; and f_i^{3t} is obtained by inputting f_{cs} into a convolution layer with kernel size 1x1 for compression.

The feature maps output by the two modules are added at corresponding points to obtain the fusion feature map F_{cs}, as in formula (4), strengthening the weight of the important feature points:

F_{cs} = K_s + K_c    (4)

where K_s is the feature map of the position attention module and K_c is the feature map of the channel attention module.
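For illustration, a compact PyTorch sketch of the two attention branches and their fusion follows (a DANet-style formulation consistent with formulas (2) to (4); the layer names, the 1x1 compression width, and the simplified channel branch without its own 1x1 compression are assumptions):

```python
import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels // 8, 1)  # 1x1 compression for f^1
        self.f2 = nn.Conv2d(channels, channels // 8, 1)  # 1x1 compression for f^2
        self.f3 = nn.Conv2d(channels, channels, 1)       # 1x1 projection for f^3
        self.theta1 = nn.Parameter(torch.zeros(1))       # position weight, init 0
        self.theta2 = nn.Parameter(torch.zeros(1))       # channel weight, init 0

    def forward(self, f_cs):
        n, c, h, w = f_cs.shape
        # Position attention: K_s = theta1 * (attention-weighted f^3) + f_cs
        q = self.f1(f_cs).flatten(2).transpose(1, 2)         # (N, HW, C/8)
        k = self.f2(f_cs).flatten(2)                         # (N, C/8, HW)
        a_s = torch.softmax(q @ k, dim=-1)                   # (N, HW, HW) mask A^s
        v = self.f3(f_cs).flatten(2)                         # (N, C, HW)
        k_s = self.theta1 * (v @ a_s.transpose(1, 2)).view(n, c, h, w) + f_cs
        # Channel attention: association strength between any two channels
        f = f_cs.flatten(2)                                  # (N, C, HW)
        a_c = torch.softmax(f @ f.transpose(1, 2), dim=-1)   # (N, C, C) mask A^c
        k_c = self.theta2 * (a_c @ f).view(n, c, h, w) + f_cs
        return k_s + k_c                                     # F_cs, formula (4)
```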
(3) Building a hardware circuit;
the system structure mainly comprises two parts: the ARM CPU-based software processing system and the FPGA-based programmable hardware logic circuit exert the performance advantages of large-scale parallel computation of the FPGA, and the ARM check network is used for flexible configuration; firstly, carrying out functional division on software and hardware according to the calculated amount and complexity, deploying a task with intensive calculation to an FPGA end, deploying a control task with a core to an ARM end, carrying out hardware control instruction sending and register configuration on a PS end, preparing and preprocessing data, analyzing a network model and extracting parameters, inputting the control parameters by a PL end, and simultaneously completing calculation, thereby improving the operation efficiency and reducing the power consumption; after weight data and bias data generated by pre-training a convolutional neural network to be built are obtained, the PS end reads in the trained model and test data, and the PL end realizes forward propagation calculation.
FIG. 3 shows the overall hardware architecture of the system: the AXI4 bus is responsible for data transfer, the AXI_lite bus for signalling, and the ARM, as the external control unit, controls the internal registers. The FPGA system mainly comprises a register module, a data interface module, a convolution module, a pooling module, and a control module; the registers hold memory addresses and network layer parameters, the data interface module performs data-width conversion between the accelerator and the external interface, the convolution module performs the accelerated operation of the convolution and fully connected layers, the pooling module performs the pooling operation, and the control module schedules the other modules.
FIG. 5 shows the convolution module, comprising a DMA module, an on-chip cache module, a logic control module, a convolution calculation module, and a read-write address module. The DMA module is responsible for efficient data transfer between the acceleration circuit and external memory; the on-chip cache module stores feature and weight data, and the read-write address module calculates the feature and weight addresses in BRAM; the convolution calculation module is responsible for the multiply-accumulate computation, and the logic control module schedules the other modules. The AXI_lite bus carries signalling between the PS and PL ends; the register data corresponding to each convolution layer are written into the hardware accelerator's on-chip buffer, and the PL-end hardware is scheduled to compute the convolution layers.
FIG. 4 is a flow chart of data flow control for a convolution module, comprising:
the convolution module adopts a multiply-accumulate tree structure; data are computed in blocks and then stored in the buffer unit, giving high parallelism, good performance, and improved resource utilization;
the convolution computation processes 4 input channels and 4 output channels simultaneously with a 3*3 convolution kernel: in each clock cycle 36 feature data are read and copied 4 times, 4x4x3x3 = 144 weight data are read from the on-chip cache, and the calculation result is produced after the multiply-accumulate tree;
the method comprises the steps of calculating characteristic weights in real time, writing the characteristic weights into a read-write address module, giving DMA control DDR, storing the read characteristics and weights into BRAM after calculation by the DMA, storing the results into the BRAM, calculating a read-out address of the BRAM, carrying out padding zero-filling, and giving data to a convolution calculation module; and the whole design pipeline operates, internal data is regularly and movably transmitted, and accurate calculation of the data is ensured.
FIG. 6 shows the fully connected module architecture. The overall design is pipelined, with the control and computation modules using a handshake protocol to guarantee normal data transfer; data from the convolution module are sent to the external DDR over AXI4 through the data interface module; the input feature map and the weights are brought into the on-chip cache module by DMA. The fully connected layers have few features but many weights: the first fully connected layer alone has about 98M weight parameters, so the features are cached entirely in BRAM while the weights are read from DDR and computed against the features in BRAM.
The pooling module uses the AXI bus to transfer input and output data, and the AXI_lite bus to configure the pooling parameters through registers; data are read from the DDR into the pooling module and the results are written back to the DDR, which reduces the module's buffering requirements and its resource usage. The circuit of this design first pools the larger dimension horizontally and then pools vertically, storing adjacent data in the on-chip cache to strengthen data locality.
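The horizontal-then-vertical pooling order can be sketched as two one-dimensional passes (a behavioural model only; 2x2 max pooling and an evenly divisible feature map are assumptions):

```python
import numpy as np

def separable_max_pool(x, k=2):
    """Max-pool rows first (horizontal pass), then columns (vertical pass),
    as in the circuit above. x: (H, W) feature map, H and W divisible by k."""
    h, w = x.shape
    rows = x.reshape(h, w // k, k).max(axis=2)          # horizontal pass
    return rows.reshape(h // k, k, w // k).max(axis=1)  # vertical pass

# Check against ordinary 2x2 block pooling
x = np.arange(16, dtype=float).reshape(4, 4)
assert np.array_equal(separable_max_pool(x), x.reshape(2, 2, 2, 2).max(axis=(1, 3)))
```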
The logic control module schedules the operation of each module according to the start signals in the registers; the start signals conv_valid and pool_valid respectively control the start of the convolution module and the pooling module. After the pooling module starts working, it sends the processed data to the DDR; the control module judges from the returned completion signal whether pooling has finished, and when it has, pulls the pool_fin signal high and passes it to a register. When the convolution module starts operating, the dat_run and wt_run signals separately control feature-map reuse and weight reuse: after one convolution completes, if dat_run and wt_run are valid, the address is updated, the weights are reused, and the next computation reads the updated feature-map data while keeping the weights from the previous pass; a counter tracks the number of updates and issues a completion signal when all convolution computations are finished.
The data path adopts a hierarchical, double-buffered design, reducing accesses to off-chip memory and improving working efficiency; data transfer between internal modules uses stage-wise data reuse, reducing internal data accesses. The input and output channels are computed in parallel and mapped onto the multiply-add array, which fuses a multiply-add tree with a two-dimensional systolic multiply-add array; a 6-stage pipeline structure is introduced, balancing resource utilization and acceleration performance.
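The double-buffer (ping-pong) scheme can be sketched as follows (a behavioural illustration only; in hardware the load and compute phases run concurrently, while this sequential Python model shows just the buffer alternation):

```python
def run_tiles(tiles, load, compute):
    """Ping-pong over two buffers: while tile i is consumed from one buffer,
    tile i+1 is (in hardware, concurrently) loaded into the other."""
    buf = [None, None]
    out = []
    for i, t in enumerate(tiles):
        buf[i % 2] = load(t)                       # fill the idle buffer
        if i > 0:
            out.append(compute(buf[(i - 1) % 2]))  # consume the previously filled one
    if tiles:
        out.append(compute(buf[(len(tiles) - 1) % 2]))
    return out

# Usage: double the tile value on "load", add one on "compute"
assert run_tiles([0, 1, 2], lambda t: t * 2, lambda b: b + 1) == [1, 3, 5]
```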
Compared with the prior art, the invention, realized in FPGA hardware, has the advantages of extremely small area, high speed, and low power consumption; it is suitable for resource-limited devices and has good application prospects.
Taking the preferred embodiments described above as illustration, persons skilled in the relevant art can make various changes and modifications without departing from the technical idea of the present invention. The technical scope of the present invention is not limited to the description, but must be determined by the scope of the claims.

Claims (10)

1. A rapid style migration method based on images, characterized by comprising the following steps:
step one, optimizing the style migration network through an improved residual network structure and a dual attention mechanism;
and step two, realizing the hardware implementation of the improved style migration network through software processing on an ARM CPU and programmable hardware logic processing on an FPGA.
2. The rapid style migration method based on images according to claim 1, wherein: the improved residual network first reduces 256 dimensions to 64 dimensions with a 1x1 convolution layer, passes the result through the middle 3x3 convolution layer, and then restores the dimensions to 256 with another 1x1 convolution layer, reducing the amount of computation while maintaining accuracy.
3. The rapid style migration method based on images according to claim 1, wherein: the improved dual attention mechanism comprises a position attention module and a channel attention module; the position attention module selectively aggregates the features at each position by computing a weighted sum of the fused features over the whole space, so that similar features are related to each other regardless of distance; the channel attention module selectively emphasizes mutually related channel maps by integrating the related features among all fused channel maps; finally the outputs of the two attention modules are added.
4. The rapid style migration method according to claim 3, wherein the feature map formula of the position attention module is:

K_s = \theta_1 \sum_i A_{ji}^{s} f_i^{3} + f_{cs}

where K_s is the position attention feature map; \theta_1 is a weight coefficient; f_i^{3} is the i-th row feature of the position attention; A_{ji}^{s} is the position attention mask; f_{cs} is the preliminary fusion feature map.
5. The rapid style migration method according to claim 3, wherein the feature map formula of the channel attention module is:

K_c = \theta_2 \sum_i A_{ji}^{c} f_i^{3t} + f_{cs}

where \theta_2 is a weight coefficient; A_{ji}^{c} is the channel attention mask; f_i^{3t} is the i-th row feature of the channel attention; f_{cs} is the preliminary fusion feature map.
6. The rapid style migration method according to claim 3, wherein the outputs of the two attention modules are added to obtain the fusion feature map:

F_{cs} = K_s + K_c

where K_s is the feature map of the position attention module and K_c is the feature map of the channel attention module.
7. An FPGA system for rapid image style migration, comprising: a register module, a data interface module, a convolution module, a pooling module, and a control module, wherein the registers hold memory addresses and network layer parameters, the data interface module performs data-width conversion between the accelerator and the external interface, the convolution module performs the accelerated operation of the convolution and fully connected layers, the pooling module performs the pooling operation, and the control module schedules the other modules.
8. The FPGA system of claim 7, wherein the convolution module comprises: a DMA module, an on-chip cache module, a logic control module, a convolution calculation module, and a read-write address module, wherein the DMA module is responsible for efficient data transfer between the acceleration circuit and external memory; the on-chip cache module stores feature and weight data, and the read-write address module calculates the feature and weight addresses in BRAM; the convolution calculation module is responsible for the multiply-accumulate computation; the logic control module schedules the other modules; and the AXI_lite bus carries signalling between the PS and PL ends, the register data corresponding to each convolution layer are written into the hardware accelerator's on-chip buffer, and the PL-end hardware is scheduled to compute the convolution layers.
9. The FPGA system of claim 8, wherein the control method of the convolution module comprises:
a multiply-accumulate tree structure is adopted; data are computed in blocks and then stored in the buffer unit;
N input channels and N output channels are computed simultaneously with a convolution kernel of size NxN; in each clock cycle the corresponding feature data are read and copied N times, the weight data for the N input channels are read from the on-chip buffer, and the calculation result is obtained after the multiply-accumulate tree;
and the feature and weight addresses are written into the read-write address module in real time and handed to the DMA, which controls the DDR; the DMA stores the features and weights it reads into BRAM (block RAM); the BRAM read-out address is calculated at the same time, padding (zero filling) is applied, and the data are then passed to the convolution calculation module.
10. The FPGA system of claim 7, wherein the pooling module uses the AXI bus to transfer input and output data, the pooling parameters are configured through registers over the AXI_lite bus, data are read from the DDR into the pooling module, and the results are then written back to the DDR.
CN202310049200.5A 2023-02-01 2023-02-01 Rapid style migration method based on image and FPGA system Pending CN116051356A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310049200.5A CN116051356A (en) 2023-02-01 2023-02-01 Rapid style migration method based on image and FPGA system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310049200.5A CN116051356A (en) 2023-02-01 2023-02-01 Rapid style migration method based on image and FPGA system

Publications (1)

Publication Number Publication Date
CN116051356A true CN116051356A (en) 2023-05-02

Family

ID=86123487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310049200.5A Pending CN116051356A (en) 2023-02-01 2023-02-01 Rapid style migration method based on image and FPGA system

Country Status (1)

Country Link
CN (1) CN116051356A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117621145A (en) * 2023-12-01 2024-03-01 安徽大学 Fruit maturity detects flexible arm system based on FPGA


Similar Documents

Publication Publication Date Title
CN109284817B (en) Deep separable convolutional neural network processing architecture/method/system and medium
US11501415B2 (en) Method and system for high-resolution image inpainting
WO2020073211A1 (en) Operation accelerator, processing method, and related device
US11593658B2 (en) Processing method and device
CN108764466B (en) Convolution neural network hardware based on field programmable gate array and acceleration method thereof
CN107169563B (en) Processing system and method applied to two-value weight convolutional network
CN108665063B (en) Bidirectional parallel processing convolution acceleration system for BNN hardware accelerator
EP3407266A1 (en) Artificial neural network calculating device and method for sparse connection
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN111210019B (en) Neural network inference method based on software and hardware cooperative acceleration
US20200265300A1 (en) Processing method and device, operation method and device
CN113792621B (en) FPGA-based target detection accelerator design method
CN116051356A (en) Rapid style migration method based on image and FPGA system
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
Shahshahani et al. Memory optimization techniques for fpga based cnn implementations
CN113392973A (en) AI chip neural network acceleration method based on FPGA
CN113301221B (en) Image processing method of depth network camera and terminal
Zhao et al. A 307-fps 351.7-GOPs/W deep learning FPGA accelerator for real-time scene text recognition
Yu et al. Optimizing FPGA-based convolutional encoder-decoder architecture for semantic segmentation
CN116011534A (en) FPGA-based general convolutional neural network accelerator implementation method
CN116246110A (en) Image classification method based on improved capsule network
CN115170381A (en) Visual SLAM acceleration system and method based on deep learning
CN112001492B (en) Mixed running water type acceleration architecture and acceleration method for binary weight DenseNet model
Bai et al. An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination