WO2023115814A1

WO2023115814A1 - FPGA hardware architecture, data processing method therefor and storage medium

Info

Publication number: WO2023115814A1
Application number: PCT/CN2022/095365
Authority: WO
Inventors: 曹其春; 董刚; 胡克坤; 杨宏斌; 尹文枫; 王斌强
Original assignee: 苏州浪潮智能科技有限公司
Priority date: 2021-12-22
Filing date: 2022-05-26
Publication date: 2023-06-29
Also published as: CN113963241A; CN113963241B

Abstract

The present application relates to an FPGA hardware architecture, a data processing method and apparatus, a computer device, and a storage medium. The method comprises: acquiring first picture feature data to be processed; inputting the first picture feature data into a pedestrian re-identification network model to obtain classified identification information, a structural block of the pedestrian re-identification network model comprising a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module which are connected in sequence, each of the forward hierarchical connection group and the backward hierarchical connection group comprising multiple first structural units, a first structural unit comprising a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network which are connected in sequence, and the channel scale selection module comprising a summation unit.

Description

FPGA hardware architecture and its data processing method and storage medium

Cross References to Related Applications

This application claims the priority of the Chinese patent application with the application number 202111579432.9 and the application title "FPGA hardware architecture and its data processing method, storage medium" submitted to the China Patent Office on December 22, 2021, the entire contents of which are incorporated by reference in this application.

Technical field

The present application relates to the technical field of FPGA hardware processing, in particular to an FPGA hardware architecture and a data processing method, device, computer equipment and storage medium.

Background

Current FPGA-based hardware architectures for neural network acceleration focus mainly on improving the following: FPGA (Field Programmable Gate Array) computing power, network accuracy, and network model size. Such an architecture almost always includes an on-chip cache, a convolution acceleration module, a pooling module, a load module, a save module, and an instruction control module. The hardware architecture itself is not especially difficult to implement, but the software compilation is relatively difficult to implement.

Software compilation needs to adapt to different network models and remain compatible with changes in the FPGA hardware, while also providing users with an easy-to-operate interface. At present it is still difficult to achieve all of these at once. The reason is that the FPGA hardware architecture varies too much: the configurable parameters of each module can be changed according to requirements, for example the degree of parallelism of the convolution modules. In addition, there are many kinds of network models and many open-source network model platforms, which further diversifies FPGA hardware architectures.

The inventor realized that, because software and hardware must be coordinated, when an FPGA hardware architecture is used to accelerate neural network processing of image data, a more complicated software configuration makes the FPGA hardware architecture built on it more complicated as well. How to design software and hardware together and simplify the FPGA hardware architecture has therefore become an urgent problem to be solved.

Summary of the invention

A data processing method applied to an FPGA hardware architecture, comprising: acquiring first picture feature data to be processed; and inputting the first picture feature data into a pedestrian re-identification network model based on contextual multi-scale feature learning to obtain classification and identification information output by the pedestrian re-identification network model; wherein a structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence; the forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units; each first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence; and the channel scale selection module includes a summation unit.

In one of the embodiments, the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the first picture feature data through its plurality of first structural units; the backward hierarchical connection group is used to perform cross-scale information fusion, through its plurality of first structural units, on the information output by each first structural unit in the forward hierarchical connection group; and the channel scale selection module is used to sum, through the summation unit, the information output by each first structural unit in the backward hierarchical connection group to obtain the classification and identification information.

In one of the embodiments, the pedestrian re-identification network model further includes a 3x3 convolutional network module connected to the forward hierarchical connection group; the 3x3 convolutional network module is used to process the first picture feature data with a plurality of separable 3x3 convolutional networks to obtain second picture feature data; and the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the second picture feature data through the plurality of first structural units.

In one of the embodiments, the pedestrian re-identification network model further includes a translation convolution module connected to the forward hierarchical connection group; the translation convolution module is used to process the first picture feature data through a translation operation and a 1x1 convolutional network to obtain third picture feature data; and the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the third picture feature data through the plurality of first structural units.

In one of the embodiments, the translation convolution module includes a second structural unit with the same structure as the first structural unit, and the translation convolution module is used to process the first picture feature data through one second structural unit, or through a plurality of second structural units connected in sequence, to obtain the third picture feature data.

In one embodiment, the translation convolution module further includes a pooling unit, and the pooling unit is located between the first and the second of the plurality of sequentially connected second structural units.

In one embodiment, the first picture feature data is feature data obtained by processing the original picture data with the quantization algorithm of the arbitrary-bit quantization network DoReFa-Net.

A data processing device applied to an FPGA hardware architecture, comprising: an acquisition module for acquiring first picture feature data to be processed; and a processing module for inputting the first picture feature data into a pedestrian re-identification network model based on contextual multi-scale feature learning to obtain classification and identification information output by the pedestrian re-identification network model; wherein a structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence; the forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units; each first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence; and the channel scale selection module includes a summation unit.

An FPGA hardware architecture, comprising a central processing unit, a memory, a computing unit processing component, a pooling component, a residual component and a controller. The central processing unit is used to receive picture feature data to be processed and to store the picture feature data in the memory. One or more arithmetic logic unit matrices are arranged in the computing unit processing component, which is used to read the picture feature data from the memory and perform 1x1 convolution processing on the picture feature data through the arithmetic logic unit matrix. The pooling component is used to pool the picture feature data output by the computing unit processing component. The residual component is used to perform residual accumulation processing on the picture feature data output by the pooling component and/or the picture feature data output by the computing unit processing component. The controller is used to control the computing unit processing component to read the picture feature data, weight data and translation parameters from the memory according to a pedestrian re-identification network model so as to perform 1x1 convolution processing, and to control whether the pooling component pools the feature data output by the computing unit processing component, whether the residual component reads picture feature data from the memory, and whether residual processing is performed on the picture feature data input to the residual component. A structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. Each first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit.

A computer device comprising a memory and one or more processors, wherein computer-readable instructions are stored in the memory and, when executed by the one or more processors, cause the one or more processors to perform the steps of any one of the above data processing methods applied to an FPGA hardware architecture.

One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of any one of the above data processing methods applied to an FPGA hardware architecture.

The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below. Other features and advantages of the application will be apparent from the description, drawings, and claims.

Description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description serve to explain the principles of the application.

In order to illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a structural block diagram of an FPGA hardware architecture provided by the present application according to one or more embodiments;

FIG. 2 is a schematic flow diagram of a data processing method applied to FPGA hardware architecture provided by the present application according to one or more embodiments;

FIG. 3 is a network block diagram of an existing pedestrian re-identification network model based on contextual multi-scale feature learning provided by the present application according to one or more embodiments;

FIG. 4 is a network block diagram of the internal model of DW Conv provided by the present application according to one or more embodiments;

FIG. 5 is a network block diagram of an improved pedestrian re-identification network model based on contextual multi-scale feature learning provided by the present application according to one or more embodiments;

FIG. 6 is an example diagram of convolution calculation of a translation operation provided by the present application according to one or more embodiments;

FIG. 7 is an example diagram of convolution calculation of a translation operation using average grouping, provided by the present application according to one or more embodiments;

FIG. 8 is a schematic diagram of the dx and dy position identifiers of the convolution kernel of the translation operation, provided by the present application according to one or more embodiments;

FIG. 9 is an example diagram of convolution calculation of a translation operation with a convolution kernel of 5 provided by the present application according to one or more embodiments;

FIG. 10 is an example diagram of 61 offset parameter coordinates provided by the present application according to one or more embodiments;

FIG. 11 is an example diagram of an 8-bit fixed point of arbitrary bit quantization DoReFa-Net provided by the present application according to one or more embodiments;

FIG. 12 is a detailed hardware structure diagram of an FPGA hardware architecture provided by the present application according to one or more embodiments;

FIG. 13 is a structural block diagram of a data processing device applied to FPGA hardware architecture provided by the present application according to one or more embodiments;

FIG. 14 is an internal structure diagram of a computer device provided by the present application according to one or more embodiments.

Detailed description

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

The data processing method applied to an FPGA hardware architecture provided by the present application can be applied to the FPGA hardware architecture shown in FIG. 1. As shown in FIG. 1, the FPGA hardware architecture includes a central processing unit, a memory, a computing unit processing component, a pooling component, a residual component and a controller. The central processing unit is connected to the memory and, on receiving the first picture feature data to be processed, stores the first picture feature data in the memory. The computing unit processing component mainly handles the 1x1 convolutional networks. An arithmetic logic unit (ALU) matrix is provided in the computing unit processing component, and the size of the matrix can be set according to the available hardware resources. The computing unit processing component reads the first picture feature data from the memory, performs 1x1 convolution processing on it through the ALU matrix, and outputs the convolution result to the pooling component. The pooling component performs pooling processing on the information output by the computing unit processing component. The residual component is connected to both the pooling component and the computing unit processing component, and can perform residual accumulation processing on the information output by the pooling component, on the information output by the computing unit processing component, or on both.
The controller controls the computing unit processing component to read the first picture feature data, weight data and translation parameters from the memory according to the pedestrian re-identification network model so as to perform 1x1 convolution processing, and controls whether the pooling component pools the feature data output by the computing unit processing component, whether the residual component reads other picture feature data from the memory, and whether residual processing is performed on the picture feature data input to the residual component. The structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. Each first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit. The controller can therefore control the other components according to the pedestrian re-identification network model so as to implement the data processing method of the present application.

The data processing method of the present application uses the FPGA hardware architecture shown in FIG. 1. Specifically, the computing unit processing component reads the first picture feature data from the memory. The arithmetic logic unit matrix in the computing unit processing component implements the operations of the first structural units in the forward and backward hierarchical connection groups of the pedestrian re-identification network model, and the residual component implements the operation of the summation unit in the channel scale selection module. When the method requires pooling, the pooling component pools the information output by the computing unit processing component, after which the residual component performs residual accumulation processing. When the FPGA hardware architecture is used to process picture data, the method therefore simplifies the hardware required of the FPGA hardware architecture and achieves network acceleration with fewer resources.
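As an illustration only, the control flow described above (1x1 convolution, then optional pooling, then optional residual accumulation) can be sketched in Python. The function names, the list-of-lists tensor layout [channel][row][column], and the choice of 2x2 max pooling are assumptions made for this sketch; the application does not specify the pooling type or any software interface.

```python
def conv1x1(x, w):
    """1x1 convolution. x is [C_in][H][W]; w is [C_out][C_in]."""
    h, wd = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(wd)] for i in range(h)]
            for o in range(len(w))]


def max_pool2x2(x):
    """2x2 max pooling with stride 2 per channel (assumed pooling type)."""
    return [[[max(ch[i][j], ch[i][j + 1], ch[i + 1][j], ch[i + 1][j + 1])
              for j in range(0, len(ch[0]) - 1, 2)]
             for i in range(0, len(ch) - 1, 2)]
            for ch in x]


def residual_add(x, skip):
    """Element-wise accumulation of two tensors of identical shape."""
    return [[[a + b for a, b in zip(r1, r2)]
             for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(x, skip)]


def run_layer(x, w, use_pool=False, skip=None):
    """Hypothetical controller step: the flags decide which stages run."""
    y = conv1x1(x, w)              # computing unit processing component
    if use_pool:
        y = max_pool2x2(y)         # pooling component
    if skip is not None:
        y = residual_add(y, skip)  # residual component
    return y
```

In this sketch the two flags of `run_layer` play the role of the controller decisions: whether pooling is applied, and whether a second tensor is read and accumulated.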

In one embodiment, as shown in FIG. 2, a data processing method applied to an FPGA hardware architecture is provided. Taking its application to the FPGA hardware architecture in FIG. 1 as an example, the method comprises the following steps:

S202. Acquire feature data of the first picture to be processed.

In this embodiment, a data processing method applied to an FPGA hardware architecture is used to accelerate processing of image data based on a neural network of the FPGA hardware architecture, so as to quickly identify classification information of an image. In this embodiment, the acquired first picture feature data is the picture feature data obtained after feature data extraction of the picture to be processed.

S204. Input the feature data of the first picture into the pedestrian re-identification network model based on contextual multi-scale feature learning, and obtain classification and recognition information output by the pedestrian re-identification network model.

The structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. Each first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit.

The forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the first picture feature data through its plurality of first structural units. The backward hierarchical connection group is used to perform cross-scale information fusion, through its plurality of first structural units, on the information output by each first structural unit in the forward hierarchical connection group. The channel scale selection module is used to sum, through the summation unit, the information output by each first structural unit in the backward hierarchical connection group to obtain the classification and identification information.

In this embodiment, the pedestrian re-identification network model based on contextual multi-scale feature learning is used to process the first picture feature data to obtain the classification and identification information of the original picture corresponding to the first picture feature data. It should be noted that this model is obtained by improving the existing pedestrian re-identification network model based on contextual multi-scale feature learning. The existing model is described below:

The existing pedestrian re-identification network model, referred to as CMSNet (Contextual Multi-Scale Feature Learning for Person Re-Identification), is a contextual multi-scale network used to learn public and contextual multi-scale representations simultaneously. As shown in FIG. 3, the building blocks of CMSNet obtain contextual multi-scale representations through bidirectional hierarchical connection groups, which include a forward hierarchical connection group for step-by-step inter-scale information fusion and a backward hierarchical connection group for cross-scale information fusion. In FIG. 3, HFCG denotes the forward hierarchical connection group and BHCG denotes the backward hierarchical connection group. As also shown in FIG. 3, CMSNet includes a CSS (Channel-Wise Scale Selection) module, which uses operations such as b-softmax, fully connected layers and matrix products. In FIG. 3, Conv 1x1 denotes a 1x1 convolutional network, and the internal model of DW Conv is shown in FIG. 4. In FIG. 4, GConv 3x3 denotes a separable 3x3 convolution, BN (Batch Normalization) denotes a batch normalization network, and ReLU denotes a linear activation function network. In FIG. 3, b₁, b₂, b₃ and b₄ denote the output information of the corresponding DW Conv, and AvgPod denotes the averaging node.

Taking FIG. 3 and FIG. 4 together, it can be seen that the existing pedestrian re-identification network model based on contextual multi-scale feature learning uses separable 3x3 convolutions and, in the CSS structure, operations such as b-softmax, fully connected layers and matrix products. The calculation methods and resource usage of all these modules must be taken into account when designing the hardware of the FPGA hardware architecture, which greatly increases the complexity of the hardware design.

In this embodiment, improvements are made to the existing pedestrian re-identification network model based on contextual multi-scale feature learning described above. As shown in FIG. 5, the improved model still includes three modules: the forward hierarchical connection group HFCG, the backward hierarchical connection group BHCG and the channel scale selection module CSS. However, the convolutional units contained in the improved HFCG and BHCG are no longer DW Conv units but first structural units, namely CSC units. As shown in FIG. 5, the CSC unit includes Conv 1x1, BN, Shift and ReLU, where Shift denotes the translation operation, Conv 1x1 denotes a 1x1 convolutional network, BN denotes a batch normalization network, and ReLU denotes a linear activation function network. The improved channel scale selection module CSS includes only the summation unit Sum.
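A minimal Python sketch of one such structural unit may help make the order of operations concrete. It follows the sequence stated for the first structural unit (1x1 convolution, batch normalization, translation, 1x1 convolution, batch normalization, activation). Everything else is an assumption of the sketch: the function names, the [channel][row][column] list layout, batch normalization folded into a per-channel scale and bias, zero padding at the borders, and the convention that an offset (dx, dy) makes each output pixel read the input at (row + dy, column + dx).

```python
def conv1x1(x, w):
    """1x1 convolution. x is [C_in][H][W]; w is [C_out][C_in]."""
    h, wd = len(x[0]), len(x[0][0])
    return [[[sum(w[o][c] * x[c][i][j] for c in range(len(x)))
              for j in range(wd)] for i in range(h)]
            for o in range(len(w))]


def batch_norm(x, scale, bias):
    """Inference-time batch normalization folded to scale * x + bias."""
    return [[[scale[c] * v + bias[c] for v in row] for row in ch]
            for c, ch in enumerate(x)]


def shift(x, offsets):
    """Translate each channel by its own (dx, dy), zero-filling borders."""
    out = []
    for ch, (dx, dy) in zip(x, offsets):
        h, w = len(ch), len(ch[0])
        out.append([[ch[i + dy][j + dx]
                     if 0 <= i + dy < h and 0 <= j + dx < w else 0.0
                     for j in range(w)] for i in range(h)])
    return out


def relu(x):
    """Rectified linear activation applied element-wise."""
    return [[[max(0.0, v) for v in row] for row in ch] for ch in x]


def csc_unit(x, w1, bn1, offsets, w2, bn2):
    """Conv 1x1 -> BN -> Shift -> Conv 1x1 -> BN -> ReLU."""
    y = conv1x1(x, w1)
    y = batch_norm(y, *bn1)
    y = shift(y, offsets)
    y = conv1x1(y, w2)
    y = batch_norm(y, *bn2)
    return relu(y)
```

Only the two 1x1 convolutions involve weight multiplications; the shift stage is pure data movement, which is why the unit maps onto a 1x1 convolution engine.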

Comparing FIG. 3 and FIG. 5, it can be seen that the network structure of the optimized pedestrian re-identification network model in this application is simple, consisting only of Conv 1x1 convolution, pooling and residual operations, so that only a few settings are required in the hardware design of the FPGA hardware architecture. The hardware structure is therefore simpler, and data can flow between the hardware modules at the maximum rate, so that the utilization of data on the hardware is maximized.

Therefore, the above-mentioned FPGA hardware architecture and data processing method can simplify the hardware configuration of the FPGA hardware architecture when processing image feature data, cooperate with software and hardware, and realize network acceleration of the FPGA hardware architecture with fewer resources.

In one embodiment, the pedestrian re-identification network model further includes a 3x3 convolutional network module connected to the forward hierarchical connection group; the 3x3 convolutional network module is used to process the first picture feature data with a plurality of separable 3x3 convolutional networks to obtain second picture feature data; and the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the second picture feature data through the plurality of first structural units.

In this embodiment, the picture feature data is obtained by performing picture feature extraction on the original picture. The existing pedestrian re-identification network model uses a 7x7 convolutional network to process picture feature data. The improved model of this application instead uses a plurality of separable 3x3 convolutional networks to process the first picture feature data and obtain the second picture feature data; that is, the 7x7 convolutional network in the original model is replaced by three separable 3x3 convolutional networks. Because the pedestrian re-identification network model based on contextual multi-scale feature learning already contains separable 3x3 convolutional networks, processing the first picture feature data with separable 3x3 convolutional networks and then feeding the resulting second picture feature data into the model means that the model no longer needs to handle 7x7 convolution and only has to handle separable 3x3 convolution. This reduces the processing load of the model and improves its efficiency in processing feature data.
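The application does not state why three 3x3 layers are chosen. One standard observation, offered here only as background, is that three stacked 3x3 convolutions cover the same 7x7 input window as a single 7x7 convolution. The short Python sketch below checks this arithmetic, assuming stride 1 and no dilation; the function name is hypothetical.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each layer widens the window by k - 1 pixels
    return rf


print(receptive_field([7]))        # single 7x7 layer
print(receptive_field([3, 3, 3]))  # three stacked 3x3 layers
```

Both calls print 7, so the stacked 3x3 layers see the same spatial extent while using only the 3x3 operator.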

In one embodiment, the pedestrian re-identification network model further includes a translation convolution module connected to the forward hierarchical connection group; the translation convolution module is used to process the first picture feature data through a translation operation and a 1x1 convolutional network to obtain third picture feature data; and the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the third picture feature data through the plurality of first structural units.

In this embodiment, the first picture feature data is processed through a translation operation and a 1x1 convolutional network. The translation operation and the 1x1 convolutional network replace the 7x7 convolutional network in the original pedestrian re-identification network model, and can likewise replace the separable 3x3 convolutional networks of the embodiment above. This replacement reduces the movement of feature map data. For example, in a 3x3 convolutional network each item of picture feature data needs to be used three times, and in the 7x7 convolutional network of the original model each item needs to be used seven times, whereas in a 1x1 convolutional network the picture feature data only needs to be moved once; this reduces the logic complexity and improves the calculation speed of the pedestrian re-identification network model. The translation operation moves pixels within a certain range of the picture feature data to the center position as the result, which reduces the number of multiplication operations. The replacement lowers precision, but it reduces the number of operations on the FPGA hardware architecture and simplifies the network structure design of the FPGA hardware architecture. The translation operation and the 1x1 convolutional network are described in detail below:
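To give a rough sense of the saving in multiplications, the following Python sketch counts multiply-accumulate operations for a standard (non-separable) k x k convolution and compares them with a 1x1 convolution of the same shape. The feature map size and channel counts are hypothetical values chosen for illustration, and the translation step itself is counted as zero multiplications because it only moves data.

```python
def conv_macs(height, width, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution, 'same' size output."""
    return height * width * c_in * c_out * k * k


# Hypothetical layer shape: 32x32 feature map, 64 input and 64 output channels.
base = conv_macs(32, 32, 64, 64, 1)
print(conv_macs(32, 32, 64, 64, 3) // base)  # 3x3 versus shift + 1x1
print(conv_macs(32, 32, 64, 64, 7) // base)  # 7x7 versus shift + 1x1
```

Under these assumptions the ratios are 9 and 49: the multiplication count scales with the square of the kernel size, which a shift followed by a 1x1 convolution avoids.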

The convolution performed by the translation operation is equivalent to translating the original input matrix in a certain direction, as shown in FIG. 6. A simple translation operation does not by itself appear to extract spatial information. However, the channel domain can be regarded as a hierarchical diffusion of spatial-domain information: by setting translation kernels with different directions, the tensor of the input picture feature data (such as the second picture feature data output by the separable 3x3 convolutional networks) is translated differently in different channels, and a following 1x1 convolutional network then fuses information across channels, so that information is extracted in both the spatial domain and the channel domain.

For a translation convolution kernel of size $k \times k$ on each channel, there are $k^2$ ($k$ represents an integer) possible translation directions. Assuming that there are M channels, there are $(k^2)^M$ possible translation options. Obviously, it is unrealistic to search by brute force for the most suitable translation option in such a space. Therefore, the M channels are divided into $k^2$ groups, and each group is called a shift group. The $\lfloor M/k^2 \rfloor$ channels of each group use the same translation option, that is, the same translation direction. Of course, M may not be evenly divisible by $k^2$. In that case, some channels cannot be assigned to any group. These channels are called the "centered" group, and the "centered" group does not perform translation operations. The input is thus grouped by channel, and each group of channels is translated in only one direction.

For example, suppose the convolution kernel is 3x3 and there are 64 channels in total. According to the above method, these channels are divided into 9 groups of 7 channels each, and the one remaining channel does not perform a translation operation. The 9 groups are assigned to the translation directions in sequence, and the 7 channels of each group share one translation parameter. The separable 3x3 convolutional network is thus replaced by a translation operation and a 1x1 convolutional network. The translation operation uses even grouping, and each group is assigned to a translation direction in order. The offsets dx and dy of the translation along the x-axis and y-axis are limited to the range [-1, 1]; a simple illustration is shown in Figure 7. Since Conv1x1 and the translation operation (shift) can be combined, no separate shift module is needed on the FPGA hardware architecture, which makes the hardware design more streamlined. The dx and dy position identifiers of the shift convolution kernel are shown in Figure 8.
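The grouping in the above example can be sketched as follows. This is a minimal illustration only; the function name `assign_shift_groups` and the row-by-row ordering of the translation directions are assumptions of the sketch, since the actual assignment order is defined by Figures 7 and 8.

```python
def assign_shift_groups(num_channels, kernel_size):
    """Return (channels per group, list of one (dx, dy) offset per channel)."""
    r = kernel_size // 2
    # All kernel_size**2 translation directions, listed row by row (assumed order).
    directions = [(dx, dy) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    per_group = num_channels // len(directions)

    offsets = []
    for direction in directions:
        # Every channel of a group shares the same translation parameter.
        offsets.extend([direction] * per_group)

    # Channels left over after even grouping form the "centered" group
    # and are not translated.
    offsets.extend([(0, 0)] * (num_channels - len(offsets)))
    return per_group, offsets

per_group, offsets = assign_shift_groups(num_channels=64, kernel_size=3)
print(per_group, len(offsets))  # 7 channels per group, 64 offsets in total
```

With 64 channels and a 3x3 kernel, 63 channels are distributed over the 9 directions and the last channel falls into the "centered" group.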

As shown in Figure 9, if the kernel size of the translation operation is 5, 5x5=25 kinds of offsets are generated, with dx and dy in the range [-2, 2]. Fewer channels then share each set of offset parameters, and the performance of the pedestrian re-identification network model can be further improved.

If the kernel size of the translation operation is 9, 9x9=81 kinds of offset parameters are generated, with dx and dy in the range [-4, 4], and only the [dx, dy] coordinates that fall within a suitable range are retained. A Gaussian filter algorithm is applied to these offset parameters, as follows:

Normalization is then performed to generate a probability value for each position, and the numpy.random.choice function is used to generate a random offset position [dx, dy] for each channel, which gives the translation offset of that channel. With receptive_field_radius=4.25, there are 61 offset parameter coordinates that meet the condition, as shown in Figure 10.
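The selection of offsets described above can be sketched as follows. This is a hedged illustration: the Gaussian weighting with standard deviation `sigma`, the default value of `sigma`, the function name `sample_offsets` and the fixed random seed are assumptions of the sketch. The sketch also uses NumPy's `Generator.choice`, which plays the same role as the `numpy.random.choice` function named in the text.

```python
import numpy as np

def sample_offsets(num_channels, kernel_size=9, receptive_field_radius=4.25,
                   sigma=2.0, seed=0):
    """Pick one (dx, dy) offset per channel with Gaussian-shaped probabilities."""
    r = kernel_size // 2
    # Keep only the offsets that lie inside the circular receptive field.
    coords = np.array([
        (dx, dy)
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if dx * dx + dy * dy <= receptive_field_radius ** 2
    ])
    # Assumed Gaussian weighting: offsets near the center are more likely.
    weights = np.exp(-(coords ** 2).sum(axis=1) / (2.0 * sigma ** 2))
    probs = weights / weights.sum()          # normalize to probabilities
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(coords), size=num_channels, p=probs)
    return coords, probs, coords[idx]

coords, probs, channel_offsets = sample_offsets(num_channels=64)
print(len(coords))  # 61 candidate offsets for radius 4.25
```

The count of 61 retained coordinates follows directly from the radius condition dx² + dy² ≤ 4.25² on the 9x9 grid and does not depend on the assumed Gaussian parameters.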

In one embodiment, the above-mentioned translational convolution module includes a second structural unit having the same structure as the first structural unit. The translational convolution module is used to process the first picture feature data through one second structural unit or a plurality of sequentially connected second structural units, and the third picture feature data is obtained after processing.

The translational convolution module further includes a pooling unit, and the pooling unit is located between the first second structural unit and the second second structural unit among the plurality of sequentially connected second structural units.

In this embodiment, the above-mentioned translational convolution module is used to process the first picture feature data through a translation operation and a 1x1 convolutional network. The first picture feature data may be processed by a second structural unit having the same structure as the first structural unit. Like the first structural unit, the second structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence. Through the second structural unit, the first picture feature data is processed by the translation operation and the 1x1 convolutional network. In addition, this application improves on the existing pedestrian re-identification network model. If the convolution stride of the first-layer 7x7 convolutional network in the existing pedestrian re-identification network model is 2, then in the improved pedestrian re-identification network model of this application a pooling unit is added between the first second structural unit and the second second structural unit among the plurality of sequentially connected second structural units. The pooling unit can be a maximum pooling unit or an average pooling unit. In other words, in order to keep the size of the intermediate feature maps of the entire network consistent with the original network, a maximum pooling layer is added after the first CSC structure. The details are shown in the table below.

Specifically, this application optimizes the CMSNet network structure so that the algorithm structure and the hardware characteristics are compatible, and maximizes the cost-effectiveness of the software and hardware while keeping network performance within a reasonable range. The optimized network structure is as follows:

In the above table, Input represents the input, Layer represents the network level of the CMSNet network, Cout represents the number of output channels, Kernel represents the convolution kernel, Stride represents the convolution stride, Number represents the number of repetitions, Params represents the number of parameters, and Flops represents the number of floating-point operations per second; CSC represents the above-mentioned first structural unit, max pool represents maximum pooling, CMS block represents the entire network module in Figure 5, conv2d represents two-dimensional convolution, average pool represents average pooling, global average pool represents global average pooling, and fc represents full connection.

As can be seen from the above table, the optimization of the CMSNet network structure not only uses CSC convolution units but also optimizes the CMS Block structure. The CMS Block structure is shown in Figure 5. Comparing Figure 3 and Figure 4, it can be seen that in the optimized CMS Block structure, DWConv is replaced with a CSC structure, and the CSC structure is composed of Conv1x1+Shift+Conv1x1+BN+Relu. After the CMSNet network structure is trained, BN is fused into Conv1x1 and solidified into the model file.
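The fusion of BN into Conv1x1 mentioned above can be illustrated by the standard batch normalization folding identity. The following is a minimal NumPy sketch under the assumption that the Conv1x1 layer has a bias and that BN uses fixed (inference-time) statistics; the names `fold_bn_into_conv1x1` and `conv1x1` are chosen for the sketch and do not come from the present application.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """x: (C_in, H, W); weight: (C_out, C_in); bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def fold_bn_into_conv1x1(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Return Conv1x1 weights and bias that also apply the BN transform."""
    scale = gamma / np.sqrt(var + eps)       # one scale factor per output channel
    return weight * scale[:, None], (bias - mean) * scale + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))
weight, bias = rng.standard_normal((3, 4)), rng.standard_normal(3)
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)
mean, var = rng.standard_normal(3), rng.random(3) + 0.5
eps = 1e-5

# Reference path: Conv1x1 followed by a separate BN step.
y = conv1x1(x, weight, bias)
reference = (gamma[:, None, None] * (y - mean[:, None, None])
             / np.sqrt(var[:, None, None] + eps) + beta[:, None, None])

# Fused path: a single Conv1x1 with folded parameters.
fused_weight, fused_bias = fold_bn_into_conv1x1(weight, bias, gamma, beta, mean, var, eps)
fused = conv1x1(x, fused_weight, fused_bias)
```

After folding, the hardware only needs to execute the 1x1 convolution, which is consistent with the model file containing Conv1x1 parameters only.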

It can be seen from the above table that the optimized network structure contains multiple first structural units. When the second picture feature data is input to the pedestrian re-identification network model, as shown in the above table, the second picture feature data is processed by the multiple sequentially connected first structural units in the pedestrian re-identification network model.

In one embodiment, the above-mentioned first picture feature data is feature data obtained after processing the original picture data by using an arbitrary bit quantization network DoReFa-Net.

After network optimization, the only weight parameters in the CMSNet network are Conv1x1 parameters, and the structure is uniform, which makes the network structure easier to quantize. In this embodiment, in order to further reduce the amount of network parameters, the quantization algorithm of the arbitrary bit quantization network DoReFa-Net is used to quantize the full-precision weights in the CMSNet network. This includes using the arbitrary bit quantization network to obtain the first picture feature data from the original picture data, and may also include using the arbitrary bit quantization network to quantize the first picture feature data, the second picture feature data or the third picture feature data before it is input to the pedestrian re-identification network model. Specifically, the quantization algorithm of the arbitrary bit quantization network DoReFa-Net is used to quantize the picture feature data that is input to the pedestrian re-identification network model for convolution processing.

In this embodiment, an 8-bit fixed-point representation can be used for quantization. As shown in FIG. 11, before quantizing to k=8 bits, the hyperbolic tangent function tanh is used to limit the weight w to [-1, 1]. Then

$$x = \frac{\tanh(w)}{2\max(|\tanh(w)|)} + \frac{1}{2}$$

constrains the value to [0, 1], where the maximum is taken over the weights of the entire layer. Then

$$q = \frac{1}{2^k - 1}\,\mathrm{round}\left((2^k - 1)\,x\right)$$

converts the floating-point number to a k=8-bit fixed-point number in the range [0, 1], and finally the weight is constrained to [-1, 1] through the mapping x0 = 2q - 1.
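The weight quantization steps described in this embodiment can be sketched as follows. This is a minimal illustration that assumes the commonly published DoReFa-Net weight quantization scheme; the function names `quantize_k` and `dorefa_quantize_weights` and the sample weights are assumptions of the sketch.

```python
import numpy as np

def quantize_k(x, k):
    """Quantize values in [0, 1] to k-bit fixed-point levels in [0, 1]."""
    levels = 2 ** k - 1
    return np.round(x * levels) / levels

def dorefa_quantize_weights(w, k=8):
    """Map full-precision layer weights to k-bit values in [-1, 1]."""
    t = np.tanh(w)                              # limit weights to [-1, 1]
    x = t / (2.0 * np.max(np.abs(t))) + 0.5     # rescale to [0, 1] per layer
    q = quantize_k(x, k)                        # k-bit fixed-point in [0, 1]
    return 2.0 * q - 1.0                        # map back to [-1, 1]

w = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
wq = dorefa_quantize_weights(w, k=8)
```

Each quantized weight corresponds to one of 2^k = 256 integer levels, so it can be stored as an 8-bit value on the hardware.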

The data processing method applied to the FPGA hardware architecture of the present application approaches the neural network acceleration hardware design and the network compression of the pedestrian re-identification network model from the perspective of hardware-network co-design, and takes the characteristics of the hardware into account as far as possible while compressing the network, so that the network model better fits the hardware architecture. This application designs a network that can be applied to the FPGA hardware architecture on the basis of the latest pedestrian re-identification network, CMSNet. Operations such as the 7x7 convolution, b-softmax and the fully-connected layer in the CMSNet network structure are not easy to implement on hardware. Low-frequency operations are therefore replaced or deleted, and the entire network structure is optimized while retaining most of the accuracy, which is more conducive to hardware implementation and speeds up network inference. Specifically, the data processing method applied to the FPGA hardware architecture provided by the above-mentioned embodiments has the following effects:

1. For the latest lightweight pedestrian re-identification network CMSNet, the idea of software and hardware co-design is applied. Three separable 3x3 convolutions replace the ordinary 7x7 convolution, which reduces the number of parameters and removes the rarely used ordinary 7x7 convolution, and this is conducive to hardware implementation. The translation operation and the 1x1 convolutional network (Shift+Conv1x1) then replace the separable 3x3 convolutional networks of the entire network. The CSS structure in the CMS Block is removed, and only the SUM operation is kept. Therefore, the optimized pedestrian re-identification network CMSNet retains only 1x1 convolution and a few other operations, which greatly simplifies the hardware design and achieves network acceleration with fewer resources.

2. The integration of the Conv1x1+Shift operations means that the hardware does not need a separate Shift module for the translation operation, which saves the resources that such a module would consume and reduces the transmission of data between modules.

3. The network structure of the optimized pedestrian re-identification network only includes Conv1x1 convolution, pooling and residual operations. In the hardware design, the main modules are the Conv1x1 convolution computing unit, the pooling unit and the residual unit. The hardware structure is simpler, and the data can flow between the hardware modules at the maximum rate, so that the data throughput on the hardware is maximized.

It should be understood that although the steps in the flow chart are displayed sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict restriction on the order in which these steps are executed, and they can be executed in other orders. Moreover, at least some of the steps in the flow chart may include multiple sub-steps or stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and they are not necessarily executed sequentially, but may be executed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.

In one embodiment, the present application also provides an FPGA hardware architecture. As shown in FIG. 1, the FPGA hardware architecture includes a central processing unit, a memory, a computing unit processing component, a pooling component, a residual component and a controller. The central processing unit is used to receive the picture feature data to be processed and store the picture feature data in the memory. One or more arithmetic logic unit (ALU) matrices are arranged in the computing unit processing component, and the computing unit processing component is used to read the picture feature data from the memory and perform 1x1 convolution processing on the picture feature data through the ALU matrix. The pooling component is used to perform pooling processing on the picture feature data output by the computing unit processing component. The residual component is used to perform residual accumulation processing on the picture feature data output by the pooling component and/or the picture feature data output by the computing unit processing component. The controller is used to control, according to the pedestrian re-identification network model, the computing unit processing component to read the picture feature data, weight data and translation parameters from the memory in order to perform 1x1 convolution processing, to control whether the pooling component performs pooling processing on the feature data output by the computing unit processing component, and to control whether the residual component reads picture feature data from the memory and whether it performs residual processing on the picture feature data input to the residual component. The structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. The first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module contains a summation unit.

In other embodiments, the pedestrian re-identification network model may further include the modules or units described in the embodiments corresponding to the above-mentioned data processing method applied to the FPGA hardware architecture. For details, please refer to the descriptions of the above-mentioned embodiments.

Specifically, a hardware structure diagram of one FPGA hardware architecture is given in FIG. 12. In FIG. 12, CPU represents the central processing unit, and DDR/DRAM represents double data rate synchronous dynamic random access memory/dynamic random access memory. When the FPGA hardware architecture shown in this hardware structure diagram executes the above-mentioned data processing method applied to the FPGA hardware architecture, the processing flow is as follows:

The central processing unit receives the picture feature data to be processed and stores the picture feature data in the DDR/DRAM. One or more arithmetic logic unit matrices are arranged in the computing unit processing component, and the computing unit processing component reads the picture feature data from the DDR/DRAM and performs 1x1 convolution processing on the picture feature data through the arithmetic logic unit matrix. The pooling component performs pooling processing on the picture feature data output by the computing unit processing component. The residual component performs residual accumulation processing on the picture feature data output by the pooling component and/or the picture feature data output by the computing unit processing component. According to the pedestrian re-identification network model, the controller controls the computing unit processing component to read the picture feature data, weight data and translation parameters from the DDR/DRAM in order to perform 1x1 convolution processing, controls whether the pooling component performs pooling processing on the feature data output by the computing unit processing component, and controls whether the residual component reads picture feature data from the DDR/DRAM and whether it performs residual processing on the picture feature data input to the residual component. The structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. The first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module contains a summation unit.

It can be seen that, based on the above-mentioned data processing method applied to the FPGA hardware architecture, when the neural network on the FPGA hardware architecture is used to accelerate the processing of picture feature data, the hardware configuration requirements for the FPGA hardware architecture are simple, and network acceleration on the FPGA hardware architecture can be implemented with fewer resources.
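Purely as an illustration of the processing flow above, the controller's per-layer decisions can be modelled in software as follows. This is a behavioural sketch and not a description of the hardware: the function names `max_pool2x2` and `run_layer`, the choice of 2x2 maximum pooling and the flags `do_pool` and `residual` are assumptions of the sketch, and the translation parameters are omitted for brevity.

```python
import numpy as np

def max_pool2x2(x):
    """2x2 maximum pooling with stride 2 on a (C, H, W) array."""
    c, h, w = x.shape
    cropped = x[:, : h // 2 * 2, : w // 2 * 2]
    return cropped.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def run_layer(x, weight, do_pool=False, residual=None):
    """Model one layer: compute component, optional pooling, optional residual.

    x: (C_in, H, W) feature data read from memory; weight: (C_out, C_in).
    """
    # Computing unit processing component: 1x1 convolution.
    y = np.einsum("oc,chw->ohw", weight, x)
    # Pooling component, enabled or bypassed by the controller.
    if do_pool:
        y = max_pool2x2(y)
    # Residual component: accumulate with feature data read back from memory.
    if residual is not None:
        y = y + residual
    return y

x = np.ones((2, 4, 4))
weight = np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]])
skip = np.ones((3, 2, 2))
out = run_layer(x, weight, do_pool=True, residual=skip)
```

In this model the two flags correspond to the controller deciding, layer by layer, whether the pooling component and the residual component take part in the data path.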

In one embodiment, the present application also provides a data processing device applied to the FPGA hardware architecture. As shown in FIG. 13, the data processing device includes an acquisition module 1302 and a processing module 1304. The acquisition module 1302 is used to acquire the first picture feature data to be processed. The processing module 1304 is used to input the first picture feature data into the pedestrian re-identification network model based on contextual multi-scale feature learning, and obtain the classification identification information output by the pedestrian re-identification network model. The structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence. The forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units. The first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module contains a summation unit.

For a specific definition of a data processing device applied to an FPGA hardware architecture, reference may be made to the above definition of a data processing method applied to an FPGA hardware architecture, which will not be repeated here. Each module in the above-mentioned data processing device applied to the FPGA hardware architecture can be fully or partially realized by software, hardware and combinations thereof. The above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 14 . The computer device includes a processor, memory, network interface and database connected by a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer readable instructions and a database. The internal memory provides an environment for the execution of the operating system and computer readable instructions in the non-volatile storage medium. The network interface of the computer equipment is used to connect with external equipment, so as to receive the information of external equipment. When the computer-readable instructions are executed by the processor, a data processing method applied to FPGA hardware architecture is implemented.

Those skilled in the art can understand that the structure shown in Figure 14 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer device to which the solution of this application is applied. A specific computer device may include more or fewer components than shown in the figure, may combine some components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the steps of the data processing method applied to the FPGA hardware architecture in any one of the above-mentioned embodiments are implemented.

In one embodiment, there is also provided a non-volatile computer-readable storage medium. The non-volatile computer-readable storage medium stores computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the steps of the data processing method applied to the FPGA hardware architecture in any one of the above embodiments can be implemented.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be completed by instructing related hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile computer-readable storage medium, and when executed, the computer-readable instructions may include the processes of the embodiments of the above-mentioned methods. Any reference to memory, storage, database or other media used in the various embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. To keep the description concise, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in a combination of these technical features, the combination should be considered to be within the scope described in this specification.

The above-mentioned embodiments only represent several implementation modes of the present application, and the description thereof is relatively specific and detailed, but it should not be construed as limiting the scope of the patent for the invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the scope of protection of the patent application should be based on the appended claims.

Claims

A data processing method applied to FPGA hardware architecture, characterized in that said method comprises:

Acquiring first picture feature data to be processed; and

Inputting the first picture feature data into a pedestrian re-identification network model based on contextual multi-scale feature learning, to obtain classification and identification information output by the pedestrian re-identification network model;

Wherein, the structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence, the forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units, the first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit.
The method according to claim 1, wherein the forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the first picture feature data through the plurality of first structural units of the forward hierarchical connection group, the backward hierarchical connection group is used to perform cross-scale information fusion, through the plurality of first structural units of the backward hierarchical connection group, on the information output by each of the first structural units in the forward hierarchical connection group, and the channel scale selection module is used to sum, through the summation unit, the information output by each of the first structural units in the backward hierarchical connection group to obtain the classification identification information.
The method according to claim 2, wherein the pedestrian re-identification network model further includes a 3x3 convolutional network module, and the 3x3 convolutional network module is connected to the forward hierarchical connection group;

The 3x3 convolutional network module is used to process the feature data of the first picture by using a plurality of separable 3x3 convolutional networks, and obtain the feature data of the second picture after processing; and

The forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the second picture feature data through a plurality of the first structural units.
The method according to claim 2, wherein the pedestrian re-identification network model further includes a translational convolution module, and the translational convolution module is connected to the forward hierarchical connection group;

The translation convolution module is used to process the feature data of the first picture through a translation operation and a 1x1 convolution network, and obtain the feature data of the third picture after processing; and

The forward hierarchical connection group is used to perform step-by-step inter-scale information fusion on the third picture feature data through a plurality of the first structural units.
The method according to claim 4, wherein the translational convolution module includes a second structural unit having the same structure as the first structural unit, and the translational convolution module is used to process the first picture feature data through one second structural unit or a plurality of sequentially connected second structural units, to obtain the third picture feature data after processing.
The method according to claim 5, wherein the translational convolution module further includes a pooling unit, and the pooling unit is located between the first second structural unit and the second second structural unit among the plurality of sequentially connected second structural units.
The method according to claim 1, wherein the first picture feature data is feature data obtained after processing original picture data with a quantization algorithm of an arbitrary bit quantization network DoReFa-Net.
A data processing device applied to FPGA hardware architecture, characterized in that said device comprises:

An acquisition module, configured to acquire the feature data of the first picture to be processed; and

A processing module, configured to input the first picture feature data into a pedestrian re-identification network model based on contextual multi-scale feature learning, and obtain classification and identification information output by the pedestrian re-identification network model;

Wherein, the structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence, the forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units, the first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit.
An FPGA hardware architecture, characterized in that the FPGA hardware architecture comprises a central processing unit, a memory, a computing unit processing component, a pooling component, a residual component and a controller;

The central processing unit is used to receive picture feature data to be processed, and store the picture feature data into the memory;

One or more arithmetic logic unit matrices are arranged in the computing unit processing component, and the computing unit processing component is used to read the picture feature data from the memory and perform 1x1 convolution processing on the picture feature data through the arithmetic logic unit matrix;

The pooling component is used to perform pooling processing on the picture feature data output by the computing unit processing component;

The residual component is configured to perform residual accumulation processing on the picture feature data output by the pooling component and/or the picture feature data output by the calculation unit processing component; and

The controller is used to control, according to the pedestrian re-identification network model, the computing unit processing component to read picture feature data, weight data and translation parameters from the memory in order to perform 1x1 convolution processing, to control whether the pooling component performs pooling processing on the feature data output by the computing unit processing component, and to control whether the residual component reads picture feature data from the memory and whether it performs residual processing on the picture feature data input to the residual component;

Wherein, the structural block of the pedestrian re-identification network model includes a forward hierarchical connection group, a backward hierarchical connection group and a channel scale selection module connected in sequence, the forward hierarchical connection group and the backward hierarchical connection group each contain a plurality of first structural units, the first structural unit includes a first 1x1 convolutional network, a first batch normalization network, a first translation network, a second 1x1 convolutional network, a second batch normalization network and a linear activation function network connected in sequence, and the channel scale selection module includes a summation unit.
A computer device, characterized by comprising a memory and one or more processors, wherein computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the one or more processors, the one or more processors execute the steps of the method according to any one of claims 1-7.
One or more non-transitory computer-readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the method according to any one of claims 1-7.