CN112132746A - Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment

Info

Publication number: CN112132746A (application CN202010982493.9A); granted as CN112132746B
Authority: CN (China)
Prior art keywords: layer, resolution, network, output
Legal status: Granted
Application number: CN202010982493.9A
Other languages: Chinese (zh)
Other versions: CN112132746B
Inventors: 李旭 (Li Xu), 朱建潇 (Zhu Jianxiao), 赵琬婷 (Zhao Wanting), 徐启敏 (Xu Qimin)
Assignee (original and current): Southeast University
Application filed by Southeast University
Priority to CN202010982493.9A (filed 2020-09-17)
Publication of CN112132746A: 2020-12-25
Application granted; publication of CN112132746B: 2022-11-11
Legal status: Active


Classifications

    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution (G06T3/40 Scaling the whole image or part thereof; G06T3/00 Geometric image transformation in the plane of the image)
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T2207/10004 Still image; Photographic image (G06T2207/10 Image acquisition modality)
    • G06T2207/20081 Training; Learning (G06T2207/20 Special algorithmic details)
    • G06T2207/20084 Artificial neural networks [ANN]

Abstract

The invention discloses a rapid super-resolution method for small-scale pedestrian targets oriented to intelligent roadside equipment, comprising the following steps: collecting and constructing a training set of paired high- and low-resolution small-scale pedestrian images; based on the adversarial concept, building a lightweight generation network for low-resolution small-scale pedestrian images, which first extracts primary image features with separable convolutions, then fits high-frequency information with residual modules, and finally reconstructs a high-resolution image from the low-resolution pedestrian image through a pixel-shuffle module; building a discrimination network and adversarially training the parameters of the generation network to obtain the optimal generation network; and applying the optimal generation network to low-resolution small-scale pedestrian pictures to obtain high-resolution pedestrian targets. The lightweight super-resolution generation network designed by the invention has the notable advantages of short training time and low inference latency, and fills the gap in real-time super-resolution of small-scale pedestrians in the intelligent roadside field.

Description

Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment
Technical Field
The invention belongs to the fields of computer vision and intelligent transportation, and relates to a super-resolution method for small-scale pedestrian targets in intelligent-traffic roadside scene images, in particular to a rapid super-resolution method for small-scale pedestrian targets oriented to intelligent roadside equipment.
Background
With the rapid growth of road traffic in China, traffic accidents between pedestrians and vehicles occur frequently; vehicle safety performance keeps improving, yet dedicated safety equipment for pedestrians remains lacking. To ensure basic pedestrian safety, intelligent roadside systems that use electronic information technology to warn surrounding drivers or intelligent vehicles about pedestrians have become a research focus at home and abroad. Accurate recognition of small-scale pedestrians is a precondition for the rapid response of such safety early-warning systems; however, the visual features of small-scale pedestrian targets are sparse, and pedestrian detection algorithms alone can hardly guarantee effective detection, so super-resolution methods have attracted wide attention as a way to improve the recognition of small-scale targets.
General super-resolution methods fall into three main categories. The first is interpolation-based methods, such as nearest-neighbor, bilinear, and bicubic interpolation; they are computationally cheap but ignore problems common in the traffic field, such as motion blur. The second is reconstruction-based methods, mainly represented by iterative back-projection, projection onto convex sets, and maximum a posteriori estimation; they improve considerably on interpolation for single-scene reconstruction, but for complex backgrounds the peak signal-to-noise ratio can still be too low. The third is learning-based methods, mainly represented by deep learning and sparse coding; they adapt well to varied scenes and extract robust features, which fits the super-resolution requirements of small-scale pedestrian targets under intelligent roadside viewing angles.
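For scale, the interpolation baseline of the first category is a one-liner in modern frameworks. The sketch below up-samples an illustrative tensor by bicubic interpolation at the 4x factor used later in this document (the tensor name and shape are illustrative only):

```python
import torch
import torch.nn.functional as F

lr = torch.rand(1, 3, 128, 128)  # illustrative low-resolution pedestrian crop
# Bicubic x4 baseline: cheap, but it blurs the high-frequency detail that the
# learning-based network described below is designed to restore.
hr_bicubic = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
```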
However, existing deep-learning super-resolution methods all suffer from excessively long execution times caused by complex network structures, and are difficult to use directly in intelligent roadside equipment with strict real-time requirements. Designing a rapid super-resolution method for small-scale pedestrian targets has therefore become a core link in advancing intelligent roadside equipment and safeguarding pedestrian lives.
Disclosure of Invention
To solve the above problems, the invention discloses a rapid super-resolution method for small-scale pedestrian targets oriented to intelligent roadside equipment, which effectively fills the current gap in real-time super-resolution of small-scale pedestrians in the intelligent roadside field and further improves the functionality and intelligence of intelligent roadside equipment.
In order to achieve the purpose, the invention provides the following technical scheme:
a small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment comprises the following steps:
(1) Collect high-resolution small-scale pedestrian images covering various intelligent roadside scenes, obtain a low-resolution small-scale pedestrian image set by down-sampling, and use the correspondence between the high- and low-resolution images to construct a small-scale pedestrian multi-resolution image training data set with sample size N.
(2) Design the generation network of the rapid super-resolution network for small-scale pedestrian targets based on the adversarial concept. First, the semantic features of small-scale targets in the low-resolution sample picture are preliminarily extracted through lightweight convolution structures such as separable convolution and feature-map compression. Second, residual structures are stacked into residual blocks that serve as estimation units for the high-frequency information of the high-resolution sample, so that the high-frequency difference between low- and high-resolution samples can be fitted without introducing excessive parameters. The output of the residual blocks then feeds into a feature-compression convolution that reduces the feature dimensionality and preserves the real-time performance of the algorithm. Next, element-wise addition introduces a lateral connection, forming a double feed-forward structure; this effectively avoids gradient vanishing during back-propagation and markedly reduces the number of training iterations of the whole generation network. Third, the feature map is up-sampled with pixel-shuffle layers, avoiding the jagged edges produced by linear or bilinear sampling and yielding a higher-quality high-resolution feature map. Finally, a high-resolution pedestrian picture is generated through a separable convolution structure. The network structure of this part is designed as follows:
Layer 1, input layer: 3 input channels, resolution A × A; output an A × A × 3 feature map.
Layer 2, feature extraction layer: 64 convolution kernels of size 7 × 1, stride 1; output an A × A × 64 feature map.
Layer 3, feature extraction layer: 64 convolution kernels of size 1 × 7, stride 1; output an A × A × 64 feature map.
Layer 4, feature extraction layer: 256 convolution kernels of size 3 × 3, stride 1; output an A × A × 256 feature map.
Layer 5, feature extraction layer: 128 convolution kernels of size 3 × 3, stride 1; output an A × A × 128 feature map.
Layer 6, generator residual block: a convolution layer with 256 kernels of size 3 × 3, stride 1, output A × A × 256; a batch normalization layer, output A × A × 256; a PReLU activation layer, output A × A × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output A × A × 128; a batch normalization layer, output A × A × 128; element-wise addition of the residual block's input feature map and the output of its last batch normalization layer, output A × A × 128.
Layer 7, generator residual block (identical in structure to layer 6): a convolution layer with 256 kernels of size 3 × 3, stride 1, output A × A × 256; a batch normalization layer, output A × A × 256; a PReLU activation layer, output A × A × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output A × A × 128; a batch normalization layer, output A × A × 128; element-wise addition of the block's input feature map and the output of its last batch normalization layer, output A × A × 128.
Layer 8, generator convolution layer: 128 convolution kernels of size 3 × 3, stride 1; output an A × A × 128 feature map.
Layer 9, generator lateral connection layer: element-wise addition of the output feature map of the layer-8 convolution and the input feature map of the layer-6 residual block; output an A × A × 128 feature map.
Layer 10, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer (PixelShuffle); output a 2A × 2A × 128 feature map.
Layer 11, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer (PixelShuffle); output a 4A × 4A × 128 feature map.
Layer 12, generator convolution layer: 3 separable convolution kernels of size 9 × 9, stride 1; output a 4A × 4A × 3 high-resolution pedestrian picture.
(3) Based on the adversarial concept, the invention designs the discrimination network of the rapid super-resolution network for small-scale pedestrian targets. It extracts multiple semantic features of the target by means of the feature extraction structure of the InceptionV2 network, then introduces a fully connected layer with 2 output classes to further extract the semantic features, and finally normalizes the output of the fully connected layer to 0-1 with a sigmoid activation function, outputting a truthfulness estimate p for a given input picture. These structures are integrated into a discrimination network $D_\theta$ that judges whether a generated high-resolution sample estimate is real or fake, where $\theta$ denotes the parameters of the discrimination network. The network structure is as follows:
Layer 1, input layer: 3 input channels, resolution 4A × 4A; output a 4A × 4A × 3 feature map.
Layer 2, feature extraction layer: the InceptionV2 feature extraction layers are selected as the structure of this layer; output an (A/4) × (A/4) × 256 feature map.
Layer 3, fully connected layer: the three-dimensional input is flattened to one dimension; 2 output classes.
Layer 4, normalization layer: the output of the layer-3 fully connected layer is normalized with a sigmoid function; 2 output classes.
(4) Train the network model for the designed generation network and discrimination network. First, a low-resolution sample $p_k^{LD}$ is used as the input of the generation network $G_\omega$, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained by forward propagation through the successive convolutions; the content loss $L_{con}$ between the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$ is computed. Next, the real/fake label of the high-resolution sample estimate is set to 0 and that of the high-resolution sample $p_k^{HD}$ is set to 1, giving the label value $y_k$; the discrimination network estimates the truthfulness of the inferred high-resolution picture and the original high-resolution picture, yielding the estimate $\hat{y}_k$, and the binary cross-entropy loss $L_{cro}$ between the estimate $\hat{y}_k$ and the label value $y_k$ is computed. Finally, the gradients of the two loss values are back-propagated. The detailed steps of this section are as follows:
Substep 1: compute the forward propagation. A low-resolution sample $p_k^{LD}$ is used as the input of the generation network, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained through the successive convolution operations. The real/fake label $y_k$ of the high-resolution sample estimate $\hat{p}_k^{HD}$ is set to 0 and that of the corresponding high-resolution sample $p_k^{HD}$ is set to 1. The discrimination network performs truthfulness estimation on the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$, yielding the estimate $\hat{y}_k$.
Substep 2: compute the loss values. The loss value of the discrimination network is the binary cross-entropy $L_{cro}$ between the truthfulness estimate $\hat{y}_k$ and the real/fake label value $y_k$; the specific formula is:
$$L_{cro} = -\left[ y_k \log \hat{y}_k + (1 - y_k)\log(1 - \hat{y}_k) \right]$$
The loss value of the generation network is the mean squared error $L_{con}$ between the tail-layer feature maps, extracted by the feature extraction layers of the discriminator $D_\theta$, of the high-resolution sample estimate $\hat{p}_k^{HD}$ and the corresponding high-resolution sample $p_k^{HD}$; the specific formula is:
$$L_{con} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left( \Phi(p_k^{HD})_{x,y} - \Phi(\hat{p}_k^{HD})_{x,y} \right)^2$$
where $\Phi(x)$ denotes the tail-layer feature map extracted by the feature extraction layers of the discriminator $D_\theta$ for a given input sample $x$, and $W$ and $H$ denote the width and height of the extracted feature map, respectively.
Substep 3: perform gradient back-propagation, and save the parameter values of the generation network and the discrimination network at each iteration.
Substep 4: select the network parameters with the lowest sum of generation-network and discrimination-network losses over the iterations of substep 3 as the optimal network parameters; the network model corresponding to the optimal network parameters is the optimal model.
(5) Perform the super-resolution operation on small-scale pedestrian targets under the intelligent roadside viewing angle using the generation network in the optimal model output by step (4).
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The invention provides a small-scale pedestrian image target super-resolution method suited to the intelligent roadside field.
2. The small-scale pedestrian super-resolution method based on the generative adversarial network structure effectively overcomes the drawbacks of general-purpose super-resolution neural networks, namely difficult training and heavy inference time cost; it has the notable advantages of low computation latency and high output frequency, and the output super-resolution results retain good peak signal-to-noise ratio and structural similarity.
Drawings
Fig. 1 is a schematic diagram of a generation network structure of a small-scale pedestrian target rapid super-resolution method designed by the invention.
Fig. 2 is a schematic diagram of a discrimination network structure of the small-scale pedestrian target rapid super-resolution method designed by the invention.
FIG. 3 shows the general steps of the rapid super-resolution method for small-scale pedestrian targets designed by the present invention.
Fig. 4 compares the effect of the method designed by the present invention with that of the typical super-resolution network SRGAN.
Detailed Description
The technical solutions provided by the present invention will be described in detail below with reference to specific examples, and it should be understood that the following specific embodiments are only illustrative of the present invention and are not intended to limit the scope of the present invention.
In the rapid super-resolution method for small-scale pedestrian targets oriented to intelligent roadside equipment, a small-scale pedestrian multi-resolution training data set of intelligent roadside scenes is constructed to provide data support for the network model. Combining the generative adversarial network concept, a lightweight super-resolution generation network and a discrimination network suited to real-time roadside scenes are designed to address the specific problems of traditional super-resolution network models, such as long execution time and heavy training cost. Finally, the generation network is trained with a combination of a feature-map loss function and a real/fake discrimination loss function to obtain the optimal model of the rapid super-resolution network for small-scale pedestrian targets. Compared with other traditional super-resolution networks, the method designed by the invention has core advantages in two respects: overall model characteristics and scene adaptability. In terms of model characteristics, the main body of the method consists of lightweight structures such as residual blocks and separable convolutions, with the notable features of few parameters and short execution time. In terms of scene adaptability, traditional pedestrian super-resolution network models are difficult to apply directly to intelligent roadside scenes; by constructing a targeted small-scale roadside pedestrian data set, the method lowers the transfer-learning cost of the algorithm model and strengthens its scene adaptability. The specific process of the invention comprises the following steps:
(1) Construct a small-scale pedestrian multi-resolution image training data set covering various intelligent roadside scenes. Compared with traditional super-resolution data sets, the roadside-scene small-scale pedestrian multi-resolution training data set has more complex constraints. First, a pedestrian target whose pixel height is smaller than a pixel threshold $H$ is defined as a small-scale pedestrian; for actual traffic scenes, the threshold $H$ is chosen as 100 pixels. Then, pedestrian target pictures with a resolution of 1920 × 1080 covering different intelligent roadside scenes are collected, and regions containing small-target pedestrians are cropped as high-resolution samples; the cropped region resolution is set to 512 × 512 and the cropping is centered on the target, forming a high-resolution sample set $\{p_k^{HD}\}_{k=1}^{N}$ of size $N$. Fully considering the influence of the data scale on the designed method, the picture acquisition cost, and other factors, $N$ is chosen as 5000. After max-value down-sampling with stride 4 is applied to the samples of the high-resolution set, a low-resolution sample set $\{p_k^{LD}\}_{k=1}^{N}$ with one-to-one sample correspondence is formed, where $p_k$ refers to the $k$-th of the $N$ sample pictures, HD refers to a sample resolution of 512 × 512, and LD refers to a sample resolution of 128 × 128. Using the down-sampling correspondence between arbitrary samples $p_k^{HD}$ and $p_k^{LD}$ of the high- and low-resolution sample sets, sample pairs $(p_k^{HD}, p_k^{LD})$ are formed; the $N$ sample pairs together form the small-scale pedestrian multi-resolution image training data set $S_{Train}$.
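A minimal sketch of this pairing procedure follows, assuming frames are loaded as 3 × 1080 × 1920 tensors and pedestrian centers (cx, cy) come from annotation; the function and variable names are illustrative, not prescribed by the patent:

```python
import torch
import torch.nn.functional as F

def make_sample_pair(frame: torch.Tensor, cx: int, cy: int, hd: int = 512):
    """Crop a target-centered HR sample around a small-scale pedestrian at
    (cx, cy) and build its LR counterpart; frame is a 3 x 1080 x 1920 tensor."""
    # Clamp the crop window so the 512 x 512 region stays inside the frame.
    x0 = max(0, min(cx - hd // 2, frame.shape[2] - hd))
    y0 = max(0, min(cy - hd // 2, frame.shape[1] - hd))
    p_hd = frame[:, y0:y0 + hd, x0:x0 + hd]                 # 3 x 512 x 512
    # Max-value down-sampling with stride 4 -> 3 x 128 x 128 LR sample.
    p_ld = F.max_pool2d(p_hd.unsqueeze(0), kernel_size=4, stride=4).squeeze(0)
    return p_hd, p_ld
```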
(2) Design the generation network of the rapid super-resolution network for small-scale pedestrian targets based on the adversarial concept. The role of the generation network is, for a given low-resolution sample $p_k^{LD}$, to generate a high-quality estimate $\hat{p}_k^{HD}$ of the corresponding high-resolution sample $p_k^{HD}$. This requires the designed generation network to fully retain the semantic information of the low-resolution sample while being able to supplement high-frequency information such as brightness, texture, and edges. Because of these complex requirements, most existing super-resolution networks rely on enormous parameter counts and are hard to deploy in the intelligent roadside field. Considering the above requirements and practical problems, the invention weighs factors such as generated-picture similarity, feature-map similarity, and algorithm execution time, and designs the following generation network $G_\omega$, where $\omega$ denotes the parameters of the generation network. First, the semantic features of the target are preliminarily extracted through lightweight convolution structures such as separable convolution and feature-map compression. Second, residual structures are stacked into residual blocks that serve as estimation units for the high-frequency information of the image, fitting the high-frequency difference between low- and high-resolution samples without introducing excessive parameters. The output of the residual blocks then feeds into a feature-compression convolution that reduces the feature dimensionality and preserves the real-time performance of the algorithm. Next, element-wise addition introduces a lateral connection, forming a double feed-forward structure; this effectively avoids gradient vanishing during back-propagation and markedly reduces the number of training iterations of the whole generation network. The feature map is then up-sampled with pixel-shuffle layers, and finally a high-resolution pedestrian picture is generated through a separable convolution structure. The network structure of this part is designed as follows (a code sketch follows the layer list):
Layer 1, input layer: 3 input channels, resolution 128 × 128; output a 128 × 128 × 3 feature map.
Layer 2, feature extraction layer: 64 convolution kernels of size 7 × 1, stride 1; output a 128 × 128 × 64 feature map.
Layer 3, feature extraction layer: 64 convolution kernels of size 1 × 7, stride 1; output a 128 × 128 × 64 feature map.
Layer 4, feature extraction layer: 256 convolution kernels of size 3 × 3, stride 1; output a 128 × 128 × 256 feature map.
Layer 5, feature extraction layer: 128 convolution kernels of size 3 × 3, stride 1; output a 128 × 128 × 128 feature map.
Layer 6, generator residual block: a convolution layer with 256 kernels of size 3 × 3, stride 1, output 128 × 128 × 256; a batch normalization layer, output 128 × 128 × 256; a PReLU activation layer, output 128 × 128 × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output 128 × 128 × 128; a batch normalization layer, output 128 × 128 × 128; element-wise addition of the residual block's input feature map and the output of its last batch normalization layer, output 128 × 128 × 128.
Layer 7, generator residual block (identical in structure to layer 6): a convolution layer with 256 kernels of size 3 × 3, stride 1, output 128 × 128 × 256; a batch normalization layer, output 128 × 128 × 256; a PReLU activation layer, output 128 × 128 × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output 128 × 128 × 128; a batch normalization layer, output 128 × 128 × 128; element-wise addition of the block's input feature map and the output of its last batch normalization layer, output 128 × 128 × 128.
Layer 8, generator convolution layer: 128 convolution kernels of size 3 × 3, stride 1; output a 128 × 128 × 128 feature map.
Layer 9, generator lateral connection layer: element-wise addition of the output feature map of the layer-8 convolution and the input feature map of the layer-6 residual block; output a 128 × 128 × 128 feature map.
Layer 10, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer (PixelShuffle); output a 256 × 256 × 128 feature map.
Layer 11, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer (PixelShuffle); output a 512 × 512 × 128 feature map.
Layer 12, generator convolution layer: 3 separable convolution kernels of size 9 × 9, stride 1; output a 512 × 512 × 3 high-resolution pedestrian picture.
(3) Based on the adversarial concept, design the discrimination network of the rapid super-resolution network for small-scale pedestrian targets to judge whether a high-resolution sample $p_k^{HD}$ or the high-resolution sample estimate $\hat{p}_k^{HD}$ generated in step (2) is real or fake. The role of the discrimination network is to introduce a real/fake supervision signal into the generation network in the form of an additional label, guiding it to generate more realistic high-resolution pictures; a traditional large-scale discrimination network brings better supervision performance, but also enormous parameters and training time, and easily causes model divergence. Compared with other typical backbone networks (such as ResNet and DenseNet), the InceptionV2 network has the particular advantages of a moderate parameter count and strong extraction of low-level image features. Therefore, after extracting multiple semantic features of the target by means of the feature extraction structure of the InceptionV2 network, a fully connected layer with 2 output classes is introduced to further condense the semantic features with high precision; finally, the output of the fully connected layer is normalized to 0-1 with a sigmoid activation function, outputting a truthfulness estimate p for a given input picture. These structures are integrated into a discrimination network $D_\theta$ that judges whether a generated high-resolution sample estimate is real or fake, where $\theta$ denotes the parameters of the discrimination network. The network structure is as follows (a code sketch follows the layer list):
Layer 1, input layer: 3 input channels, resolution 512 × 512; output a 512 × 512 × 3 feature map.
Layer 2, feature extraction layer: the InceptionV2 feature extraction layers are selected as the structure of this layer; output a 32 × 32 × 256 feature map.
Layer 3, fully connected layer: the three-dimensional input is flattened to one dimension, i.e. 262144 inputs; 2 output classes.
Layer 4, normalization layer: the output of the layer-3 fully connected layer is normalized with a sigmoid function; 2 output classes.
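The discriminator head reduces to a few lines once the trunk is fixed. In the sketch below the InceptionV2 trunk is passed in as a constructor argument, since common frameworks do not ship a module under that exact name; any feature extractor mapping a 3 × 512 × 512 input to a 256 × 32 × 32 map (the shape stated above) can stand in. A sigmoid over 2 output classes follows the patent's description, although a softmax would be the more conventional choice for a two-class head:

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Layers 3-4 of the patent's discriminator on top of a pluggable trunk."""
    def __init__(self, trunk: nn.Module):
        super().__init__()
        self.trunk = trunk                       # maps 3x512x512 -> 256x32x32
        self.fc = nn.Linear(256 * 32 * 32, 2)    # 262144 inputs, 2 output classes
        self.sigmoid = nn.Sigmoid()              # 0-1 truthfulness normalization

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.trunk(x)
        return self.sigmoid(self.fc(feat.flatten(1)))
```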
(4) Train the network model for the designed generation network $G_\omega$ and discrimination network $D_\theta$. The training loss functions of existing super-resolution methods based on generative adversarial networks emphasize hand-chosen characteristics of the generated and original targets across all dimensions, and such manual feature selection is unsuited to complex intelligent roadside scenes; for this problem, the invention designs a loss function based on feature-map differences to train the network model. Specifically, first, a low-resolution sample $p_k^{LD}$ is used as the input of the generation network $G_\omega$, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained by forward propagation through the successive convolutions; the content loss $L_{con}$ between the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$ is computed. Next, the real/fake label of the high-resolution sample estimate is set to 0 and that of the high-resolution sample $p_k^{HD}$ is set to 1, giving the label value $y_k$; the discrimination network estimates the truthfulness of the inferred high-resolution picture and the original high-resolution picture, yielding the estimate $\hat{y}_k$, and the binary cross-entropy loss $L_{cro}$ between the estimate $\hat{y}_k$ and the label value $y_k$ is computed. Finally, the gradients of the two loss values are back-propagated. The detailed steps of this section are as follows:
Substep 1: compute the forward propagation. A low-resolution sample $p_k^{LD}$ is used as the input of the generation network, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained through the successive convolution operations. The real/fake label $y_k$ of the high-resolution sample estimate $\hat{p}_k^{HD}$ is set to 0 and that of the corresponding high-resolution sample $p_k^{HD}$ is set to 1. The discrimination network performs truthfulness estimation on the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$, yielding the estimate $\hat{y}_k$.
Substep 2: compute the loss values. The loss value of the discrimination network is the binary cross-entropy $L_{cro}$ between the truthfulness estimate $\hat{y}_k$ and the real/fake label value $y_k$; the specific formula is:
$$L_{cro} = -\left[ y_k \log \hat{y}_k + (1 - y_k)\log(1 - \hat{y}_k) \right]$$
The loss value of the generation network is the mean squared error $L_{con}$ between the tail-layer feature maps, extracted by the feature extraction layers of the discriminator $D_\theta$, of the high-resolution sample estimate $\hat{p}_k^{HD}$ and the corresponding high-resolution sample $p_k^{HD}$; the specific formula is:
$$L_{con} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left( \Phi(p_k^{HD})_{x,y} - \Phi(\hat{p}_k^{HD})_{x,y} \right)^2$$
where $\Phi(x)$ denotes the tail-layer feature map extracted by the feature extraction layers of the discriminator $D_\theta$ for a given input sample $x$, and $W$ and $H$ denote the width and height of the extracted feature map, respectively.
Substep 3: perform gradient back-propagation. Back-propagation uses stochastic gradient descent with a learning rate of 0.0001, and the number of iterations C of the iterative process is 10000. During the iterations, the training parameters of the generation network and the discrimination network are corrected by gradient back-propagation, and the network parameters at each iteration are stored as $Net_e$, where $Net$ denotes the parameters comprising the generation and discrimination networks and $e$ is the current iteration number.
Substep 4: select the network parameters with the lowest sum of generation-network and discrimination-network losses over the iterations of substep 3 as the optimal network parameters; the network model corresponding to the optimal network parameters is the optimal model.
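Substeps 1-4 combine into the short training sketch below, using the loss forms written above. Here `phi` denotes the discriminator's feature-extraction trunk, per-sample SGD at learning rate 0.0001 follows substep 3, and the best-model bookkeeping follows substep 4; batching and data ordering are assumptions, as the patent does not specify them:

```python
import copy
import torch
import torch.nn.functional as F

def discrimination_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # L_cro: binary cross-entropy between truthfulness estimates and labels y_k
    return F.binary_cross_entropy(y_hat, y)

def content_loss(phi, p_hd: torch.Tensor, p_hd_hat: torch.Tensor) -> torch.Tensor:
    # L_con: MSE between the tail-layer feature maps Phi(x) of the trunk
    return F.mse_loss(phi(p_hd_hat), phi(p_hd))

def train(gen, disc, phi, pairs, iters=10000, lr=1e-4):
    opt_g = torch.optim.SGD(gen.parameters(), lr=lr)   # substep 3: SGD, lr 0.0001
    opt_d = torch.optim.SGD(disc.parameters(), lr=lr)
    best_loss, best_state = float("inf"), None
    for e in range(iters):                             # C = 10000 iterations
        p_hd, p_ld = pairs[e % len(pairs)]
        p_hd, p_ld = p_hd.unsqueeze(0), p_ld.unsqueeze(0)
        # Substep 1: forward propagation and truthfulness estimation.
        p_hd_hat = gen(p_ld)
        y_hat_fake = disc(p_hd_hat.detach())           # label y_k = 0
        y_hat_real = disc(p_hd)                        # label y_k = 1
        # Substep 2: the two loss values.
        l_cro = discrimination_loss(y_hat_fake, torch.zeros_like(y_hat_fake)) \
              + discrimination_loss(y_hat_real, torch.ones_like(y_hat_real))
        l_con = content_loss(phi, p_hd, p_hd_hat)
        # Substep 3: gradient back-propagation for both networks.
        opt_d.zero_grad(); l_cro.backward(); opt_d.step()
        opt_g.zero_grad(); l_con.backward(); opt_g.step()
        # Substep 4: keep the parameters with the lowest summed loss.
        total = (l_cro + l_con).item()
        if total < best_loss:
            best_loss, best_state = total, copy.deepcopy(gen.state_dict())
    return best_state
```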
(5) Perform the super-resolution operation on small-scale pedestrian targets under the intelligent roadside viewing angle using the generation network in the optimal model.
(6) To verify the effectiveness of the designed network, the typical super-resolution network SRGAN (Takano N., Alaghband G., "SRGAN: Training Dataset Matters," 2019) is selected as the reference and compared with the network model designed by the invention on the two most typical metrics, peak signal-to-noise ratio and structural similarity, under the same training data and training conditions; the comparison is shown in Table 1. A higher peak signal-to-noise ratio means less image distortion, and a higher structural similarity means the image is closer to the ground truth. The method designed by the invention greatly shortens the running time at the cost of a slight loss in peak signal-to-noise ratio and structural similarity, and fully meets the real-time application requirements of the intelligent roadside field. Fig. 4 illustrates the super-resolution comparison between the invention and SRGAN.
Table 1. Evaluation results for pedestrian images generated by the small-scale pedestrian super-resolution network designed by the invention and by SRGAN. (The table is reproduced only as an image in the original publication.)
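For reference, the peak signal-to-noise ratio used in Table 1 can be computed as follows (a sketch assuming images normalized to [0, 1]; structural similarity requires a windowed computation and is omitted here):

```python
import torch

def psnr(img: torch.Tensor, ref: torch.Tensor, peak: float = 1.0) -> torch.Tensor:
    """PSNR in dB; higher values mean less distortion relative to the ground truth."""
    mse = torch.mean((img - ref) ** 2)
    return 10.0 * torch.log10(peak ** 2 / mse)
```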

Claims (1)

1. A small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment is characterized by comprising the following steps:
(1) collecting high-resolution small-scale pedestrian images containing various intelligent roadside scenes, obtaining a low-resolution small-scale pedestrian image set by using a down-sampling method, and constructing a small-scale pedestrian multi-resolution image training data set with the sample size of N by using the corresponding relation of high-resolution images and low-resolution images;
(2) designing a generation network of the small-scale pedestrian target rapid super-resolution network based on the adversarial concept; firstly, preliminarily extracting the semantic features of small-scale targets in the low-resolution sample picture through lightweight convolution structures such as separable convolution and feature-map compression; secondly, stacking residual structures to form residual blocks used as estimation units for the high-frequency information of the high-resolution sample; further, feeding the output of the residual blocks into a feature-compression convolution; then, introducing a lateral connection by element-wise addition to form a double feed-forward structure; thirdly, up-sampling the feature map with pixel-shuffle layers, thereby obtaining a higher-quality high-resolution feature map; and finally, generating a high-resolution pedestrian picture through a separable convolution structure, wherein the network structure of this part is designed as follows:
layer 1, input layer: 3 input channels, resolution A × A; output an A × A × 3 feature map;
layer 2, feature extraction layer: 64 convolution kernels of size 7 × 1, stride 1; output an A × A × 64 feature map;
layer 3, feature extraction layer: 64 convolution kernels of size 1 × 7, stride 1; output an A × A × 64 feature map;
layer 4, feature extraction layer: 256 convolution kernels of size 3 × 3, stride 1; output an A × A × 256 feature map;
layer 5, feature extraction layer: 128 convolution kernels of size 3 × 3, stride 1; output an A × A × 128 feature map;
layer 6, generator residual block: a convolution layer with 256 kernels of size 3 × 3, stride 1, output A × A × 256; a batch normalization layer, output A × A × 256; a PReLU activation layer, output A × A × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output A × A × 128; a batch normalization layer, output A × A × 128; element-wise addition of the residual block's input feature map and the output of its last batch normalization layer, output A × A × 128;
layer 7, generator residual block, identical in structure to layer 6: a convolution layer with 256 kernels of size 3 × 3, stride 1, output A × A × 256; a batch normalization layer, output A × A × 256; a PReLU activation layer, output A × A × 256; a convolution layer with 128 kernels of size 1 × 1, stride 1, output A × A × 128; a batch normalization layer, output A × A × 128; element-wise addition of the block's input feature map and the output of its last batch normalization layer, output A × A × 128;
layer 8, generator convolution layer: 128 convolution kernels of size 3 × 3, stride 1; output an A × A × 128 feature map;
layer 9, generator lateral connection layer: element-wise addition of the output feature map of the layer-8 convolution and the input feature map of the layer-6 residual block; output an A × A × 128 feature map;
layer 10, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer; output a 2A × 2A × 128 feature map;
layer 11, generator up-sampling layer: up-sampling of the feature map is realized with a pixel-shuffle layer; output a 4A × 4A × 128 feature map;
layer 12, generator convolution layer: 3 separable convolution kernels of size 9 × 9, stride 1; output a 4A × 4A × 3 high-resolution pedestrian picture;
(3) based on the adversarial concept, designing a discrimination network of the small-scale pedestrian target rapid super-resolution network: extracting multiple semantic features of the target with the feature extraction structure of the InceptionV2 network, introducing a fully connected layer with 2 output classes to further extract the semantic features, and finally normalizing the output of the fully connected layer to 0-1 with a sigmoid activation function, outputting a truthfulness estimate p for a given input picture; these structures are integrated into a discrimination network $D_\theta$ that judges whether a generated high-resolution sample estimate is real or fake, where $\theta$ denotes the parameters of the discrimination network; the network structure is as follows:
layer 1, input layer: 3 input channels, resolution 4A × 4A; output a 4A × 4A × 3 feature map;
layer 2, feature extraction layer: the InceptionV2 feature extraction layers are selected as the structure of this layer; output an (A/4) × (A/4) × 256 feature map;
layer 3, fully connected layer: the three-dimensional input is flattened to one dimension; 2 output classes;
layer 4, normalization layer: the output of the layer-3 fully connected layer is normalized with a sigmoid function; 2 output classes;
(4) training the network model for the designed generation network and discrimination network; first, a low-resolution sample $p_k^{LD}$ is used as the input of the generation network $G_\omega$, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained by forward propagation through the successive convolutions; the content loss $L_{con}$ between the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$ is computed; next, the real/fake label of the high-resolution sample estimate is set to 0 and that of the high-resolution sample $p_k^{HD}$ is set to 1, giving the label value $y_k$; the discrimination network estimates the truthfulness of the inferred high-resolution picture and the original high-resolution picture, yielding the estimate $\hat{y}_k$; the binary cross-entropy loss $L_{cro}$ between the estimate $\hat{y}_k$ and the label value $y_k$ is computed; finally, the gradients of the two loss values are back-propagated; the detailed steps of this part are as follows:
substep 1: computing the forward propagation; a low-resolution sample $p_k^{LD}$ is used as the input of the generation network, and the high-resolution sample estimate $\hat{p}_k^{HD}$ is obtained through the successive convolution operations; the real/fake label $y_k$ of the high-resolution sample estimate $\hat{p}_k^{HD}$ is set to 0 and that of the corresponding high-resolution sample $p_k^{HD}$ is set to 1; the discrimination network performs truthfulness estimation on the high-resolution sample estimate $\hat{p}_k^{HD}$ and the high-resolution sample $p_k^{HD}$, yielding the estimate $\hat{y}_k$;
substep 2: computing the loss values; the loss value of the discrimination network is the binary cross-entropy $L_{cro}$ between the truthfulness estimate $\hat{y}_k$ and the real/fake label value $y_k$, with the specific formula:
$$L_{cro} = -\left[ y_k \log \hat{y}_k + (1 - y_k)\log(1 - \hat{y}_k) \right]$$
the loss value of the generation network is the mean squared error $L_{con}$ between the tail-layer feature maps, extracted by the feature extraction layers of the discriminator $D_\theta$, of the high-resolution sample estimate $\hat{p}_k^{HD}$ and the corresponding high-resolution sample $p_k^{HD}$, with the specific formula:
$$L_{con} = \frac{1}{WH}\sum_{x=1}^{W}\sum_{y=1}^{H}\left( \Phi(p_k^{HD})_{x,y} - \Phi(\hat{p}_k^{HD})_{x,y} \right)^2$$
where $\Phi(x)$ denotes the tail-layer feature map extracted by the feature extraction layers of the discriminator $D_\theta$ for a given input sample $x$, and $W$ and $H$ denote the width and height of the extracted feature map, respectively;
substep 3: performing gradient back-propagation, and saving the parameter values of the generation network and the discrimination network at each iteration;
substep 4: selecting the network parameters with the lowest sum of generation-network and discrimination-network losses over the iterations of substep 3 as the optimal network parameters, the network model corresponding to the optimal network parameters being the optimal model;
(5) performing the super-resolution operation on small-scale pedestrian targets under the intelligent roadside viewing angle using the generation network in the optimal model output by step (4).
CN202010982493.9A 2020-09-17 2020-09-17 Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment Active CN112132746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010982493.9A 2020-09-17 2020-09-17 Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment

Publications (2)

Publication Number Publication Date
CN112132746A (en) 2020-12-25
CN112132746B (en) 2022-11-11

Family

ID=73841758

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010982493.9A Active CN112132746B (en) 2020-09-17 2020-09-17 Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment

Country Status (1)

Country Link
CN (1) CN112132746B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554872A (en) * 2021-07-19 2021-10-26 昭通亮风台信息科技有限公司 Detection early warning method and system for traffic intersection and curve


Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN110136063A (en) * 2019-05-13 2019-08-16 南京信息工程大学 A kind of single image super resolution ratio reconstruction method generating confrontation network based on condition
CN110706157A (en) * 2019-09-18 2020-01-17 中国科学技术大学 Face super-resolution reconstruction method for generating confrontation network based on identity prior
CN111583109A (en) * 2020-04-23 2020-08-25 华南理工大学 Image super-resolution method based on generation countermeasure network

Non-Patent Citations (1)

Title
王志强 (Wang Zhiqiang) et al., "Image super-resolution reconstruction with generative adversarial networks" (生成式对抗网络的图像超分辨率重建), Journal of Xi'an Technological University (《西安工业大学学报》) *


Also Published As

Publication number Publication date
CN112132746B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN110706157B (en) Face super-resolution reconstruction method for generating confrontation network based on identity prior
CN109035149B (en) License plate image motion blur removing method based on deep learning
CN110009095B (en) Road driving area efficient segmentation method based on depth feature compressed convolutional network
CN111639692A (en) Shadow detection method based on attention mechanism
CN113642634A (en) Shadow detection method based on mixed attention
CN110689482A (en) Face super-resolution method based on supervised pixel-by-pixel generation countermeasure network
CN111612008A (en) Image segmentation method based on convolution network
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN113743269B (en) Method for recognizing human body gesture of video in lightweight manner
CN113538457B (en) Video semantic segmentation method utilizing multi-frequency dynamic hole convolution
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN112966747A (en) Improved vehicle detection method based on anchor-frame-free detection network
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN111898432A (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN115482518A (en) Extensible multitask visual perception method for traffic scene
CN115115831A (en) Attention-guided multi-scale context information interaction semantic segmentation method
CN112132746B (en) Small-scale pedestrian target rapid super-resolution method for intelligent roadside equipment
CN113392728B (en) Target detection method based on SSA sharpening attention mechanism
US20230154157A1 (en) Saliency-based input resampling for efficient object detection
CN114612456B (en) Billet automatic semantic segmentation recognition method based on deep learning
CN112686233B (en) Lane line identification method and device based on lightweight edge calculation
CN112446292B (en) 2D image salient object detection method and system
CN114972851A (en) Remote sensing image-based ship target intelligent detection method
CN110827238A (en) Improved side-scan sonar image feature extraction method of full convolution neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant