CN116542864A - Unmanned aerial vehicle image defogging method based on global and local double-branch network - Google Patents

Unmanned aerial vehicle image defogging method based on global and local double-branch network

Info

Publication number
CN116542864A
CN116542864A (application CN202310037485.0A)
Authority
CN
China
Prior art keywords
image
defogging
convolution
global
images
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310037485.0A
Other languages
Chinese (zh)
Inventor
李红光
龙飞宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202310037485.0A priority Critical patent/CN116542864A/en
Publication of CN116542864A publication Critical patent/CN116542864A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10004 Still image; Photographic image
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The invention provides an unmanned aerial vehicle image defogging method based on a global and local double-branch network, and belongs to the field of image processing. The method comprises the following steps: firstly, foggy images with different haze concentrations are generated from existing unmanned aerial vehicle images by using an atmospheric scattering model, and an image defogging network model based on global and local double branches is constructed; then, each foggy image in the data set is input into the global and local branches simultaneously to obtain a global structure feature map and a local detail feature map respectively, which are then input into a feature fusion module to obtain the defogging parameter corresponding to each image; next, a mixed loss function for image defogging training is constructed; the image defogging network model is trained with image pairs formed by the fog-free and foggy images, and its parameters are optimized through the loss function; finally, a new single foggy image is input into the optimal image defogging network model, and the defogged image is output. The invention can process images of various sizes, with a fast processing speed and a wide application range.

Description

Unmanned aerial vehicle image defogging method based on global and local double-branch network
Technical Field
The invention belongs to the field of image processing, and particularly relates to an unmanned aerial vehicle image defogging method based on global and local double-branch networks.
Background
In recent years, industrial development has caused serious atmospheric pollution, and severe weather such as haze occurs frequently. Images photographed in hazy weather often suffer from blurring, color shift, reduced contrast, and loss of detail. Haze therefore seriously interferes with outdoor photography and image processing, and limits the performance of tasks such as security monitoring and autonomous driving.
Because an unmanned aerial vehicle flies at high altitude and is far from its targets, its images are more easily affected by haze than ground images. The haze concentration in an image is related to the distance between the shooting point and the target: the greater the distance, the higher the haze concentration. An unmanned aerial vehicle often views the ground obliquely during flight, so near and far regions coexist in the same image, which leads to an uneven distribution of haze concentration, as shown in FIG. 1. Uneven haze adversely affects subsequent unmanned aerial vehicle image tasks such as target detection, recognition, tracking and positioning.
Conventional image defogging methods are usually aimed at uniformly distributed haze, so a method is needed that can effectively defog and sharpen unmanned aerial vehicle images with unevenly distributed haze. Existing image defogging data sets include D-HAZY, NH-HAZE, RESIDE and the like, but most of them consist of images shot on the ground, so a method for synthesizing a non-uniform foggy image data set under the unmanned aerial vehicle view angle is also needed.
Existing image defogging algorithms are divided into methods based on traditional image processing and methods based on deep learning. The first class mostly relies on the atmospheric scattering model and prior information, deriving the fog-free image from the foggy image by estimating the scene depth map and the atmospheric ambient light. Such methods are limited by the atmospheric scattering model and by prior assumptions that hold only under ideal conditions, so they have great limitations on complex foggy images.
The second class can be further divided into methods based on the atmospheric scattering model and methods based on image conversion. The former uses the atmospheric scattering model and estimates the scene depth map and atmospheric ambient light with a neural network; the latter ignores the atmospheric scattering model and generates the haze-free image by directly converting the hazy image. Deep-learning-based methods outperform traditional image-processing-based methods in processing speed, defogging effect and other respects.
Disclosure of Invention
Aiming at the problems that the haze distribution is uneven under the unmanned aerial vehicle view angle, that existing methods cannot effectively restore such images, and that training easily falls into a local optimum, the invention provides an unmanned aerial vehicle image defogging method based on a global and local double-branch network that extracts image detail information and overall structure information separately. The method does not depend on complex prior assumptions, and a clear defogged image can be obtained from a single input foggy image.
The method comprises the following specific steps:
Step one, using existing unmanned aerial vehicle images, estimating the depth map of each image with a depth estimation model, and generating foggy images with different haze concentrations using the atmospheric scattering model;
each fog-free image is used to synthesize a plurality of foggy images, and each foggy image is combined with the corresponding fog-free image into an image pair;
step two, constructing an image defogging network model based on global and local double branches;
the image defogging network comprises a global structure branch, a local detail branch and a feature fusion module;
the global structure branch acquires the overall haze structure information of the image using a Transformer structure; the local detail branch acquires the detail information of the image using depth-separable convolutions, which also reduces the amount of computation; and the feature fusion module performs weighted fusion on the feature maps of the two branches;
the local detail branch consists of 9 convolution layers and 4 pixel attention modules:
convolution layers 1, 3, 5, 7 and 9 are PW convolutions with 3 convolution kernels; convolution layer 2 is a DW convolution with a 3×3 kernel size and 3 kernels; convolution layer 4 is a DW convolution with a 5×5 kernel size and 6 kernels; convolution layer 6 is a DW convolution with a 7×7 kernel size and 9 kernels; convolution layer 8 is a DW convolution with a 3×3 kernel size and 12 kernels;
each pixel attention module consists of one DW convolution and two PW convolutions, with a ReLU activation function after the first PW convolution layer and a Sigmoid activation function after the second PW convolution layer; ReLU activation functions are also set after convolution layers 1, 3, 5, 7 and 9. The feature map size remains unchanged throughout the local detail branch.
The global structure branch consists of a pooling downsampling layer, a convolutional encoding layer, a Transformer module and an upsampling layer.
The pooling downsampling layer is adaptive pooling and downsamples the feature map by a factor of 8; the convolutional encoding layer is a convolution with a 3×3 kernel size and 16 kernels.
Each Transformer module contains a multi-head attention module and an MLP structure with skip connections; the LayerNorm normalization method and the GELU activation function are used in the module.
The feature fusion module takes two features x_1 and x_2 as input. First, x_1 is projected with a linear layer; then global average pooling GAP(·), an MLP layer F_MLP(·), Softmax and a splitting operation are used to obtain the corresponding fusion weights a_1 and a_2 and the output y, i.e., the weighted fusion y = a_1·x_1 + a_2·x_2.
Step three, inputting each foggy image in the data set into the global branch and the local branch simultaneously to obtain a global structure feature map and a local detail feature map respectively;
Step four, inputting the global structure feature map and the local detail feature map into the feature fusion module to obtain the defogging parameter K(x) corresponding to each image;
the feature fusion module calculates the global and local feature weights respectively, performs weighted fusion, and convolves the fused features to obtain the defogging parameter K(x).
Step five, constructing an image defogging training loss function;
the loss function is a weighted combination of an L1 loss function, a structural similarity SSIM loss function and a contrastive regularization loss function;
the L1 loss function is expressed as:
F_L1 = (1/N) Σ_{i=1}^{N} |J_i - GT_i|
wherein J_i is the i-th pixel of the image output by the defogging network, GT_i is the i-th pixel of the corresponding real haze-free image, and N is the total number of pixels of the image;
the structural similarity SSIM loss is expressed as:
F_SSIM = 1 - SSIM(J, GT), with SSIM(J, GT) = ((2·μ_J·μ_GT + C_1)(2·σ_JGT + C_2)) / ((μ_J^2 + μ_GT^2 + C_1)(σ_J^2 + σ_GT^2 + C_2))
wherein J is the image output by the defogging network, GT is the corresponding real fog-free image, μ_J and μ_GT are the means of the network output image and the fog-free image within a window, σ_J and σ_GT are their standard deviations within the window, σ_JGT is their covariance within the window, the window size is 11×11, and C_1 and C_2 are constants;
the contrastive regularization loss F_CR represents the ratio, in the feature space, of the distance between the network output image and the corresponding real fog-free image to the distance between the network output image and the corresponding foggy image;
The total loss function F_loss is: F_loss = ω_1·F_L1 + ω_2·F_SSIM + ω_3·F_CR, wherein ω_1, ω_2 and ω_3 are the corresponding weights.
Step six, training the image defogging network model with the image pairs formed by the fog-free and foggy images, and optimizing the parameters of the image defogging network model through the loss function;
training involves the following process:
1) For the foggy images in the image pairs, performing data enhancement by vertical flipping, horizontal flipping and random cropping;
2) Inputting the data-enhanced foggy images into the image defogging network model to obtain the defogging parameter corresponding to each foggy image, and calculating the defogged image J(x);
the formula is: J(x) = K(x)·I(x) - K(x)
wherein I(x) denotes the currently input foggy image;
3) Calculating the loss between the output image J(x) and the fog-free image in the image pair through the loss function, feeding it back to the image defogging network model, and updating the model weights;
4) Repeating step 2) until the preset number of iterations is reached, and saving the last updated parameters as the optimal parameters of the image defogging model;
step seven, inputting a new single image to be defogged into an optimal image defogging network model, and directly outputting defogged images;
the invention has the advantages that:
(1) The unmanned aerial vehicle image defogging method based on the global and local double-branch network is trained with a large amount of data and achieves a better defogging effect than traditional methods;
(2) After the image defogging network model is trained, defogging can be performed by inputting only a foggy image, and the fog-free image is output without any other information;
(3) The unmanned aerial vehicle image defogging method based on the global and local double-branch network has no requirement on image size, can process images of various sizes, and has a fast processing speed and a wide application range.
Drawings
FIG. 1 is a view image of an unmanned aerial vehicle with non-uniform haze concentration distribution in the prior art;
FIG. 2 is a flow chart of a defogging method for an image of a unmanned aerial vehicle based on global and local dual-branch networks according to the present invention;
FIG. 3 is a schematic diagram of the structure of an image defogging network model employed in the present invention;
FIG. 4 is a schematic diagram of a partial detail branching architecture employed in the present invention;
FIG. 5 is a schematic diagram of the pixel attention structure employed in the present invention;
FIG. 6 is an example of a foggy image of an unmanned aerial vehicle and its defogging results employed in the present invention;
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the accompanying drawings.
The invention discloses an unmanned aerial vehicle image defogging method based on a global and local double-branch network. Because the unmanned aerial vehicle platform may have limited on-board computing power, ordinary convolutions are replaced by depth-separable convolutions in the local detail branch, and in the global structure branch the input is downsampled before being fed into the Transformer module, which greatly reduces the amount of computation and the number of parameters while preserving the defogging effect of the model. In order to better extract image features, a pixel attention module is used to strengthen the features of key areas in the feature map, and the global structure branch feature map and the local detail branch feature map are effectively fused by weighted fusion. In order to prevent the network from falling into a local optimum during training and failing to reach the best result, the method uses a mixed loss function, i.e., a weighted sum of multiple loss functions. Simulation results show that the method has a good defogging effect on unmanned aerial vehicle images with uneven haze distribution.
As shown in fig. 2, the specific steps are as follows:
Step one, establishing an unmanned aerial vehicle foggy image data set from existing fog-free unmanned aerial vehicle images combined with the depth map of each image;
for a number of unmanned aerial vehicle images, the depth map of each image is estimated with a depth estimation model, an atmospheric scattering coefficient and atmospheric ambient light are randomly selected, and foggy images with different haze concentrations are generated with the atmospheric scattering model;
In a real hazy image the haze distribution is generally uneven, and the haze concentration is closely related to distance. Most current methods for synthesizing foggy images either require the image itself to carry depth information, i.e., an RGBD image, or fix the depth value d(x) during synthesis. The former greatly limits the range of images on which fog can be synthesized, since ordinary RGB images cannot be used; the haze distribution in images synthesized by the latter is completely uniform, which is inconsistent with reality.
The method does not require the image to contain depth information: a depth estimation model is used to acquire the image depth map, which reflects the distances in the scene, so a foggy image consistent with reality can be conveniently synthesized from it. The application range is therefore wider and the synthesis effect better.
For example, foggy images with various haze concentrations can be synthesized from images in the unmanned aerial vehicle target detection dataset VisDrone 2019 using the atmospheric scattering model of haze;
specifically, the MiDaS model is used to obtain the depth map d(x) of the current fog-free image, and the foggy image corresponding to the fog-free image is synthesized using Equation 1;
Equation 1: I(x) = J(x)·e^(-βd(x)) + A·(1 - e^(-βd(x)))
wherein J(x) is the fog-free image, I(x) is the foggy image, β is the atmospheric scattering coefficient with value range [0.2, 1], and A is the atmospheric ambient light with value range [0.7, 0.9].
In the implementation process, each pair of defogging images is synthesized into 10 foggy images, and each foggy image and the corresponding defogging image are combined into an image pair;
Step two, constructing an image defogging network model based on global and local double branches;
As shown in FIG. 3, the image defogging network includes two branches, a local detail branch and a global structure branch, followed by a feature fusion module that fuses the features of the two branches;
the global structure branch acquires the overall haze structure information of the image using a Transformer structure; the local detail branch acquires the detail information of the image using several depth-separable convolutions, which reduces the amount of computation; and the feature fusion module performs weighted fusion on the feature maps of the two branches;
1) Local detail branching structure
As shown in FIG. 4, it consists of 9 convolution layers and 4 pixel attention modules:
convolution layers 1, 3, 5, 7 and 9 are PW convolutions with 3 convolution kernels; convolution layer 2 is a DW convolution with a 3×3 kernel size and 3 kernels; convolution layer 4 is a DW convolution with a 5×5 kernel size and 6 kernels; convolution layer 6 is a DW convolution with a 7×7 kernel size and 9 kernels; convolution layer 8 is a DW convolution with a 3×3 kernel size and 12 kernels;
each pixel attention module consists of one DW convolution and two PW convolutions, with a ReLU activation function after the first PW convolution layer and a Sigmoid activation function after the second PW convolution layer; ReLU activation functions are also set after convolution layers 1, 3, 5, 7 and 9. The feature map size remains unchanged throughout the local detail branch.
The local detail branch designed in this embodiment uses depth-separable convolutions, which reduces the number of parameters and the amount of computation of the network. A depth-separable convolution consists of a DW convolution and a PW convolution: each kernel of the DW convolution is responsible for only one channel, and each channel is convolved by only one kernel, so the DW convolution significantly reduces computation compared with ordinary convolution. The PW convolution is computed like an ordinary convolution with a kernel size of 1×1×M, where M is the number of channels of the previous feature map; it performs a weighted summation of the previous feature maps along the depth direction to generate new feature maps, fusing information between different channels and compensating for the limitation of the DW convolution.
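As a minimal PyTorch sketch of the depth-separable convolution described above (a DW convolution followed by a 1×1 PW convolution), the module below is illustrative; the class name and channel numbers are not taken from the patent.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # DW convolution: one kernel per channel (groups = in_ch), spatial filtering only
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size,
                            padding=kernel_size // 2, groups=in_ch)
        # PW convolution: 1x1 kernel that mixes information across channels
        self.pw = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw(x))

x = torch.randn(1, 3, 64, 64)
print(DepthwiseSeparableConv(3, 6)(x).shape)    # torch.Size([1, 6, 64, 64])
```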
This embodiment adds a pixel attention mechanism to the depth-separable convolutions so that the network focuses more on important areas in the image. For an input feature F, three convolution layers are used to transform the feature dimension from C×H×W to 1×H×W: the first convolution layer is a DW convolution with a 3×3 kernel size, the second convolution layer is a PW convolution that reduces the number of channels, and the third convolution layer is a PW convolution with a single kernel. A feature of dimension C×H×W keeps its dimension after the first convolution layer, has its channel dimension reduced by the second convolution layer, and becomes 1×H×W after the third convolution layer. The structure of the pixel attention is shown in FIG. 5; the process can be expressed as:
Equation 2: PA = σ(Conv(δ(Conv(F))))
wherein Conv(·) denotes a convolution layer, δ(·) denotes the ReLU activation function, and σ(·) denotes the Sigmoid activation function;
the computed PA feature is multiplied element-wise with the input feature F to obtain the output of the pixel attention module.
2) Global structure branching
It consists of a pooling downsampling layer, a convolutional encoding layer, a Transformer module and an upsampling layer.
The pooling downsampling layer is adaptive pooling and downsamples the feature map by a factor of 8; the convolutional encoding layer is a convolution with a 3×3 kernel size and 16 kernels.
In this embodiment, adaptive average pooling downsampling is performed first, followed by a block-flattening operation, which reduces the feature dimension input to the Transformer module and the amount of computation.
Each Transformer module contains a multi-head attention module and an MLP structure with skip connections; the LayerNorm normalization method and the GELU activation function are used in the module. Since the Transformer input contains no position information, a one-dimensional learnable position embedding is used to preserve the position information.
3) Feature fusion module
The feature fusion module takes two features x_1 and x_2 as input. First, x_1 is projected with a linear layer; then global average pooling GAP(·), an MLP layer F_MLP(·), Softmax and a splitting operation are used to obtain the corresponding fusion weights a_1 and a_2 and the output y, i.e., the weighted fusion y = a_1·x_1 + a_2·x_2.
Compared with directly adding the different features, weighted fusion better fuses the features from the two branches and further improves the feature extraction capability of the network.
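A hedged sketch of this weighted fusion is shown below. The exact formula in the patent figure is not reproduced; the sketch follows the textual description only (projection of x_1, global average pooling, MLP, Softmax, split into a_1 and a_2, weighted sum), and pooling the sum of the two features before the MLP is an assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, ch_global: int, ch_local: int):
        super().__init__()
        self.proj = nn.Conv2d(ch_global, ch_local, kernel_size=1)    # linear projection of x1
        self.mlp = nn.Sequential(nn.Linear(ch_local, ch_local),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(ch_local, 2 * ch_local))  # produces both weight sets

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        x1 = self.proj(x1)                              # match the channel count of x2
        g = torch.mean(x1 + x2, dim=(2, 3))             # GAP over the (assumed) summed features
        w = self.mlp(g).view(x1.size(0), 2, -1, 1, 1)   # split into two per-channel weight maps
        a = torch.softmax(w, dim=1)                     # fusion weights a1, a2
        return a[:, 0] * x1 + a[:, 1] * x2              # weighted fusion y
```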
Step three, inputting each foggy image in the data set into a global double branch and a local double branch at the same time to respectively obtain a global structural feature map and a local detail feature map;
after the foggy image is input into the global structure branch, the foggy image is firstly encoded through a convolution layer 1 to obtain a feature image 1, the feature image 1 is subjected to self-adaptive average pooling to obtain a feature image 2 through 8 times downsampling, the feature image 2 is flattened in a wide-high dimension to obtain a feature image 3, the feature image 3 is subjected to a transducer module to generate a feature image 4, and the feature image 4 is subjected to 8 times upsampling to generate the global structure feature image.
The method can utilize the long-distance perception characteristic of the transducer to acquire the global structural information of the image, and the parameter number and the calculated amount can be greatly reduced by inputting the acquired global structural information into the transducer module after downsampling, so that the real-time operation of a subsequent model is facilitated.
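A hedged PyTorch sketch of this global structure branch is shown below: 3×3 convolutional encoding to 16 channels, adaptive average pooling for 8× downsampling, flattening into a token sequence, a standard Transformer encoder (multi-head attention and MLP with LayerNorm and GELU), and 8× upsampling back to the input resolution. The number of Transformer layers and attention heads are assumptions, and the learnable position embedding mentioned in the text is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalBranch(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 16, depth: int = 4, heads: int = 4):
        super().__init__()
        self.encode = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)   # convolutional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        feat = self.encode(x)                                    # B x 16 x H x W
        small = F.adaptive_avg_pool2d(feat, (h // 8, w // 8))    # 8x adaptive downsampling
        tokens = small.flatten(2).transpose(1, 2)                # B x (H/8 * W/8) x 16
        tokens = self.transformer(tokens)                        # global structure modelling
        small = tokens.transpose(1, 2).reshape(b, -1, h // 8, w // 8)
        return F.interpolate(small, size=(h, w),                 # 8x upsampling
                             mode="bilinear", align_corners=False)
```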
After a foggy image is input into the local detail branch, it passes through convolution layer 1 and convolution layer 2 to generate feature map 1; feature map 1 passes through pixel attention module 1 to generate feature map 2; feature map 1 is then processed by convolution layer 3 and convolution layer 4 to generate feature map 3; feature map 3 is concatenated with feature map 1 along the channel direction to obtain feature map 4; feature map 4 passes through pixel attention module 2 to generate feature map 5; feature map 5 passes through convolution layer 5 and convolution layer 6 to generate feature map 6; feature map 6 is concatenated with feature map 3 and feature map 1 along the channel direction to obtain feature map 7; feature map 7 passes through pixel attention module 3 to generate feature map 8, which then passes through convolution layer 7 and convolution layer 8 to generate feature map 9; feature map 9 is concatenated with feature map 6, feature map 3 and feature map 1 along the channel direction to obtain feature map 10; feature map 10 is processed by convolution layer 8 and convolution layer 9 to generate feature map 11; and feature map 11 passes through pixel attention module 4 to obtain the local detail branch feature map.
Throughout the local detail branch, the size of the feature maps remains unchanged, preserving the detail information of the image to the greatest extent.
Step four, inputting the global structure feature map and the local detail feature map into the feature fusion module to obtain the defogging parameter K(x) corresponding to each image;
the feature fusion module calculates the global and local feature weights respectively, performs weighted fusion, and convolves the fused features to obtain the defogging parameter K(x).
Step five, constructing an image defogging training loss function;
The loss function is a weighted combination of an L1 loss function, a structural similarity SSIM loss function and a contrastive regularization loss function. Compared with the L1 or L2 loss commonly used alone by other methods, this combination improves the performance of the network and prevents it from falling into a local optimum.
The image output by the defogging network is compared with the corresponding real fog-free image to calculate the loss; the loss function adopts the L1 loss, the structural similarity SSIM loss and the contrastive regularization loss, with the specific expressions as follows:
the L1 loss function is expressed as:
F_L1 = (1/N) Σ_{i=1}^{N} |J_i - GT_i|
wherein J_i is the i-th pixel of the image output by the defogging network, GT_i is the i-th pixel of the corresponding real haze-free image, and N is the total number of pixels of the image;
Compared with the L2 loss function, the L1 loss function performs better and is less prone to falling into a local optimum.
The structural similarity SSIM loss is expressed as:
F_SSIM = 1 - SSIM(J, GT), with SSIM(J, GT) = ((2·μ_J·μ_GT + C_1)(2·σ_JGT + C_2)) / ((μ_J^2 + μ_GT^2 + C_1)(σ_J^2 + σ_GT^2 + C_2))
wherein J is the image output by the defogging network, GT is the corresponding real fog-free image, μ_J and μ_GT are the means of the network output image and the fog-free image within a window, σ_J and σ_GT are their standard deviations within the window, σ_JGT is their covariance within the window, and the window size is 11×11; C_1 and C_2 are constants, set to 0.0001 and 0.0009 respectively.
The structural similarity measures the similarity between two images and is closely related to human visual perception. A larger SSIM value indicates that the two images are more similar, and the SSIM value is 1 when the two images are identical, so the SSIM loss function is chosen as 1 - SSIM.
The contrastive regularization loss F_CR represents the ratio, in the feature space, of the distance between the network output image and the corresponding real fog-free image to the distance between the network output image and the corresponding foggy image; its role is to pull the network output image closer to the corresponding real fog-free image and push it away from the corresponding foggy image.
The method adopts a VGG-19 network as the feature extraction network for contrastive regularization and pre-trains it with foggy and fog-free images to further enlarge the distance between foggy and fog-free images in the feature space. The contrastive regularization output is obtained by calculating the distances among the three input images at different feature layers and taking their weighted sum. Contrastive regularization is used only during training, guiding the training of the network through the loss function, so it does not increase the parameters or computation of the network and does not affect inference with the trained model.
The total loss function F_loss is:
F_loss = ω_1·F_L1 + ω_2·F_SSIM + ω_3·F_CR
wherein ω_1, ω_2 and ω_3 are the corresponding weights, set to 1, 1 and 0.8 respectively.
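A hedged sketch of this mixed loss is given below. The SSIM term relies on the third-party pytorch_msssim package and the contrastive term on VGG-19 features from torchvision, both used as stand-ins for the implementations in the patent; the chosen feature layers and the absence of per-layer weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from pytorch_msssim import ssim      # pip install pytorch-msssim

class MixedDehazeLoss(nn.Module):
    def __init__(self, w_l1=1.0, w_ssim=1.0, w_cr=0.8, layers=(3, 8, 17)):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)                   # frozen feature extractor
        self.vgg, self.layers = vgg, set(layers)
        self.w = (w_l1, w_ssim, w_cr)

    def _feats(self, x):
        out = []
        for i, m in enumerate(self.vgg):
            x = m(x)
            if i in self.layers:
                out.append(x)
        return out

    def forward(self, pred, clean, hazy):
        loss_l1 = F.l1_loss(pred, clean)
        loss_ssim = 1.0 - ssim(pred, clean, data_range=1.0)
        # Contrastive regularization: pull the output towards the clean anchor
        # and push it away from the hazy negative in VGG feature space.
        fp, fc, fh = self._feats(pred), self._feats(clean), self._feats(hazy)
        loss_cr = sum(F.l1_loss(a, b) / (F.l1_loss(a, c) + 1e-7)
                      for a, b, c in zip(fp, fc, fh))
        return self.w[0] * loss_l1 + self.w[1] * loss_ssim + self.w[2] * loss_cr
```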
Step six, training the image defogging network model with image pairs formed by fog-free images and foggy images of different haze concentrations, and optimizing the parameters of the image defogging network model through the loss function;
training involves the following process:
1) For the foggy images in the image pairs, performing data enhancement by vertical flipping, horizontal flipping and random cropping; the size of the randomly cropped images is 512×512;
2) Inputting the data-enhanced foggy images into the image defogging network model to obtain the defogging parameter corresponding to each foggy image, and calculating the defogged image J(x);
the formula is: J(x) = K(x)·I(x) - K(x)
wherein I(x) denotes the currently input foggy image;
3) Calculating the loss between the output image J(x) and the fog-free image in the image pair through the loss function, feeding it back to the image defogging network model, and updating the model weights;
4) Repeating step 2) until the preset number of iterations is reached, and saving the last updated parameters as the optimal parameters of the image defogging model;
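One training iteration of steps 1) to 4) can be sketched as follows, assuming a model that maps a hazy image to the defogging parameter K(x), the mixed loss sketched earlier, and a DataLoader that already applies the flip and 512×512 crop augmentation; all names here are placeholders.

```python
import torch

def train_one_epoch(model, loader, criterion, optimizer, device="cuda"):
    model.train()
    for hazy, clean in loader:
        hazy, clean = hazy.to(device), clean.to(device)
        k = model(hazy)                      # estimated defogging parameter K(x)
        pred = k * hazy - k                  # J(x) = K(x) * I(x) - K(x)
        loss = criterion(pred, clean, hazy)  # mixed L1 + SSIM + CR loss
        optimizer.zero_grad()
        loss.backward()                      # feed the loss back to the network
        optimizer.step()                     # update the model weights
```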
Compared with methods that perform image defogging by estimating the two parameters of atmospheric ambient light and transmittance, the method only needs to estimate the single defogging parameter K(x), which greatly reduces the accumulated error caused by estimating two parameters and further simplifies the network structure.
Step seven, inputting a new single image to be defogged into an optimal image defogging network model, and directly outputting defogged images;
the defogging step can realize defogging of the image only by inputting the foggy image and without inputting other information, and can adapt to the input images with different sizes, and the size of the output image is consistent with that of the input image.
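A minimal inference sketch for this step is shown below; the model class and image path are placeholders, and the only input is the hazy image itself.

```python
import torch
from torchvision.io import read_image

@torch.no_grad()
def dehaze(model, image_path: str, device: str = "cuda") -> torch.Tensor:
    hazy = read_image(image_path).float().div(255.0).unsqueeze(0).to(device)
    k = model.eval().to(device)(hazy)                 # defogging parameter K(x)
    return (k * hazy - k).clamp(0.0, 1.0).squeeze(0)  # output has the input's size
```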
Simulation verification
The training data uses the foggy and fog-free image pairs synthesized from the static images in the unmanned aerial vehicle target detection dataset VisDrone 2019. The training set is built from the VisDrone 2019 static-image training set, the validation set from the VisDrone 2019 static-image validation set, and the test set from 10 images in the VisDrone 2019 static-image test set; each image is synthesized into 10 hazy images with different haze concentrations, giving a total of 64710 training image pairs, 5480 validation image pairs and 100 test image pairs.
The training images are input into the constructed image defogging network. The optimizer is Adam with an initial learning rate of 0.0001; the learning-rate schedule is cosine annealing, which gradually reduces the learning rate to 0.01 of the initial value during training; and the maximum number of iterations is set to 300 epochs.
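The optimizer and learning-rate schedule described above can be set up as in the following sketch, assuming the model and the training function from the earlier sketches; CosineAnnealingLR with eta_min equal to 0.01 of the initial rate reproduces the stated decay.

```python
import torch

def configure_optimization(model, epochs: int = 300, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=lr * 0.01)   # decay to 0.01x the initial lr
    return optimizer, scheduler
```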
During the iterations, the loss function value gradually decreases until it stabilizes; when the loss has dropped to its lowest value and levels off, the network is sufficiently fitted. After each epoch, the validation set is used to evaluate the performance of the network, and if the result is better than the best of the previous epochs, the current epoch's weights are saved.
The peak signal-to-noise ratio (PSNR) of the model on the validation set is 20.81 and the structural similarity (SSIM) is 0.8211; on the test set the PSNR is 21.21 and the SSIM is 0.9296. Defogging results for some test-set images are shown in FIG. 6.

Claims (6)

1. The unmanned aerial vehicle image defogging method based on the global and local double-branch networks is characterized by comprising the following specific steps:
firstly, using existing unmanned aerial vehicle images, estimating the depth map of each image with a depth estimation model, and generating foggy images with different haze concentrations using the atmospheric scattering model; constructing an image defogging network model based on global and local double branches;
the image defogging network comprises a global structure branch, a local detail branch and a feature fusion module;
the local detail branch consists of 9 convolution layers and 4 pixel attention modules; the global structure branch consists of a pooling downsampling layer, a convolutional encoding layer, a Transformer module and an upsampling layer; the feature fusion module takes two features x_1 and x_2 as input: first, x_1 is projected with a linear layer, and then global average pooling GAP(·), an MLP layer F_MLP(·), Softmax and a splitting operation are used to obtain the corresponding fusion weights a_1 and a_2 and the output y, i.e., the weighted fusion y = a_1·x_1 + a_2·x_2;
then, inputting each foggy image in the data set into the global and local branches simultaneously to obtain a global structure feature map and a local detail feature map respectively, and inputting them into the feature fusion module to obtain the defogging parameter K(x) corresponding to each image;
then, constructing an image defogging training loss function, which is a weighted combination of an L1 loss function, a structural similarity SSIM loss function and a contrastive regularization loss function;
wherein the L1 loss function is expressed as:
F_L1 = (1/N) Σ_{i=1}^{N} |J_i - GT_i|
in which J_i is the i-th pixel of the image output by the defogging network, GT_i is the i-th pixel of the corresponding real haze-free image, and N is the total number of pixels of the image;
the structural similarity SSIM loss is expressed as:
F_SSIM = 1 - SSIM(J, GT), with SSIM(J, GT) = ((2·μ_J·μ_GT + C_1)(2·σ_JGT + C_2)) / ((μ_J^2 + μ_GT^2 + C_1)(σ_J^2 + σ_GT^2 + C_2))
wherein J is the image output by the defogging network, GT is the corresponding real fog-free image, μ_J and μ_GT are the means of the network output image and the fog-free image within a window, σ_J and σ_GT are their standard deviations within the window, σ_JGT is their covariance within the window, and C_1 and C_2 are constants;
the contrastive regularization loss F_CR represents the ratio, in the feature space, of the distance between the network output image and the corresponding real fog-free image to the distance between the network output image and the corresponding foggy image;
the total loss function F_loss is:
F_loss = ω_1·F_L1 + ω_2·F_SSIM + ω_3·F_CR
wherein ω_1, ω_2 and ω_3 are the corresponding weights;
finally, training the image defogging network model with image pairs formed by the fog-free and foggy images, and optimizing the parameters of the image defogging network model through the loss function; and for a new single image to be defogged, inputting it into the optimal image defogging network model and directly outputting the defogged image.
2. The unmanned aerial vehicle image defogging method based on the global and local double-branch network according to claim 1, wherein each fog-free image is used to synthesize a plurality of foggy images, and each foggy image is combined with the corresponding fog-free image into an image pair.
3. The unmanned aerial vehicle image defogging method based on the global and local double-branch network according to claim 1, wherein, in the local detail branch, convolution layers 1, 3, 5, 7 and 9 are PW convolutions with 3 convolution kernels; convolution layer 2 is a DW convolution with a 3×3 kernel size and 3 kernels; convolution layer 4 is a DW convolution with a 5×5 kernel size and 6 kernels; convolution layer 6 is a DW convolution with a 7×7 kernel size and 9 kernels; convolution layer 8 is a DW convolution with a 3×3 kernel size and 12 kernels;
each pixel attention module consists of one DW convolution and two PW convolutions, with a ReLU activation function after the first PW convolution layer and a Sigmoid activation function after the second PW convolution layer; ReLU activation functions are also set after convolution layers 1, 3, 5, 7 and 9; the feature map size remains unchanged throughout the local detail branch.
4. The unmanned aerial vehicle image defogging method based on the global and local double-branch network according to claim 1, wherein, in the global structure branch, the pooling downsampling layer is adaptive pooling and downsamples the feature map by a factor of 8; the convolutional encoding layer is a convolution with a 3×3 kernel size and 16 kernels;
each Transformer module contains a multi-head attention module and an MLP structure with skip connections, and the LayerNorm normalization method and the GELU activation function are used in the module.
5. The unmanned aerial vehicle image defogging method based on the global and local double-branch network according to claim 1, wherein the feature fusion module calculates the global and local feature weights respectively, performs weighted fusion, and convolves the fused features to obtain the defogging parameter K(x).
6. The unmanned aerial vehicle image defogging method based on the global and local double-branch network according to claim 1, wherein training the image defogging network model comprises the following procedure:
1) For the foggy images in the image pairs, performing data enhancement by vertical flipping, horizontal flipping and random cropping;
2) Inputting the data-enhanced foggy images into the image defogging network model to obtain the defogging parameter corresponding to each foggy image, and calculating the defogged image J(x);
the formula is: J(x) = K(x)·I(x) - K(x)
wherein I(x) denotes the currently input foggy image;
3) Calculating the loss between the output image J(x) and the fog-free image in the image pair through the loss function, feeding it back to the image defogging network model, and updating the model weights;
4) Repeating step 2) until the preset number of iterations is reached, and saving the last updated parameters as the optimal parameters of the image defogging model.
CN202310037485.0A 2023-01-09 2023-01-09 Unmanned aerial vehicle image defogging method based on global and local double-branch network Pending CN116542864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310037485.0A CN116542864A (en) 2023-01-09 2023-01-09 Unmanned aerial vehicle image defogging method based on global and local double-branch network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310037485.0A CN116542864A (en) 2023-01-09 2023-01-09 Unmanned aerial vehicle image defogging method based on global and local double-branch network

Publications (1)

Publication Number Publication Date
CN116542864A true CN116542864A (en) 2023-08-04

Family

ID=87449428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310037485.0A Pending CN116542864A (en) 2023-01-09 2023-01-09 Unmanned aerial vehicle image defogging method based on global and local double-branch network

Country Status (1)

Country Link
CN (1) CN116542864A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116977651A (en) * 2023-08-28 2023-10-31 河北师范大学 Image denoising method based on double-branch and multi-scale feature extraction
CN116977651B (en) * 2023-08-28 2024-02-23 河北师范大学 Image denoising method based on double-branch and multi-scale feature extraction
CN117576536A (en) * 2024-01-18 2024-02-20 佛山科学技术学院 Foggy image fusion model and method
CN117576536B (en) * 2024-01-18 2024-04-23 佛山科学技术学院 Foggy image fusion model and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination