CN115861841A - SAR image target detection method combined with lightweight large convolution kernel - Google Patents

Info

Publication number
CN115861841A
CN115861841A
Authority
CN
China
Prior art keywords
layer, convolution, SAR image, branch, convolution kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211573253.9A
Other languages
Chinese (zh)
Inventor
李钊
孙晓晖
许涛
刘永涛
田西兰
杨雪亚
刘小平
常沛
高晶晶
张玉营
李玉景
朱程涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 38 Research Institute
Original Assignee
CETC 38 Research Institute
Application filed by CETC 38 Research Institute
Priority to CN202211573253.9A
Publication of CN115861841A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a SAR image target detection method that incorporates a lightweight large convolution kernel, belonging to the technical field of SAR image target detection. The invention provides a lightweight large-convolution-kernel layer that expands the receptive field while greatly reducing the number of model parameters compared with a conventional convolution kernel, so that detection accuracy is preserved and the lightweight model is easy to deploy on embedded devices. The strategy designed by the invention uses a multi-branch model during training, so that more features can be learned through the additional branches, and uses a single-branch model during testing, effectively reducing the memory-access time cost incurred by multiple branches.

Description

SAR image target detection method combined with lightweight large convolution kernel
Technical Field
The invention relates to the technical field of SAR image target detection, and in particular to a SAR image target detection method that incorporates a lightweight large convolution kernel.
Background
China's aerospace reconnaissance equipment now offers multi-source, multi-band, multi-mode, multi-application, high-resolution ground imaging. How to process and interpret this image information quickly and effectively on an embedded platform has become a pressing problem. Although existing neural network models can extract useful information from SAR images, their parameter counts are often too large for efficient porting to an embedded platform; at the same time, a large parameter count slows down inference and cannot meet the strict timeliness requirements of reconnaissance.
At present, lightweight neural network methods fall mainly into three categories. The first is to train a large model and then make a small model approach its representational capability through devices such as control variables and loss functions. The second is Neural Architecture Search (NAS): a search space and a search strategy are defined, candidate models satisfying the constraints are found in the space according to the search strategy, each is evaluated, and the evaluation feedback drives the next round of search. The third is to design a lightweight model manually, where optimization of individual modules reduces the parameter count while maintaining accuracy.
Both knowledge distillation and neural architecture search present problems on embedded devices. Knowledge distillation is difficult to carry out: an additional teacher network must be designed, the difficulty of distillation varies across tasks, and there is no guarantee that the distilled small model will be usable. Neural architecture search consumes large amounts of computing resources, and the resulting models have poor interpretability. Manual lightweight model design is therefore the approach best suited to deploying a neural network model on embedded devices in the current environment.
Manual lightweight design has long been the preferred scheme for deploying models on mobile terminal devices. In particular, depthwise separable convolution is a common lightweight technique; its best-known embodiment, the MobileNet series, has now reached its third generation, and its performance and degree of lightweighting are widely praised in academia and industry.
Neural networks follow an inherent principle: the larger the receptive field, the more global information can be captured, which benefits a series of downstream tasks such as target detection, semantic segmentation and pose recognition. However, the means proposed by previous research for enlarging the receptive field, such as depthwise separable convolution, often still use kernels that are not truly "large" — for example 3×3 kernels, or dilated variants of them, whose receptive field remains small (on the order of 5).
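As an aside (not part of the patent text), the receptive field of a stack of convolutions can be checked with the standard recurrence: each layer adds (k_eff − 1) · jump to the receptive field, where k_eff = d·(k − 1) + 1 for dilation d. A minimal sketch:

```python
def effective_kernel(k: int, dilation: int = 1) -> int:
    """Effective kernel size of a dilated convolution."""
    return dilation * (k - 1) + 1

def receptive_field(layers) -> int:
    """Receptive field of a stack of conv layers.

    Each layer is a (kernel, stride, dilation) tuple."""
    rf, jump = 1, 1
    for k, stride, dilation in layers:
        rf += (effective_kernel(k, dilation) - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions reach a receptive field of 5; five of
# them are needed to reach the 11x11 receptive field discussed later.
print(receptive_field([(3, 1, 1)] * 2))  # 5
print(receptive_field([(3, 1, 1)] * 5))  # 11
```

This also shows why dilation alone grows the field cheaply: a single 3×3 kernel with dilation 3 already has an effective kernel size of 7.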
For the above reasons, porting a SAR image target detection model to embedded devices remains a challenge, and a SAR image target detection method incorporating a lightweight large convolution kernel is therefore proposed.
Disclosure of Invention
The technical problem to be solved by the invention is: how to lighten a SAR image target detection model for embedded devices, reducing the model's parameter count and inference time while preserving recognition accuracy, by providing a SAR image target detection method that incorporates a lightweight large convolution kernel.
The invention solves the technical problems through the following technical scheme, and the invention comprises the following steps:
s1: data set collection and production
Collecting an SAR image, preprocessing the SAR image to generate an SAR image data set, and dividing the SAR image data set into a training set and a test set;
s2; model training
Training the model by utilizing the SAR image in the training set to obtain a trained SAR image target detection model;
s3: model reasoning
And after the model is trained, inputting the SAR image of the test set into the SAR image target detection model for reasoning to obtain a detection result.
Further, in step S1, the preprocessing includes quantization, normalization, data augmentation and cropping of the SAR image.
Further, in step S2, the SAR image target detection model comprises a backbone network, a multi-scale feature interaction structure and three detection heads; the backbone network is connected to the multi-scale feature interaction structure, and the multi-scale feature interaction structure is connected to each of the three detection heads.
Furthermore, the backbone network comprises an initialization layer, three multi-branch large-convolution-kernel blocks, three transition layers and a spatial pyramid pooling layer; the multi-scale feature interaction structure comprises two upsampling layers with an attention mechanism and two downsampling layers with an attention mechanism; and each detection head comprises ordinary convolutional layers and a dimension conversion module. The input SAR image is processed by the initialization layer into a feature map of a set size and then passed in turn through the three groups of multi-branch large-convolution-kernel block plus transition layer; each transition layer halves the length and width of the feature map and doubles the number of channels, so the three transition layers yield feature maps of three sizes, denoted t1, t2 and t3. The feature map t3 is input into the spatial pyramid pooling layer to obtain high-level semantic information s3; s3 is spliced with the output of the last transition layer to form the feature map ts3. The feature map ts3 passes through one upsampling layer with attention to form the feature map up2, and up2 passes through the other upsampling layer with attention to form the feature map up1. The feature maps up1, t2 and up2 are spliced and passed through one downsampling layer with attention to form the feature map d2; d2 is spliced with ts3 and passed through the other downsampling layer with attention to form the feature map d3. The feature maps up1, d2 and d3 are the detection feature maps at three scales and, as the output of the multi-scale feature interaction structure, are input into the three detection heads respectively; each detection head predicts category and position through its ordinary convolutional layers and dimension conversion module.
Furthermore, the initialization layer comprises two ordinary convolutional layers and two depthwise separable convolutional layers with a group number of two; the first ordinary convolutional layer, the first depthwise separable convolutional layer, the second ordinary convolutional layer and the second depthwise separable convolutional layer are connected in sequence.
Furthermore, the multi-branch large-convolution-kernel block comprises three ordinary convolutional layers, two GELU-activation-and-BN layers and one large-kernel convolution block. The first ordinary convolutional layer, the first GELU-activation-and-BN layer, the large-kernel convolution block and the second ordinary convolutional layer are connected in sequence to form the main path; the input feature map is superimposed directly onto the output to form one branch; at the same time, the input feature map also passes through an ordinary convolutional layer and the second GELU-activation-and-BN layer, and the resulting feature map is superimposed onto the output to form another branch.
Furthermore, an "ordinary convolutional layer" in the present invention refers to a convolutional layer with a kernel size of 3×3 or 1×1; this type of convolutional layer is commonly used in various neural network models, hence the name.
Furthermore, the large-kernel convolution block comprises a depthwise separable convolutional layer, a dilated convolutional layer and an ordinary convolutional layer connected in sequence; the equivalent kernel size of the large-kernel convolution block is larger than 5×5, and a convolutional layer with a larger receptive field is realized by superimposing the feature maps of several different receptive fields.
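A hedged PyTorch sketch of this block follows (not part of the patent text). The dilated layer is taken as a 3×3 kernel with dilation 3, which is the reading consistent with both the receptive field of 11 and the 9MN parameter count stated in the detailed description; class and attribute names are our own.

```python
import torch
import torch.nn as nn

class LargeKernelConvBlock(nn.Module):
    """Sketch of the Large Kernel Convolution Block (LKCB):
    a depthwise separable conv, a dilated conv, then a 1x1 conv,
    so that several small receptive fields stack into one large one.
    Exact kernel/dilation choices are inferred from the text."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # depthwise 5x5 + pointwise 1x1 (depthwise separable convolution)
        self.dw = nn.Conv2d(in_ch, in_ch, 5, padding=2, groups=in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)
        # dilated 3x3 with dilation 3: effective kernel 7, pushing the
        # stacked receptive field to 5 + (7 - 1) = 11
        self.dilated = nn.Conv2d(out_ch, out_ch, 3, padding=3, dilation=3)
        # final ordinary 1x1 convolution
        self.proj = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, x):
        return self.proj(self.dilated(self.pw(self.dw(x))))

x = torch.randn(1, 32, 40, 40)
y = LargeKernelConvBlock(32, 64)(x)
print(y.shape)  # torch.Size([1, 64, 40, 40])
```

The padding values are chosen so that the spatial size is preserved, matching the block's use inside a stage whose downsampling is delegated to the transition layer.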
Furthermore, the transition layer adopts a sub-pixel sampling strategy: the feature map of size H × W × C is split into small blocks consisting of several grids, the blocks are re-spliced into several sub-feature maps according to the position of each block, and the number of channels is then reduced through an ordinary convolutional layer before output.
Furthermore, the upsampling layer with attention expands the lower-level feature map to twice its size by neighborhood interpolation, splices it with the feature map of the same size, and finally inputs the result into an attention layer for channel-level attention weighting. The downsampling layer with attention first halves the length and width with an ordinary convolutional layer, then splices the result with the feature map of the same level, and finally inputs it into the attention layer; the attention layer realizes attention weighting through the coexistence of channel attention and axial attention.
Further, in step S3, the multi-branch model in the multi-branch large-convolution-kernel block is converted into a single-branch model at inference time. The conversion is as follows: the parameters trained in the multi-branch large-convolution-kernel block — the main path containing the large-kernel convolution block, the branch containing the second GELU-activation-and-BN layer, and the branch carrying the input feature map — are retained; owing to the scalability of convolution kernels, the two branches are expanded into convolution kernel parameters equivalent to an 11 × 11 receptive field, so that the kernels of the main path, the GELU-and-BN branch and the input branch all have the same size; finally the kernels are merged by element-wise addition (add).
Compared with the prior art, the invention has the following advantages:
1. A target detection algorithm for SAR images using a larger convolution kernel (11 × 11) is realized. A lightweight large-convolution-kernel layer is provided that expands the receptive field while greatly reducing the model parameters relative to a conventional convolution kernel, so that detection accuracy is ensured and the lightweight model can be conveniently deployed on embedded devices.
2. The designed strategy uses a multi-branch model during training, so that more features can be learned through the additional branches, and uses a single-branch model during testing, effectively reducing the memory-access time cost incurred by multiple branches.
Drawings
FIG. 1 is an overall architecture diagram of a SAR image target detection model incorporating a lightweight large convolution kernel in an embodiment of the present invention;
FIG. 2a is a schematic structural diagram of an initialization layer in an embodiment of the invention;
FIG. 2b is a diagram illustrating the structure of a multi-branch large convolution kernel block according to an embodiment of the present invention;
FIG. 2c is a diagram illustrating the structure of a large kernel convolution block according to an embodiment of the present invention;
FIG. 3a is a schematic flow chart of an implementation of a spatial pyramid pooling layer in an embodiment of the present invention;
FIG. 3b is a schematic diagram of an upsampling layer with an attention mechanism in an embodiment of the present invention;
FIG. 3c is a schematic diagram of a downsampling layer with attention mechanism in an embodiment of the present invention;
FIG. 3d is a schematic structural diagram of an attention layer in an embodiment of the present invention;
FIG. 4 is a flow chart illustrating an implementation of a conversion layer with sub-pixel sampling according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of an implementation of converting a multi-branch module into a single-branch module according to an embodiment of the present invention.
Detailed Description
The following examples are given for the detailed implementation and specific operation of the present invention, but the scope of the present invention is not limited to the following examples.
This embodiment provides a technical scheme: a SAR image target detection method incorporating a lightweight large convolution kernel, comprising three steps — data collection and preparation, model training, and inference — as follows:
(1) Collection and production of the data set: SAR images are collected with both large and small targets in mind; after collection, the images are put through a series of preprocessing steps (including quantization, data augmentation, cropping, etc.) and cut to a size of 640 × 640 to suit training;
(2) Training of the model: the cropped training set is trained with the proposed SAR image target detection algorithm incorporating a lightweight large convolution kernel. The algorithm uses a model built from large-scale convolution kernels and a carefully designed hierarchy, so that the model maintains a large receptive field while reducing the parameter count as much as possible. At the same time, the invention combines a sub-pixel downsampling layer with a multi-branch structure during training to ensure that the model obtains more effective information;
(3) Inference of the model: the trained model discards the multi-branch structure and uses a single-branch structure at inference time. The invention restores the multi-branch feature maps to a single branch in a principled way, ensuring that the model has learned sufficient features while, because the number of branches is reduced, shortening the time the model spends holding the current feature map when invoking commands such as add and concatenate (concat). This effectively reduces inference time and keeps the model highly real-time on embedded devices.
The preprocessing in step (1) comprises a series of methods commonly used in image processing and neural networks, including but not limited to SAR image quantization, normalization, data augmentation and cropping. Many data augmentation methods exist, including but not limited to horizontal and vertical translation and rotation, horizontal and vertical flipping, gray-level contrast transformation, affine transformation, random target pasting, and the like.
As shown in FIG. 1, the SAR image target detection model incorporating a lightweight large convolution kernel proposed in step (2) comprises the following parts: a backbone network (backbone), a multi-scale feature interaction structure (neck) and detection heads (detect head). The main improvement of the invention is that the backbone uses a multi-branch network formed by stacking lightweight large convolution kernels, ensuring a richer receptive field; at the same time, the common transition-layer means such as pooling and large-stride convolution are replaced by the sub-pixel downsampling proposed by the invention, preserving more effective information. The backbone network consists of an initialization layer (init), multi-branch large-convolution-kernel blocks (stage), transition layers (transition) and Spatial Pyramid Pooling (SPP). In the multi-scale feature interaction part, the invention designs upsampling and downsampling layers with an attention mechanism, so that the model makes full use of effective information at different scales. Finally, the invention uses a dual-branch detection head to regress category and coordinates separately, avoiding the coupling of regression targets caused by a single-branch detection head.
The specific detection flow of the target detection algorithm provided by the invention is as follows:
1. Input the images, uniformly cropped to 640 × 640 × 3, into the model;
2. Backbone network: after the initialization layer the feature map size is 160 × 160 × 32; the feature map is then passed in turn through the multi-branch large-convolution-kernel blocks and transition layers, each transition layer halving the feature map size and doubling the number of channels. After the three transition layers the feature map sizes are 80 × 80 × 64, 40 × 40 × 128 and 20 × 20 × 256, denoted t1, t2 and t3 respectively. The 20 × 20 × 256 feature map t3 is then input into the spatial pyramid pooling layer to obtain more effective high-level semantic information s3, and the two feature maps are spliced to form the feature map ts3 of size 20 × 20 × 512.
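The shape bookkeeping above can be verified with a quick pure-Python check (not part of the patent text): each transition halves height and width and doubles channels, starting from the 160 × 160 × 32 output of the initialization layer.

```python
def transition(shape):
    """One transition layer: halve H and W, double C."""
    h, w, c = shape
    return (h // 2, w // 2, c * 2)

shape = (160, 160, 32)     # after the initialization layer
stages = []
for _ in range(3):         # three transition layers
    shape = transition(shape)
    stages.append(shape)

print(stages)
# [(80, 80, 64), (40, 40, 128), (20, 20, 256)]  -> t1, t2, t3
```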
3. Multi-scale feature interaction structure: ts3 passes through an upsampling layer with attention (up-attention) to form the feature map up2 of size 40 × 40 × 128, and up2 passes through an upsampling layer with attention once more to form the feature map up1 of size 80 × 80 × 64. The feature map up1 is then spliced with t2 and up2 and passed through a downsampling layer with attention (down-attention) to form the feature map d2; d2 is spliced with ts3 and passed through another downsampling layer with attention to form the feature map d3. The feature maps up1, d2 and d3 are the detection feature maps at different scales and serve as the output of the multi-scale feature interaction structure;
4. Detection heads: the three feature maps up1, d2 and d3 of different scales are input into the three detection heads respectively, and each detection head predicts category and position through two branches.
As shown in FIG. 2a, the initialization layer used in step (2) contains four convolutional layers: one ordinary 3×3 convolutional layer, one ordinary 1×1 convolutional layer, and two depthwise separable convolutional layers with a group number of 2. After this series of operations, the length and width of the feature map are each halved twice (640 → 160).
As shown in FIG. 2b, the multi-branch large-convolution-kernel block in step (2) comprises, on its main path, a leading 1×1 convolutional layer, a GELU activation and BN (Batch Normalization) layer, and a Large Kernel Convolution Block (LKCB). It also contains two branches: one superimposes the input feature map directly onto the output, and the other passes the input feature map through an ordinary 1×1 convolutional layer followed by a GELU activation and BN layer before superimposing the result onto the output. In this way the multi-branch block retains both a highly effective feature map obtained from a very large receptive field and a relatively shallow feature map.
As shown in FIG. 2c, the Large Kernel Convolution Block (LKCB) in step (2) is a convolutional layer in which a large receptive field is realized by superimposing feature maps of several different receptive fields. To realize a convolutional layer with a receptive field of 11, the invention uses the following layered scheme: the first layer is a depthwise separable convolutional layer with a kernel size of 5, the second layer is a dilated convolutional layer with a kernel size of 3 and a dilation rate of 3 (effective kernel size 7), and the third layer is an ordinary 1×1 convolutional layer. This scheme greatly reduces the parameter count while retaining a large receptive field; the parameter counts are calculated as follows:
For a conventional 11 × 11 convolutional layer with M input channels and N output channels, the parameter count of the single layer is 11² × M × N, i.e. 121MN.
If, following the currently popular scheme, a cascade of 3 × 3 convolutions is used to reach an 11 × 11 receptive field, then five 3 × 3 convolutional layers are required, with a parameter count of 3² × M × N × 5, i.e. 45MN.
If dilated convolutions with a kernel size of 3 and a dilation rate of 3 are used, two of them must be stacked together with an additional 1 × 1 convolutional layer, giving a parameter count of 3² × M × N × 2 + 1² × M × N, i.e. 19MN.
With the LKCB proposed by the invention, one depthwise separable convolutional layer with a kernel size of 5 is required, with a parameter count of 5² × M + M × N; one dilated convolutional layer with a kernel size of 3 and a dilation rate of 3, with a parameter count of 3² × M × N; and one ordinary 1 × 1 convolutional layer, with a parameter count of 1² × M × N. The total parameter count is therefore (25 + 11N)M.
In the model the minimum value of N is 64. Computed at this minimum, an ordinary 11 × 11 convolutional layer needs 7744M parameters, the 3 × 3 cascade needs 2880M in total, the dilated-convolution scheme needs 1216M, and the proposed LKCB needs only 729M — fewer than all the other schemes, with the LKCB's advantage growing as N increases. This provides convenience for carrying the SAR image target detection model on embedded devices.
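The four parameter counts can be checked mechanically; a small pure-Python sketch (not part of the patent text; the dilated layer is taken as 3 × 3, the reading consistent with the 19MN and (25 + 11N)M figures):

```python
M = 1  # results below are per input channel (multiples of M)

def params(N):
    """Parameter counts of the four schemes for N output channels."""
    ordinary_11 = 11**2 * M * N                   # 121MN
    cascade_3x3 = 3**2 * M * N * 5                # 45MN
    dilated     = 3**2 * M * N * 2 + 1 * M * N    # 19MN
    lkcb        = 5**2 * M + 11 * M * N           # (25 + 11N)M
    return ordinary_11, cascade_3x3, dilated, lkcb

print(params(64))  # (7744, 2880, 1216, 729)
```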
As shown in FIG. 3a, the SPP layer in step (2) is a typical spatial pyramid pooling layer to which the invention makes some improvements while retaining the pooling: the input feature map undergoes pooling operations with pooling scales of 5 × 5, 9 × 9 and 13 × 13 as well as an ordinary 3 × 3 convolution, the results are then spliced, and finally an ordinary 1 × 1 convolutional layer reduces the number of channels to 256 for output.
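A hedged PyTorch sketch of this modified SPP layer (not part of the patent text; the use of stride-1 max pooling and the class/attribute names are assumptions):

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Sketch of the modified spatial pyramid pooling layer:
    parallel pooling at scales 5/9/13 plus an ordinary 3x3 conv,
    splicing, then a 1x1 conv reducing channels to 256."""
    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        # stride-1, same-padding max pooling keeps the spatial size
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)
        )
        self.conv3 = nn.Conv2d(in_ch, in_ch, 3, padding=1)
        self.reduce = nn.Conv2d(in_ch * 4, out_ch, 1)

    def forward(self, x):
        feats = [p(x) for p in self.pools] + [self.conv3(x)]
        return self.reduce(torch.cat(feats, dim=1))

y = SPP(256)(torch.randn(1, 256, 20, 20))
print(y.shape)  # torch.Size([1, 256, 20, 20])
```

With a 20 × 20 × 256 input (the backbone's t3), the output keeps the spatial size and the 256 channels, matching the s3 described in the text.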
As shown in FIG. 4, the transition layer in step (2) serves to reduce the length and width of the feature map. The invention adopts a sub-pixel sampling strategy: the feature map of size H × W × C is split into small blocks consisting of 2 × 2 grids, and the blocks are re-spliced by position into 4 sub-feature maps, i.e. a tensor of size H/2 × W/2 × 4C; the number of channels is then reduced through an ordinary 1 × 1 convolutional layer, giving a final output of size H/2 × W/2 × 2C.
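The sub-pixel downsampling described above matches what PyTorch exposes as `PixelUnshuffle`; a sketch of one transition layer under that assumption (this pairing is our reading of the patent, not a confirmed implementation detail):

```python
import torch
import torch.nn as nn

class SubPixelTransition(nn.Module):
    """H x W x C -> H/2 x W/2 x 2C via pixel unshuffle + 1x1 conv."""
    def __init__(self, channels: int):
        super().__init__()
        # rearranges each 2x2 spatial grid into channels: C -> 4C
        self.unshuffle = nn.PixelUnshuffle(2)
        # ordinary 1x1 convolution reducing 4C channels to 2C
        self.reduce = nn.Conv2d(4 * channels, 2 * channels, 1)

    def forward(self, x):
        return self.reduce(self.unshuffle(x))

y = SubPixelTransition(32)(torch.randn(1, 32, 160, 160))
print(y.shape)  # torch.Size([1, 64, 80, 80])
```

Unlike pooling or strided convolution, no pixel is discarded before the learned 1 × 1 reduction, which is the "more effective information" argument made in the text.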
The structures of the upsampling and downsampling layers with attention in step (2) are shown in FIGS. 3b and 3c. The upsampling layer expands the lower-level feature map to twice its size by neighborhood interpolation, splices it with the feature map of the same size, and finally inputs the result into the attention layer for channel-level attention weighting; the downsampling layer first halves the length and width with an ordinary convolutional layer of stride 2, then splices the result with the feature map of the same level, and finally inputs it into the attention layer. As for the attention mechanism, the attention layer of the invention combines channel attention and axial attention, as shown in FIG. 3d. Axial attention applies axial average pooling and max pooling to the input feature map, splices the results, passes them through an ordinary 7 × 7 convolutional layer to obtain an axial weight map, and multiplies this map with the input feature map to obtain the axial feature map. Channel attention likewise screens the feature map with a weight layer after average pooling and max pooling: a channel attention map is learned through a two-layer fully connected network (MLP), and the learned map is multiplied with the input feature map to obtain the channel-direction feature map. Finally, the channel-direction and axial feature maps are spliced, and an ordinary 1 × 1 convolutional layer reduces the dimension back to that of the input feature map to give the final output.
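A hedged sketch of this attention layer (not part of the patent text): the channel branch follows the familiar CBAM-style avg/max pool + shared MLP pattern, the axial branch pools along the channel axis and applies a 7 × 7 convolution, and the two results are spliced and reduced by a 1 × 1 convolution. Names, the reduction ratio, and the sigmoid gating are assumptions.

```python
import torch
import torch.nn as nn

class ChannelAxialAttention(nn.Module):
    """Sketch of the attention layer combining channel and axial
    attention; the exact fusion in the patent's FIG. 3d may differ."""
    def __init__(self, ch: int, reduction: int = 8):
        super().__init__()
        # channel attention: avg/max pooled vectors -> shared 2-layer MLP
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(),
            nn.Linear(ch // reduction, ch),
        )
        # axial attention: channel-wise avg/max maps -> 7x7 conv
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)
        # splice both directions and reduce back to the input dimension
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.sigmoid(
            self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3)))
        ).view(b, c, 1, 1)
        chan = x * w                              # channel-direction map
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        axial = x * torch.sigmoid(self.spatial(s))  # axial map
        return self.fuse(torch.cat([chan, axial], dim=1))

y = ChannelAxialAttention(64)(torch.randn(1, 64, 40, 40))
print(y.shape)  # torch.Size([1, 64, 40, 40])
```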
The final detection head of step (2) adopts a design similar to the YOLO series: the model outputs three groups of feature maps of different scales, 80 × 80 × 256, 40 × 40 × 256 and 20 × 20 × 256. The invention designs a dual-branch detection head in which 1 × 1 convolutions split the feature map of each scale into two branches, and dimension transformation is carried out with 1 × 1 convolutional layers whose channel numbers are 3N, 12 and 3 (3 means each point can regress anchor boxes of three sizes, N is the number of target classes, 12 = 3 × 4 represents the xy coordinates of the upper-left and lower-right corners of the target, and 3 = 3 × 1 represents the intersection-over-union of the target with the ground truth), so that the target classes, coordinates and intersection-over-union represented by the three groups of feature maps are predicted.
The method of converting the multi-branch model in the multi-branch large-convolution-kernel block into a single-branch model in step (3) is shown in FIG. 5. First, the parameters obtained after training are retained for the middle branch of FIG. 5 (the GELU activation + BN branch) and the right-hand branch (the 1 × 1 convolution + BN + GELU branch). Owing to the scalability of convolution kernels, these parameters are expanded into convolution kernel parameters equivalent to an 11 × 11 receptive field (whether a 3 × 3 or a 1 × 1 convolution, each can be expanded to the equivalent 11 × 11 kernel), so that the kernels of the middle and right branches have the same size as the main path and can be merged by element-wise addition (add); this is the multi-branch-to-single-branch strategy used at inference. Note that all merged parameters of the model are convolution kernel parameters, not biases (bias). The method effectively retains the features of the trained large-kernel model, transfers memory efficiently, and shortens the time spent storing the current weights when invoking commands such as addition and splicing, thereby accelerating inference.
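The kernel-merging step can be demonstrated in miniature (not part of the patent text; kernel sizes here are illustrative, in the style of RepVGG-type reparameterization): a 1 × 1 kernel is zero-padded to the size of the larger kernel, the kernels are added, and the merged single convolution reproduces the sum of the two branches exactly, because convolution is linear in the kernel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# two parallel branches with different kernel sizes
conv3 = nn.Conv2d(8, 8, 3, padding=1, bias=False)
conv1 = nn.Conv2d(8, 8, 1, bias=False)

x = torch.randn(1, 8, 16, 16)
multi_branch = conv3(x) + conv1(x)          # training-time computation

# pad the 1x1 kernel to 3x3 (left/right/top/bottom) and add the kernels
merged_kernel = conv3.weight + F.pad(conv1.weight, [1, 1, 1, 1])
single_branch = F.conv2d(x, merged_kernel, padding=1)  # inference-time

print(torch.allclose(multi_branch, single_branch, atol=1e-5))  # True
```

The single merged convolution avoids holding two intermediate feature maps and the add call that combines them, which is the memory-access saving the patent claims.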
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A SAR image target detection method combined with a lightweight large convolution kernel is characterized by comprising the following steps:
S1: Data set collection and production
Collect SAR images, preprocess them to generate a SAR image data set, and divide the data set into a training set and a test set;
S2: Model training
Train the model with the SAR images in the training set to obtain a trained SAR image target detection model;
S3: Model inference
After the model is trained, input the SAR images of the test set into the SAR image target detection model for inference to obtain the detection results.
2. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 1, characterized in that: in step S1, the preprocessing includes quantization, normalization, data augmentation, and cropping of the SAR image.
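A minimal sketch of two of the preprocessing steps named in claim 2, quantization and normalization. The bit depth, the max-based scaling, and the function name are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def preprocess_sar(img, bits=8):
    """Quantize a raw SAR amplitude image to 2**bits levels, then
    normalize to [0, 1]. Assumes img has at least one positive value."""
    levels = 2 ** bits - 1
    img = np.clip(img, 0.0, img.max())          # clip dynamic range
    quantized = np.round(img / img.max() * levels)  # integer quantization
    return quantized / levels                    # rescale to [0, 1]

raw = np.array([[0.0, 0.5], [1.0, 2.0]])
out = preprocess_sar(raw)
# out lies in [0, 1] with the maximum amplitude mapped to exactly 1.0
```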
3. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 1, characterized in that: in step S2, the SAR image target detection model comprises a backbone network, a multi-scale feature interaction structure, and three detection heads, wherein the backbone network is connected to the multi-scale feature interaction structure, and the multi-scale feature interaction structure is connected to each of the three detection heads.
4. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 3, characterized in that: the backbone network comprises an initialization layer, three multi-branch large convolution kernel blocks, three conversion layers, and a spatial pyramid pooling layer; the multi-scale feature interaction structure comprises two up-sampling layers with an attention mechanism and two down-sampling layers with an attention mechanism; each detection head comprises a common convolution layer and a dimension conversion module; the input SAR image is processed by the initialization layer into a feature map of a set size, which then passes sequentially through the three groups of multi-branch large convolution kernel blocks and conversion layers; each conversion layer halves the length and width of the feature map and doubles the number of channels, so the three conversion layers produce feature maps at three sizes, denoted t1, t2 and t3; the feature map t3 is input into the spatial pyramid pooling layer to obtain high-level semantic information s3, and s3 is spliced with the feature map output by the last conversion layer to form the feature map ts3; ts3 passes through one up-sampling layer with the attention mechanism to form the feature map up2, and up2 passes through the other up-sampling layer with the attention mechanism to form the feature map up1; the feature maps up1, t2 and up2 are spliced and pass through one down-sampling layer with the attention mechanism to form the feature map d2, and d2 is spliced with ts3 and passes through the other down-sampling layer with the attention mechanism to form the feature map d3; finally, the three feature maps of different scales obtained after the multi-scale interaction are respectively input into the three detection heads, each of which applies its common convolution layer and dimension conversion module to output the detection result at that scale.
5. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 4, characterized in that: the initialization layer comprises two common convolution layers and two depthwise separable convolution layers with a group number of two, connected in the order: first common convolution layer, first depthwise separable convolution layer, second common convolution layer, second depthwise separable convolution layer.
6. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 5, characterized in that: the multi-branch large convolution kernel block comprises three common convolution layers, two GELU activation + BN layers, and a large-kernel convolution block; the first common convolution layer, the first GELU activation + BN layer, the large-kernel convolution block and the second common convolution layer are connected in sequence to form the main path; the input feature map is added directly to the output to form one branch; meanwhile, the input feature map passes through the third common convolution layer and the second GELU activation + BN layer, and the result is added to the output to form another branch.
7. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 6, characterized in that: the large-kernel convolution block comprises a depthwise separable convolution layer, a dilated convolution layer and a common convolution layer connected in sequence; the convolution kernel size of the large-kernel convolution block is larger than 5*5, and a convolution layer with a larger receptive field is realized by superposing feature maps from several different receptive fields.
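The stacking in claim 7 reaches a large receptive field without a single large dense kernel. The helper below, an illustrative assumption rather than anything from the patent, computes the effective receptive field of a stack of (possibly dilated) convolutions:

```python
def stacked_receptive_field(layers):
    """layers: list of (kernel_size, dilation, stride) tuples, applied in order.
    Returns the receptive field on the input, using the standard recurrence:
    the field grows by the dilated kernel footprint times the accumulated stride."""
    rf, jump = 1, 1
    for k, d, s in layers:
        eff_k = d * (k - 1) + 1      # footprint of a dilated kernel
        rf += (eff_k - 1) * jump     # growth contributed by this layer
        jump *= s                    # accumulated stride ("jump") on the input
    return rf

# e.g. a 5*5 depthwise convolution followed by a 3*3 dilation-3 convolution
# (both stride 1) already covers an 11 x 11 field:
rf = stacked_receptive_field([(5, 1, 1), (3, 3, 1)])
```

Under this assumed configuration the stack matches the 11 x 11 equivalence used in the re-parameterization step, while costing far fewer parameters than a dense 11*11 kernel.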
8. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 7, characterized in that: the conversion layer adopts a sub-pixel sampling strategy; the sub-pixel sampling process splits the feature map of dimension H x W x C into small blocks, each consisting of several grids, splices the blocks into several sub-feature maps according to each block's position, and then reduces the number of output channels through a common convolution layer.
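The sub-pixel sampling of claim 8 is a space-to-depth rearrangement. A minimal sketch, assuming a 2x2 grid size (the actual grid size is not specified in the claim) and omitting the final channel-reducing convolution:

```python
import numpy as np

def space_to_depth(x, r=2):
    """Split an (H, W, C) map into r*r sub-maps and stack them on the
    channel axis, giving (H/r, W/r, C*r*r) without discarding any pixel."""
    h, w, c = x.shape
    x = x.reshape(h // r, r, w // r, r, c)       # expose the r x r grids
    return x.transpose(0, 2, 1, 3, 4).reshape(h // r, w // r, c * r * r)

x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
y = space_to_depth(x)
# (4, 4, 1) -> (2, 2, 4): each output position holds its original 2x2 grid
```

Unlike strided convolution or pooling, this down-sampling is lossless: every input value survives in some channel, which is why a convolution afterwards can still reduce the channel count without first losing spatial information.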
9. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 8, characterized in that: the up-sampling layer with the attention mechanism doubles the size of the lower-level feature map by nearest-neighbour interpolation, splices it with the feature map of the same size, and finally inputs the result into the attention layer for channel-level attention weighting; the down-sampling layer with the attention mechanism first halves the length and width through a common convolution layer, then splices the result with the feature map of the same level, and finally inputs it into the attention layer for channel-level attention weighting; the attention layer realizes the weighting through a combination of channel attention and axial attention.
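A minimal sketch of the nearest-neighbour doubling step in claim 9 (the splicing and attention weighting are omitted; the function name is an illustrative assumption):

```python
import numpy as np

def upsample_nearest_2x(x):
    """Double H and W of an (H, W, C) feature map by repeating each pixel,
    i.e. nearest-neighbour (neighbourhood) interpolation."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.array([[[1.0], [2.0]],
              [[3.0], [4.0]]])          # (2, 2, 1) lower-level feature map
y = upsample_nearest_2x(x)              # (4, 4, 1)
# each input pixel becomes a 2x2 block, e.g. y[0:2, 0:2] is all 1.0
```

After this doubling, the map has the same spatial size as the feature map one level up, so the two can be spliced channel-wise before attention weighting.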
10. The SAR image target detection method combined with the lightweight large convolution kernel according to claim 1 or 9, characterized in that: in the training process of step S2, the multi-branch large convolution kernel block model is used for training; during inference in step S3, the multi-branch model in the multi-branch large convolution kernel block is converted into a single-branch model in the following way: the parameters obtained after training the multi-branch large convolution kernel block, formed by the large-kernel convolution block as the main path, the second GELU activation + BN layer as one branch and the input feature map as the other branch, are retained; because convolution kernels can be zero-padded without changing their output, the two branches are expanded into convolution kernel parameters equivalent to an 11 x 11 receptive field, so that the kernel sizes of the main path and the two branches are consistent; finally, the convolution kernels are summed in an add manner.
CN202211573253.9A 2022-12-08 2022-12-08 SAR image target detection method combined with lightweight large convolution kernel Pending CN115861841A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211573253.9A CN115861841A (en) 2022-12-08 2022-12-08 SAR image target detection method combined with lightweight large convolution kernel

Publications (1)

Publication Number Publication Date
CN115861841A true CN115861841A (en) 2023-03-28

Family

ID=85671216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211573253.9A Pending CN115861841A (en) 2022-12-08 2022-12-08 SAR image target detection method combined with lightweight large convolution kernel

Country Status (1)

Country Link
CN (1) CN115861841A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117036937A (en) * 2023-07-21 2023-11-10 山东省计算中心(国家超级计算济南中心) Blind road direction identification and flaw detection method based on Internet of things and deep learning
CN117036937B (en) * 2023-07-21 2024-01-26 山东省计算中心(国家超级计算济南中心) Blind road direction identification and flaw detection method based on Internet of things and deep learning

Similar Documents

Publication Publication Date Title
Yu et al. MSTNet: A multilevel spectral–spatial transformer network for hyperspectral image classification
CN113409191B (en) Lightweight image super-resolution method and system based on attention feedback mechanism
CN109472298A (en) Depth binary feature pyramid for the detection of small scaled target enhances network
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN111144329A (en) Light-weight rapid crowd counting method based on multiple labels
CN113344188A (en) Lightweight neural network model based on channel attention module
CN112132844A (en) Recursive non-local self-attention image segmentation method based on lightweight
CN111832453A (en) Unmanned scene real-time semantic segmentation method based on double-path deep neural network
CN113298235A (en) Neural network architecture of multi-branch depth self-attention transformation network and implementation method
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN116740527A (en) Remote sensing image change detection method combining U-shaped network and self-attention mechanism
CN113989169A (en) Expansion convolution accelerated calculation method and device
CN115861841A (en) SAR image target detection method combined with lightweight large convolution kernel
CN116740516A (en) Target detection method and system based on multi-scale fusion feature extraction
CN116090517A (en) Model training method, object detection device, and readable storage medium
Ye et al. Light-YOLOv5: A lightweight algorithm for improved YOLOv5 in PCB defect detection
CN113609904B (en) Single-target tracking algorithm based on dynamic global information modeling and twin network
CN117372777A (en) Compact shelf channel foreign matter detection method based on DER incremental learning
CN117218643A (en) Fruit identification method based on lightweight neural network
CN116468902A (en) Image processing method, device and non-volatile computer readable storage medium
CN115331261A (en) Mobile terminal real-time human body detection method and system based on YOLOv6
CN114821224A (en) Method and system for amplifying railway image style conversion data
Zhao et al. Lightweight anchor-free one-level feature indoor personnel detection method based on transformer
Li et al. Underwater object detection based on improved SSD with convolutional block attention
Hou et al. A single-stage multi-class object detection method for remote sensing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination