CN113469073A

CN113469073A - SAR image ship detection method and system based on lightweight deep learning

Info

Publication number: CN113469073A
Application number: CN202110765081.4A
Authority: CN
Inventors: 陈潇钰; 侯彪; 焦李成; 张丹; 马文萍; 马晶晶; 王爽
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2021-07-06
Filing date: 2021-07-06
Publication date: 2021-10-01
Anticipated expiration: 2041-07-06
Also published as: CN113469073B

Abstract

The invention discloses an SAR image ship detection method and system based on lightweight deep learning. The method comprises: preprocessing large-size SAR images and selecting training samples; introducing the Ghost module and Ghost Bottleneck to upgrade YOLOv5s, obtaining a preliminary lightweight YOLOv5s model; on the basis of the preliminary lightweight model, further lightening the model by using network pruning and knowledge distillation, two traditional model lightweight algorithms; accelerating inference of the lightweight YOLOv5s model with the TensorRT inference optimizer and deploying the model on an NVIDIA Jetson TX2; cutting the large-size SAR image to be detected and sending the resulting sub-images to the model in sequence to complete detection; and merging the detection results and using non-maximum suppression (NMS) to screen the prediction boxes on the final large-size SAR image. On the premise of an acceptable loss of precision, the number of parameters and the floating-point operations of the model are compressed and the detection speed is improved.

Description

SAR image ship detection method and system based on lightweight deep learning

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to an SAR image ship detection method and system based on lightweight deep learning.

Background

AlexNet appeared in 2012 and set off a wave of applications of deep convolutional neural networks in the computer field. A deeper model usually has better nonlinear expression capability, can perform more complex transformations, and can therefore fit more complex features. On this assumption, deep convolutional neural networks have developed toward greater depth and width. Although they show excellent performance in various tasks, the resulting models are large, which conflicts with the hardware conditions of current mobile and embedded devices, so that many results of deep neural network research remain theoretical and cannot be put into practice. The development of deep neural networks has not been matched by that of mobile terminal devices: such devices usually have no high-performance Graphics Processing Unit (GPU) computing cluster and rely on a Central Processing Unit (CPU) alone to complete computing tasks, so at the present stage they cannot provide the storage space and computing conditions required by the large convolutional neural networks that extract deep features with stronger expression capability. This seriously hinders the development and application of deep convolutional neural networks on portable devices. To promote the practical deployment of artificial intelligence, many researchers in academia and industry have invested in research on network model lightweight algorithms so as to improve the performance and efficiency of portable devices in image processing.

The existing methods for lightening a network can be mainly divided into two categories: model compression and compact model design. Model compression compresses a neural network model according to its structure and parameters, reducing the model's demands on storage and computing resources so that it fits the memory and computing power limits of mobile terminals. Model compression targets the redundant parts of the network structure and the network weights, and sacrifices some accuracy to obtain a model that is less redundant, faster and simpler. The algorithms proposed so far include network pruning, model quantization, binarization, low-rank decomposition, knowledge distillation, and the like. Because the degree of redundancy differs between the layers of a deep neural network, a conventional model compression algorithm is usually tailored to a specific model, and manually searching, for each model, for a compression scheme suited to the redundancy of each of its layers is time-consuming and laborious. This has promoted the development of automated machine learning (AutoML), which automatically learns and searches for locally optimal network hyper-parameters and structures, avoids manual intervention, and allows an automatic model compression algorithm to be generalized to all models. Based on AutoML, a research team from Xi'an Jiaotong University and Google proposed the automatic model compression algorithm (AMC), which introduces reinforcement learning into model compression and achieves a higher compression ratio than traditional rule-based compression strategies while maintaining network performance.

A series of compact models such as Xception, MobileNetV1, MobileNetV2, MobileNetV3, ShuffleNet and ShuffleNetV2 have also been proposed in recent years. These network models generally start from reducing the redundancy of convolution kernels, compressing the number of channels, and replacing traditional convolution with efficient convolution modules. Using small convolution kernels in the convolution layers reduces kernel redundancy and thus effectively reduces the number of network parameters. The Fire module proposed in SqueezeNet consists of a squeeze layer and an expand layer, and reduces the number of input channels of the 3 × 3 convolution kernels by reducing the number of 1 × 1 convolution kernels in the squeeze layer. MobileNetV1 uses depthwise separable convolution to decompose a conventional convolution into a depthwise convolution and a pointwise convolution; ShuffleNet further proposes the channel shuffle operation and pointwise group convolution, rearranging the features so that feature information circulates between channel groups; MobileNetV2 proposes the inverted residual block; and MobileNetV3 uses neural architecture search (NAS), introduces the SE (squeeze-and-excitation) module, and adopts the h-swish activation function to further compress the network. These excellent lightweight network models have achieved good results in model compression and acceleration with only a small loss of accuracy.

Target detection, also called target category detection or target classification detection, returns the category information and the position information of the targets of interest in an image. It has been a research hotspot in computer vision and digital image processing for the last two decades. Before AlexNet was proposed in 2012, target detection relied on traditional hand-crafted features; well-known examples are V-J detection, HOG detection, and DPM detection combined with bounding box regression. After 2012, with the rise of convolutional neural networks and the exponential increase in GPU performance, deep learning developed explosively and target detection entered the deep learning period. According to whether the algorithm needs to generate preselection boxes, deep-learning-based target detection algorithms can be divided into single-stage (one-stage) and two-stage detection algorithms. Representative single-stage networks include the YOLO series, SSD and RetinaNet; they are mainly characterized by lower detection precision and high detection speed. Typical two-stage networks are R-CNN, SPP-Net, Fast R-CNN and Faster R-CNN; unlike single-stage algorithms, they offer high detection accuracy at a high time cost. To date, even the best target detection algorithm still struggles to match detection by the human eye, and target detection still faces a number of challenges. With respect to the requirement of high accuracy, the diversity caused by the texture, color and material of objects of the same class, the diversity of target poses and deformations, differences in the imaging environment, and image noise all affect the robustness of an algorithm to intra-class variation, while inter-class distinctiveness is generally determined by the similarity between classes and the diversity of classes. With respect to the requirement of efficiency in time and memory, the richness of natural categories, the dual nature of the detection task (localization and classification), and the ever-growing volume of image data place higher demands on current target detection algorithms, and these are the problems on which researchers are concentrating their efforts.

High-resolution image target detection based on big data has always been a popular research direction in remote sensing image processing. Traditional target detection and recognition methods cannot adapt to the massive data of remote sensing images: a large number of image features must be designed manually, which incurs a great time cost and places extremely high demands on researchers' professional knowledge and understanding of the data characteristics, and searching for an efficient classifier that fully captures the data is like looking for a needle in a haystack. The powerful high-level (more abstract and semantically meaningful) feature representation and learning capability of deep learning can provide an effective framework for target extraction from images. Related research includes vehicle detection, ship detection, crop detection, and detection of ground objects such as buildings.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the defects in the prior art, a SAR image ship detection method based on lightweight deep learning, in which a lightweight target detection network is deployed on the embedded device NVIDIA Jetson TX2 to realize ship detection in large-size SAR images. The target detection network YOLOv5 is taken as the baseline network, and traditional model compression algorithms are combined with the lightweight Ghost module to lighten the baseline network.

The invention adopts the following technical scheme:

A SAR image ship detection method based on lightweight deep learning comprises the following steps:

S1, preprocessing the large-size SAR images, and selecting sub-images containing target information as training samples;

S2, introducing the Ghost module and Ghost Bottleneck to upgrade the YOLOv5s model to obtain a preliminary lightweight YOLOv5s model, and training the YOLOv5s model with the training samples selected in step S1;

S3, distilling the YOLOv5s model obtained after training in step S2, then performing sparse training and pruning, and performing fine-tuning training on the pruned YOLOv5s model;

S4, accelerating inference of the YOLOv5s model fine-tuned in step S3 by using the TensorRT inference optimizer, and deploying the model on an NVIDIA Jetson TX2;

S5, cutting the SAR image to be detected, and sending the cut sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 in step S4 for detection to obtain the corresponding sub-image detection results;

S6, stitching together the sub-image detection results obtained in step S5, screening the prediction boxes on the final large-size SAR image by non-maximum suppression (NMS), drawing the retained prediction boxes on the original large-size image according to their values, and marking the category, thereby realizing SAR image ship detection.

Specifically, step S1 includes:

S101, cutting the 5 single-channel TIF images of Img10K and the 31 single-channel TIFF images of AIR-SARShip-1.0 into blocks with an overlap ratio of 50% to obtain sub-images of the large-size remote sensing images;

S102, enlarging the 1000 8-bit JPG images of SAR-train-int;

S103, unifying the Img-10K and AIR-SARShip-1.0 sub-images obtained in step S101 and the SAR-train-int images obtained in step S102 into 8-bit single-channel TIF images to obtain a data set of 2551 pictures, of which 2351 are used as training samples and 200 as validation samples;

S104, applying the Mosaic data enhancement algorithm to the training samples of step S103, stitching every four pictures together by random scaling, random cropping and random arrangement.
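The overlapped cutting of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure of the invention: the helper names `tile_origins` and `tile_boxes` and the 1000-pixel tile size are hypothetical, since the text specifies only the 50% overlap ratio.

```python
def tile_origins(length, tile, overlap=0.5):
    """Start offsets of tiles along one axis; the last tile is made flush with the edge."""
    if tile >= length:
        return [0]
    stride = max(1, int(tile * (1 - overlap)))
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] != length - tile:
        origins.append(length - tile)  # extra tile so the border is covered
    return origins


def tile_boxes(width, height, tile, overlap=0.5):
    """(x0, y0, x1, y1) crop windows covering a width x height image."""
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


# Hypothetical 1000-pixel tiles on a 3000 x 3000 image (tile size is an assumption).
windows = tile_boxes(3000, 3000, tile=1000)
print(len(windows))   # 25
print(windows[:3])    # [(0, 0, 1000, 1000), (500, 0, 1500, 1000), (1000, 0, 2000, 1000)]
```

With a 50% overlap each ship that is split by one tile border appears whole in a neighbouring tile, at the cost of roughly four times as many sub-images as non-overlapping cutting.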

Specifically, step S2 includes:

S201, replacing the convolution modules and bottleneck modules in the backbone network of the YOLOv5s model with Ghost modules and GhostBottlenecks, respectively, thereby upgrading the YOLOv5s model;

S202, adjusting the width multiplier to 0.15 and the depth multiplier to 0.35, reducing the number of network layers to 212, to obtain the preliminary lightweight YOLOv5s model.

Specifically, step S3 includes:

S301, using YOLOv5m as the teacher model and L2 loss as the basic distillation loss function, setting the distillation balance coefficient dist in the loss to 1, and carrying out distillation training for 100 epochs;

S302, after an over-parameterized model is obtained through normal training, setting the sparsity parameter to 6e-4, applying L1 regularization to the gamma parameters of the BN layers through sparse training to generate a sparse weight matrix that serves as the criterion for evaluating the contribution of neurons, determining a threshold according to a 30% sparsity rate, and cutting off the layers below the threshold together with their dependent layers; if all channels in a layer would be removed, the channel with the largest value is kept;

S303, after the pruning of step S302 is finished, continuing to train the resulting model for 50 epochs, learning the final weights of the sparse connections through fine-tuning training.
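The sparse-training penalty and the threshold selection of step S302 can be illustrated with a small sketch. It is written under assumptions that the text does not confirm: pruning is treated per channel with one global threshold over all BN gamma values, and the names `l1_penalty` and `channel_keep_masks` as well as the example gamma values are hypothetical.

```python
def l1_penalty(gammas, sparsity_param=6e-4):
    """L1 term added to the training loss to push BN gamma values toward zero."""
    return sparsity_param * sum(abs(g) for g in gammas)


def channel_keep_masks(bn_gammas, sparsity=0.30):
    """bn_gammas maps a layer name to its list of BN gamma values (one per channel).

    Returns a dict of boolean lists, True meaning the channel is kept.
    """
    magnitudes = sorted(abs(g) for gammas in bn_gammas.values() for g in gammas)
    cut = int(len(magnitudes) * sparsity)  # number of channels below the threshold
    threshold = magnitudes[cut] if cut < len(magnitudes) else float("inf")
    masks = {}
    for name, gammas in bn_gammas.items():
        mask = [abs(g) >= threshold for g in gammas]
        if not any(mask):
            # Every channel fell below the threshold: keep the largest one
            # so the layer does not disappear from the network.
            best = max(range(len(gammas)), key=lambda i: abs(gammas[i]))
            mask[best] = True
        masks[name] = mask
    return masks


# Hypothetical gamma values after sparse training.
example = {
    "bn1": [0.9, 0.5, 0.7, 0.6],
    "bn2": [0.01, 0.02, 0.015],
    "bn3": [0.8, 0.4, 0.3],
}
print(channel_keep_masks(example)["bn2"])  # [False, True, False]
```

In the example all three channels of `bn2` fall below the 30% threshold, so only its largest channel survives, which mirrors the rule stated for layers whose channels would otherwise all be removed.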

Specifically, in step S4, deployment with the TensorRT inference optimizer includes a Build stage and a Deployment stage:

S401, in the Build stage, training the model with PyTorch to obtain a .pt file, converting the .pt file into an ONNX model, loading the ONNX model in TensorRT and converting it into an optimized TensorRT model, and then serializing the TensorRT model to disk or memory as what is called a plan file;

S402, in the Deployment stage, deploying the lightweight YOLOv5 model by deserializing the plan file obtained in step S401, creating a runtime engine, and completing the forward inference process.

Specifically, step S5 includes:

S501, sending a sub-image of the picture to be detected into the trained lightweight YOLOv5s model for detection; if the sub-image does not meet the model's requirement on picture size, adaptive picture scaling is carried out; the sub-image is sent into the feature extraction network to obtain a feature map of size S × S, which divides the input image into S × S grid cells;

S502, predicting B bounding boxes for each grid cell by logistic regression; if the center of the predicted bounding box lies in a grid cell, the B bounding boxes of that grid cell classify the target and predict its box. The prediction result of each grid cell for its B bounding boxes comprises the position information of the bounding boxes, the confidence indicating whether the grid cell contains a target, and the probability information of C classes. Each bounding box predicts five values t_x, t_y, t_w, t_h and t_o, where t_x and t_y are the offsets of the bounding box center coordinates relative to the current grid cell and are normalized by the logistic (sigmoid) activation to limit their values to 0-1, t_w and t_h are the scaling of the bounding box width and height, and t_o is the confidence;

S503, fusing the detection results of three scales by using a feature pyramid network, which upsamples and transmits strong semantic features from top to bottom, and a path aggregation network, which downsamples and transmits strong localization features from bottom to top; for a picture input size of 960 × 960, the output feature maps are 120 × 120, 60 × 60 and 30 × 30, i.e. the results of 8-fold, 16-fold and 32-fold down-sampling, respectively.
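The relation between the 960 × 960 input and the three output scales of step S503 is plain integer division by the down-sampling factor. The short sketch below reproduces those sizes and shows which grid cell contains a given pixel at each scale; the function names and the sample point are hypothetical.

```python
def grid_sizes(input_size=960, strides=(8, 16, 32)):
    """Side length S of the S x S feature map at each down-sampling factor."""
    return [input_size // s for s in strides]


def containing_cells(cx, cy, strides=(8, 16, 32)):
    """Grid cell (column, row) that contains pixel (cx, cy) at each scale."""
    return [(int(cx // s), int(cy // s)) for s in strides]


print(grid_sizes())                # [120, 60, 30]
print(containing_cells(500, 100))  # [(62, 12), (31, 6), (15, 3)]
```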

Further, in step S502, the center-point coordinates b_x, b_y and the width and height b_w, b_h of the predicted bounding box in the whole feature map are obtained from the 5 values predicted by each bounding box as follows:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

b_w = p_w · e^(t_w)

b_h = p_h · e^(t_h)

where σ is the logistic (sigmoid) activation function, c_x and c_y are the offsets of the current grid cell relative to the top-left corner of the feature map, and p_w and p_h are the width and height of the prior box.
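The decoding of the five predicted values can be sketched in a few lines. The center coordinates follow the two formulas given above; the width and height use the exponential form b_w = p_w · e^(t_w) and b_h = p_h · e^(t_h) familiar from YOLOv3-style decoders, which is treated here as an assumption about how the prior box enters. The function names are hypothetical.

```python
import math


def sigmoid(v):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t_x, t_y, t_w, t_h, t_o, c_x, c_y, p_w, p_h):
    """Turn raw predictions into a box in feature-map units plus a confidence."""
    b_x = sigmoid(t_x) + c_x       # center x: offset inside the cell plus cell column
    b_y = sigmoid(t_y) + c_y       # center y: offset inside the cell plus cell row
    b_w = p_w * math.exp(t_w)      # assumption: exponential scaling of the prior width
    b_h = p_h * math.exp(t_h)      # assumption: exponential scaling of the prior height
    return b_x, b_y, b_w, b_h, sigmoid(t_o)


# All-zero predictions place the center in the middle of cell (3, 4)
# and return the prior box unchanged with confidence 0.5.
print(decode_box(0, 0, 0, 0, 0, c_x=3, c_y=4, p_w=2.0, p_h=5.0))
# (3.5, 4.5, 2.0, 5.0, 0.5)
```

Multiplying the decoded values by the stride of the scale (8, 16 or 32) would convert them from feature-map units to pixels of the input sub-image.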

Further, the coordinate offsets and the confidence are limited to 0-1. When a real box falls in the grid cell, Pr(object) is 1; otherwise Pr(object) is 0. With Pr(class_i | object) denoting the probability that a grid cell belongs to a certain class given that it contains an object, the class confidence is expressed as

Pr(class_i | object) · Pr(object) · IOU = Pr(class_i) · IOU

where IOU is the intersection over union of the real box and the predicted box, and Pr(class_i) is the probability of the corresponding class of the target in a given cell.
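The intersection over union mentioned here, and the way it combines with the object and class probabilities, can be illustrated as follows. The product form Pr(class_i | object) · Pr(object) · IOU is the class-confidence definition of the original YOLO formulation and is used here as an assumption; the function names and the box format (x1, y1, x2, y2) are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_confidence(p_class_given_object, p_object, iou_value):
    """Assumed YOLO-style score: conditional class probability x objectness x IOU."""
    return p_class_given_object * p_object * iou_value


real_box = (0, 0, 2, 2)
predicted_box = (1, 1, 3, 3)
overlap = iou(real_box, predicted_box)   # 1 / 7: one unit of overlap over seven of union
print(round(overlap, 4))                 # 0.1429
print(class_confidence(0.9, 1.0, 0.5))   # 0.45
```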

Specifically, step S6 includes:

S601, calculating the position of each target on the large image from its position on the sub-image and the relative position of the sub-image within the large image;

S602, for each class, setting the NMS threshold to 0.65, selecting the bounding box with the highest confidence, and filtering out every other bounding box whose DIoU with the selected box exceeds the NMS threshold; after the prediction boxes are screened, the retained boxes are drawn on the image, completing ship detection on the large-size SAR image.
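Steps S601 and S602 can be sketched together: sub-image boxes are shifted by the offset of their tile, and duplicates produced by overlapping tiles are then removed with DIoU-based NMS at the 0.65 threshold. This is a minimal single-class sketch; the function names, the box format (x1, y1, x2, y2, score) and the sample detections are hypothetical.

```python
def to_global(box, tile_x0, tile_y0):
    """Shift a sub-image box (x1, y1, x2, y2, score) into large-image coordinates."""
    x1, y1, x2, y2, score = box
    return (x1 + tile_x0, y1 + tile_y0, x2 + tile_x0, y2 + tile_y0, score)


def diou(a, b):
    """Distance-IoU: IoU minus squared center distance over squared enclosing diagonal."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    d2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def diou_nms(boxes, threshold=0.65):
    """Keep the highest-scoring box, drop boxes whose DIoU with it exceeds the threshold."""
    remaining = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if diou(best, b) <= threshold]
    return kept


# The same ship seen in two overlapping tiles, plus one distant ship.
detections = [
    to_global((600, 100, 700, 160, 0.9), 0, 0),
    to_global((102, 100, 201, 160, 0.8), 500, 0),
    to_global((0, 0, 100, 50, 0.7), 2000, 2000),
]
kept = diou_nms(detections)
print([b[4] for b in kept])  # [0.9, 0.7]
```

With several classes the same routine would be run once per class, as the step describes.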

Another technical solution of the present invention is a SAR image ship detection system based on lightweight deep learning, comprising:

a data module, used for preprocessing the large-size SAR images and selecting sub-images containing target information as training samples;

a processing module, used for introducing the Ghost module and Ghost Bottleneck to upgrade the YOLOv5s model to obtain a preliminary lightweight YOLOv5s model, and training the YOLOv5s model with the training samples selected by the data module;

a fine-tuning module, used for distilling the YOLOv5s model obtained after training by the processing module, then performing sparse training and pruning, and performing fine-tuning training on the pruned YOLOv5s model;

an inference module, used for accelerating inference of the YOLOv5s model fine-tuned by the fine-tuning module by using the TensorRT inference optimizer, and deploying the model on an NVIDIA Jetson TX2;

a detection module, used for cutting the SAR image to be detected and sending the cut sub-images in sequence to the YOLOv5s model deployed on the NVIDIA Jetson TX2 by the inference module for detection to obtain the corresponding sub-image detection results;

and a removing module, used for stitching together the sub-image detection results obtained by the detection module, screening the prediction boxes on the final large-size SAR image by non-maximum suppression (NMS), drawing the retained prediction boxes on the original large-size image according to their values, and marking the category, thereby realizing ship detection in large-size SAR images.

Compared with the prior art, the invention has at least the following beneficial effects:

The SAR image ship detection method based on lightweight deep learning of the invention adopts network pruning, knowledge distillation and the Ghost algorithm. Because remote sensing images are large and scaling them directly to the network input size would cause excessive information loss, the images are cut into blocks with a certain overlap ratio, which avoids this loss and ensures that the image size matches the network input. The traditional model compression algorithms of network pruning and knowledge distillation are combined with the manually designed lightweight model Ghost to upgrade the target detection network YOLOv5, so that the number of parameters and the floating-point operations of the model are greatly reduced and the inference speed is improved.

Further, after data sets of different sizes and formats are cut into blocks with a certain overlap ratio, the sizes and formats of the pictures are unified, ensuring that the training samples input to the model are consistent in size and format.

Further, the convolution modules and bottleneck modules in the YOLOv5s model are optimized and upgraded with the lightweight model Ghost, the width multiplier is adjusted to 0.15, the depth multiplier is adjusted to 0.35, and the number of network layers is reduced to 212, which reduces the parameters and floating-point operations of the model.

Further, to compress the model further, the traditional model compression algorithms of network pruning and knowledge distillation are introduced. Knowledge distillation transfers the superior performance of the large model YOLOv5m to the lightweight YOLOv5s, improving the performance of the model to a certain extent; network pruning measures the importance of neurons and cuts off the relatively unimportant ones, further reducing the model parameters and floating-point operations.

Further, since the purpose of a lightweight model is to enable deployment of the deep learning model on an embedded device, step S4 deploys the lightweight YOLOv5s on the NVIDIA Jetson TX2 by using the TensorRT inference optimizer.

Further, step S5 sets out the detection process of the lightweight YOLOv5s on large-size SAR images, which finally yields a result image marked with the prediction boxes and category information.

Further, step S502 describes how the center-point coordinates and the width and height of the predicted bounding box in the whole feature map are calculated.

Further, the probability Pr(class_i | object) that a grid cell belongs to a certain class given that it contains an object is output by the network.

Further, step S6 restores the sub-image results of the picture to be detected to the original-size picture and filters out highly overlapping prediction boxes by NMS to obtain the final detection result.

In conclusion, the invention provides a complete model lightweight process, finally obtains a lightweight YOLOv5s model, deploys it on the embedded device NVIDIA Jetson TX2, and completes the ship detection task for large-size SAR images.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of the Ghost module;

FIG. 3 is a GhostBottleneck diagram;

FIG. 4 is a schematic diagram of a key part of a complex model after confidence degrees of complex scene detection results are concealed;

FIG. 5 is a schematic diagram of a key part of a complex model for detecting a complex scene, where confidence is not hidden in a result;

FIG. 6 is a schematic diagram of a key part of a complex model for detecting a complex scene, where confidence is not hidden in a result;

FIG. 7 is a diagram of a key part of a simple model after confidence is concealed from a result obtained by simple scene detection;

FIG. 8 is a diagram of a key part of a simple model for detecting a simple scene, where confidence is not hidden in a result.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers and their relative sizes and positional relationships shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, according to actual needs.

The invention provides an SAR image ship detection method based on lightweight deep learning that targets the embedded device NVIDIA Jetson TX2. It is a model compression method that uses traditional model compression algorithms and a manually designed lightweight model to compress and optimize a target detection network, and it can be applied to the detection of specific targets in large-size synthetic aperture radar images. On the premise of an acceptable loss of precision, the number of parameters and the floating-point operations of the model are compressed and the detection speed is improved.

Referring to FIG. 1, the SAR image ship detection method based on lightweight deep learning of the present invention includes the following steps:

S1, preprocessing the large-size SAR images and selecting sub-images containing target information as training samples, on which the lightweight YOLOv5s model of step S2 is trained for 500 epochs;

S101, cutting the 5 Img10K images (10000 × 10000 pixels, 16-bit single-channel TIF) and the 31 AIR-SARShip-1.0 images (3000 × 3000 pixels, 16-bit single-channel TIFF) into blocks with an overlap ratio of 50%; the sub-images of the large-size remote sensing images obtained by cutting are input into the network as training samples;

S102, enlarging the 1000 SAR-train-int images (800 × 800 pixels, 8-bit JPG) to 1000 × 1000;

S103, unifying the image formats of Img-10K, AIR-SARShip-1.0, AIR-SARShip-2.0 and SAR-train-int into 8-bit single-channel TIF; the final data set comprises 2551 images, of which 2351 are training samples and 200 are validation samples;

S104, using the Mosaic data enhancement algorithm to stitch four pictures together by random scaling, random cropping and random arrangement, which increases the number of small-target samples and makes the distribution of the training data more uniform.

S2, introducing the Ghost module and Ghost Bottleneck to upgrade the YOLOv5s model, completing the preliminary lightweighting of the YOLOv5s model, and training the YOLOv5s model for 500 epochs with the training samples selected in step S1;

The idea of replacing standard convolution with the Ghost module is to generate the output feature map from a small number of intrinsic feature maps by applying cheap linear transformations that produce "Ghost" maps. The method exploits the similarity between pairs of redundant feature maps and rests on the assumption that a large number of similar redundant feature maps can be obtained from a small number of intrinsic feature maps by simple linear transformations, thereby compressing the convolution parameters and operations. The Ghost module decomposes a standard convolution into two parts: the first part generates a small number of intrinsic feature maps with a small amount of standard convolution, and the second part generates a large number of "Ghost" feature maps, i.e. redundant feature maps, at extremely low cost by applying simple linear operations to the intrinsic feature maps.
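The saving promised by this two-part decomposition can be made concrete with a parameter count. The sketch below is an illustration under assumptions not stated in the text: the cheap linear operations are taken to be d × d depthwise convolutions, the ratio s of output maps to intrinsic maps is 2, biases are ignored, and the function names and channel counts are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_params(c_in, c_out, k, s=2, d=3):
    """Weights of a Ghost module producing c_out maps (c_out must be divisible by s).

    s: ratio of output maps to intrinsic maps (assumed value 2).
    d: kernel size of the cheap depthwise operations (assumed value 3).
    """
    intrinsic = c_out // s                 # maps produced by the ordinary convolution
    primary = c_in * intrinsic * k * k     # part 1: small standard convolution
    cheap = intrinsic * (s - 1) * d * d    # part 2: one d x d filter per ghost map
    return primary + cheap


standard = conv_params(64, 128, 3)
ghost = ghost_params(64, 128, 3)
print(standard, ghost)              # 73728 37440
print(round(standard / ghost, 2))   # 1.97
```

Under these assumptions the parameter count drops by a factor close to s, because the depthwise part adds only a few hundred weights.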

S201, YOLOv5s is upgraded with the Ghost module and GhostBottleneck by replacing the convolution modules and bottleneck modules in the backbone network of the YOLOv5s model with Ghost modules and GhostBottlenecks, respectively, as shown in FIG. 3.

S202, since the Ghost module greatly increases the network depth, the depth multiplier is changed to offset this increase. The width multiplier is adjusted to 0.15 and the depth multiplier to 0.35, reducing the number of network layers to 212 and yielding the preliminary lightweight YOLOv5s model.

Adjusting these two multipliers, the width multiplier and the depth multiplier, to reduce the number of network layers is referred to as preliminary lightweighting.
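How the two multipliers shrink a network can be sketched as follows. The rounding rules (channels rounded up to a multiple of 8, repeats rounded with a minimum of 1) follow the convention of the public YOLOv5 code and are an assumption here, as are the base channel counts and repeat counts used in the example; the function names are hypothetical.

```python
import math


def scale_channels(channels, width_multiple, divisor=8):
    """Scale a channel count and round it up to a multiple of `divisor`."""
    return math.ceil(channels * width_multiple / divisor) * divisor


def scale_depth(repeats, depth_multiple):
    """Scale the number of repeated blocks, never dropping below one."""
    return max(round(repeats * depth_multiple), 1) if repeats > 1 else repeats


# Hypothetical base configuration before scaling.
base_channels = [64, 128, 256, 512, 1024]
base_repeats = [3, 9, 9, 3]

print([scale_channels(c, 0.15) for c in base_channels])  # [16, 24, 40, 80, 160]
print([scale_depth(n, 0.35) for n in base_repeats])      # [1, 3, 3, 1]
```

The width multiplier therefore controls how many feature maps each layer carries, while the depth multiplier controls how many times each block is stacked, which is what reduces the layer count.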

S3, on the basis of the preliminary lightweight YOLOv5s model obtained in step S2, the model is further lightened using network pruning and knowledge distillation, two traditional model lightweighting algorithms, to obtain the final lightweight YOLOv5s model;

The preliminary lightweight YOLOv5s model obtained in step S2 is distilled; after distillation it undergoes sparsity training and pruning, and the pruned model is fine-tuned to recover precision.

S301, YOLOv5m is used as the teacher model (T-model), L2 loss is used as the distillation basis function, the distillation balance coefficient dist in the loss is set to 1, and distillation training is run for 100 epochs;
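The role of the balance coefficient in step S301 can be sketched as below. This is a simplified illustration, not the patented training code: it assumes the student and teacher outputs are flat lists of numbers of equal length, and the function names and the sample values are hypothetical.

```python
def l2_distill_term(student_out, teacher_out):
    """Mean squared difference between student and teacher outputs."""
    if len(student_out) != len(teacher_out):
        raise ValueError("student and teacher outputs must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_out, teacher_out)]
    return sum(squared) / len(squared)


def total_loss(detection_loss, student_out, teacher_out, dist=1.0):
    """Detection loss plus the L2 distillation term weighted by `dist`."""
    return detection_loss + dist * l2_distill_term(student_out, teacher_out)


# Hypothetical values for one training step.
student = [0.2, 0.8, 0.5]
teacher = [0.0, 1.0, 0.5]
print(total_loss(1.5, student, teacher, dist=1.0))
```

With `dist = 1`, the ordinary detection loss and the teacher-matching term contribute with equal weight.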

S302, after the over-parameterized model is obtained through normal training, the sparsity parameter is set to 6e-4 and sparsity training applies L1 regularization to the gamma parameters of the BN layers, generating a sparse weight matrix. The gamma values serve as the criterion for evaluating the contribution of each neuron, and a threshold is determined according to the 30% sparsity rate. Channels whose gamma falls below the threshold are cut off, together with the layers that depend on them; if all channels in a layer would be removed, the channel with the largest value is retained to preserve the network structure;

Step S301 performs distillation optimization on the preliminary lightweight model, and step S302 prunes the distilled model to obtain the pruned model.
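The channel-selection rule of step S302 can be sketched as follows. This is a minimal illustration under assumptions: the threshold is taken globally over all BN gamma values so that about 30% of channels fall below it, the handling of dependent layers is omitted, and the gamma values are made up for the example.

```python
def prune_masks(gammas_per_layer, sparsity=0.30):
    """Return (threshold, keep-masks) for per-layer lists of BN gamma values."""
    all_abs = sorted(abs(g) for layer in gammas_per_layer for g in layer)
    cut = round(len(all_abs) * sparsity)       # number of channels to drop globally
    threshold = all_abs[cut] if cut < len(all_abs) else float("inf")
    masks = []
    for layer in gammas_per_layer:
        mask = [abs(g) >= threshold for g in layer]
        if not any(mask):                      # the whole layer would be removed:
            best = max(range(len(layer)), key=lambda i: abs(layer[i]))
            mask[best] = True                  # keep its largest channel
        masks.append(mask)
    return threshold, masks


# Hypothetical gamma values for three BN layers after sparsity training.
layers = [[0.9, 0.05, 0.7, 0.6], [0.01, 0.02, 0.03], [0.5, 0.4, 0.8]]
threshold, masks = prune_masks(layers)
print(threshold)
print(masks)
```

In this example the second layer lies entirely below the threshold, so only its largest channel survives, which keeps the network connected.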

S303, after pruning is finished, to ensure that the precision of the model does not drop significantly, the pruned model obtained in step S302 is trained for a further 50 epochs, and the final weights of the sparse connections are learned through fine-tuning.

S4, inference on the YOLOv5s model obtained in step S3 is accelerated with the TensorRT inference optimizer and the model is deployed on NVIDIA Jetson TX2; deployment with the TensorRT inference optimizer comprises a Build stage and a Deployment stage;

S401, in the Build stage, the model trained with PyTorch is saved as a .pt file, the .pt file is converted into an ONNX model, the ONNX model is loaded in TensorRT and optimized into a TensorRT model, and the TensorRT model is serialized to disk or memory; the resulting file is called a plan file;

S402, in the Deployment stage, the lightweight YOLOv5 model is deployed and the forward inference process is carried out: the plan file obtained in the Build stage is first deserialized, and a runtime engine is created for inference.

S5, the large-size SAR image to be detected is cut into sub-images, which are sent in sequence to the YOLOv5s model deployed on NVIDIA Jetson TX2 in step S4 to complete detection;

As in the generation of training samples, the large-size SAR image is cut into 1000 × 1000 sub-images with an overlap ratio of 50%, and the sub-images are sent to the model in sequence for detection.
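The cutting scheme can be sketched as below. This is an illustration only: it assumes a stride of half the tile size for the 50% overlap, and it adds one extra tile flush with the border when the image size is not a multiple of the stride, which is a choice made for this example and not taken from the text.

```python
def tile_origins(length, tile=1000, overlap=0.5):
    """Top-left offsets along one axis so tiles of `tile` pixels cover `length`."""
    stride = int(tile * (1 - overlap))
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    if origins[-1] + tile < length:        # assumed border handling
        origins.append(length - tile)
    return origins


def tile_boxes(width, height, tile=1000, overlap=0.5):
    """(x1, y1, x2, y2) of every sub-image, row by row."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]


boxes = tile_boxes(10000, 10000)
print(len(boxes), boxes[0], boxes[-1])
```

For a 10000 × 10000 image this gives 19 offsets per axis and therefore 361 sub-images.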

S501, a sub-image of the image to be detected is sent to the trained lightweight YOLOv5s model for detection; if the sub-image does not meet the model's input size requirement, adaptive image scaling is applied. The sub-image is passed through the feature extraction network to obtain a feature map of size S × S, which divides the input image into S × S grid cells;

S502, B bounding boxes are predicted for each grid cell using logistic regression; if the center of a predicted bounding box lies within a grid cell, the B bounding boxes of that grid cell classify the target and predict its box, giving the prediction result of each grid cell for its B bounding boxes;

The output consists of the position information of the bounding boxes, a confidence indicating whether the grid cell contains a target, and probability information for C categories. Each bounding box predicts 5 values: t_x, t_y, t_w, t_h, t_o. t_x and t_y are the offsets of the bounding box center coordinates relative to the current grid cell. To ensure that the center of the bounding box stays within the current grid cell, a logistic activation is applied to t_x and t_y to normalize them, limiting their values to the range 0-1 and making model training more stable. t_w and t_h are the scaling factors of the bounding box width and height, and t_o is the confidence. The calculation mentioned in RCNN is used for t_w, and t_h is calculated in the same way.

From the 5 values predicted for each bounding box, the center point coordinates b_x, b_y and the width and height b_w, b_h of the predicted bounding box in the whole feature map can be calculated according to the following formulas.

b_x = σ(t_x) + c_x (1)

b_y = σ(t_y) + c_y (2)

Here, c_x and c_y are the offsets of the current grid cell from the upper left corner of the feature map, and p_w and p_h are the width and height of the prior box. σ is the logistic activation function, which limits the coordinate offsets and the confidence to the range 0-1. When a ground-truth box falls within the grid cell, Pr(object) = 1; otherwise Pr(object) = 0.
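The decoding step can be sketched as follows. Formulas (1) and (2) are taken from the text. The width and height formulas are not reproduced in the text, so the sketch assumes the common YOLOv3-style form, in which the prior box is scaled by exp(t_w) and exp(t_h); that form and the sample values are assumptions.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Turn raw predictions into a box on the feature map.

    t:     (t_x, t_y, t_w, t_h, t_o) raw network outputs
    cell:  (c_x, c_y) offset of the grid cell from the top-left corner
    prior: (p_w, p_h) width and height of the prior box
    """
    t_x, t_y, t_w, t_h, t_o = t
    c_x, c_y = cell
    p_w, p_h = prior
    b_x = sigmoid(t_x) + c_x       # formula (1)
    b_y = sigmoid(t_y) + c_y       # formula (2)
    b_w = p_w * math.exp(t_w)      # assumed form, not given in the text
    b_h = p_h * math.exp(t_h)      # assumed form, not given in the text
    return b_x, b_y, b_w, b_h, sigmoid(t_o)


# All-zero predictions place the center in the middle of cell (3, 4)
# and leave the prior box size unchanged.
print(decode_box((0.0, 0.0, 0.0, 0.0, 0.0), (3, 4), (2.0, 5.0)))
```

Because the sigmoid output stays between 0 and 1, the decoded center cannot leave the grid cell that predicted it.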

The probability Pr(class_i | object) that a grid cell belongs to a certain class, given that it contains an object, is expressed as:

wherein,

S503, the feature pyramid network (FPN) conveys strong semantic features from top to bottom through upsampling, the path aggregation network (PAN) conveys strong localization features from bottom to top through downsampling, and the detection results at the three scales are fused.

For an input image size of 960 × 960, the output feature maps are 120 × 120, 60 × 60 and 30 × 30, the results of 8×, 16× and 32× downsampling, respectively.
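The relation between input size and the three output grids is a simple division, sketched here. The function name is hypothetical.

```python
def grid_sizes(input_size, strides=(8, 16, 32)):
    """Side length S of the S x S feature map at each downsampling factor."""
    return [input_size // stride for stride in strides]


sizes = grid_sizes(960)
print(sizes)                          # [120, 60, 30]
print(sum(s * s for s in sizes))      # total number of grid cells over the three scales
```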

S6, the sub-image detection results obtained in step S5 are stitched together, NMS (non-maximum suppression) is used to screen the prediction boxes on the final large-size SAR image, and the retained prediction boxes are drawn on the original large-size image with their categories marked, completing target detection on the large-size SAR image.

S601, stitching is the reverse of the cutting process: the position of a target on the large image is calculated from its position on the sub-image and the relative position of the sub-image on the large image;

S602, for each class, the NMS threshold is set to 0.65 and the bounding box with the highest confidence is selected; every other bounding box whose DIoU value with it exceeds the NMS threshold is filtered out, removing highly overlapping boxes. After NMS has screened the prediction boxes, the retained boxes are drawn, completing ship detection on the large-size SAR image.
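Steps S601 and S602 can be sketched together as below. This is a minimal single-class illustration under assumptions: boxes are (x1, y1, x2, y2) tuples, DIoU is computed in its usual form (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box), and the detections are made-up values.

```python
def to_global(box, origin):
    """Shift a box from sub-image coordinates to large-image coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = origin
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


def diou(a, b):
    """Distance-IoU of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union if union > 0 else 0.0
    # squared distance between the two box centers
    d2 = ((a[0] + a[2] - b[0] - b[2]) / 2) ** 2 + ((a[1] + a[3] - b[1] - b[3]) / 2) ** 2
    # squared diagonal of the smallest box enclosing both
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou - d2 / c2 if c2 > 0 else iou


def diou_nms(detections, threshold=0.65):
    """Keep the highest-scoring box, drop boxes whose DIoU with it exceeds the threshold."""
    remaining = sorted(detections, key=lambda det: det[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [det for det in remaining if diou(best[0], det[0]) <= threshold]
    return kept


# Hypothetical detections: (box on sub-image, score, sub-image origin).
raw = [((600, 600, 700, 650), 0.9, (0, 0)),
       ((102, 101, 201, 150), 0.8, (500, 500)),   # same ship seen in an overlapping tile
       ((400, 400, 450, 430), 0.7, (500, 500))]
stitched = [(to_global(box, origin), score) for box, score, origin in raw]
kept = diou_nms(stitched)
print(kept)
```

The overlap between neighboring sub-images means the same ship can be detected twice; after stitching, the duplicate has a high DIoU with the stronger detection and is removed.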

In another embodiment of the present invention, a SAR image ship detection system based on lightweight deep learning is provided, which can be used to implement the above SAR image ship detection method based on lightweight deep learning, and specifically, the SAR image ship detection system based on lightweight deep learning includes a data module, a processing module, a fine-tuning module, an inference module, a detection module, and a removal module.

The removal module is used to stitch the sub-image detection results obtained by the detection module, screen the prediction boxes on the final large-size SAR image using NMS (non-maximum suppression), and draw the retained prediction boxes on the original large-size image, realizing SAR image ship detection.

In yet another embodiment of the present invention, a terminal device is provided that includes a processor and a memory, the memory being used to store a computer program comprising program instructions, and the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. It is the computing core and control core of the terminal and is adapted to load and execute one or more instructions to implement a corresponding method flow or function. The processor provided by the embodiment of the invention can be used to run the SAR image ship detection method based on lightweight deep learning, which comprises the following steps:

preprocessing a large-size SAR image and selecting sub-images containing target information as training samples; introducing the Ghost module and the Ghost Bottleneck to upgrade the YOLOv5s model into a preliminary lightweight YOLOv5s model, and training the YOLOv5s model with the training samples; distilling the trained YOLOv5s model, then performing sparsity training and pruning, and fine-tuning the pruned YOLOv5s model; accelerating inference of the fine-tuned YOLOv5s model with the TensorRT inference optimizer and deploying the model on NVIDIA Jetson TX2; cutting the SAR image to be detected and sending the sub-images in sequence to the YOLOv5s model deployed on NVIDIA Jetson TX2 for detection, obtaining the corresponding sub-image detection results; and stitching the sub-image detection results, screening the prediction boxes on the final large-size SAR image using NMS (non-maximum suppression), and drawing the retained prediction boxes on the original large-size image, realizing SAR image ship detection.

In still another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a terminal device and is used for storing programs and data. It is understood that the computer readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor can load and execute one or more instructions stored in the computer readable storage medium to realize the corresponding steps of the SAR image ship detection method based on the lightweight deep learning in the embodiment; one or more instructions in the computer-readable storage medium are loaded by the processor and perform the steps of:

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The effects of the present invention can be further illustrated by the following experiments:

1. Experimental environment

The simulation environment of the training host is as follows: Ubuntu 18.04, Intel(R) Xeon(R) Gold 5118 CPU, GeForce RTX 2080 Ti GPUs, Python 3.8.5, CUDA 10.0.130 and cuDNN 7.0.

Jetson TX2 inference environment: Ubuntu 18.04, CPU HMP Dual Denver 2/2MB L2 + Quad A57/2MB L2, GPU NVIDIA Pascal with 256 CUDA cores, Python 3.8.5, CUDA 10.2.89, cuDNN 8.0.0.180, TensorRT version 7.1.3.0, JetPack version 4.4.1.

2. Content of the experiment

(1) With YOLOv5s as the baseline model, the effectiveness of the Ghost module for model lightweighting is verified on the GPU and on the CPU of the training host, recording the number of model parameters, floating-point operations, average precision AP50 and AP50:95, precision P, recall R, the inference time for processing one 1000 × 1000 image, and the total processing time. This demonstrates the performance of the Ghost module in compressing parameters and floating-point operations, as shown in fig. 2.

(2) The preliminary lightweight model is distilled and pruned, the model is tested on NVIDIA Jetson TX2, and the inference time and total processing time for one 1000 × 1000 image are recorded, verifying the model's inference-time acceleration.

3. Simulation experiment results

Experimental results show that the Ghost module substantially compresses the parameter count and floating-point operations of the baseline model YOLOv5s. The distillation and pruning strategies further compress the model and greatly increase inference speed.

TABLE 1 Performance comparison of Mobile and Ghost modules in YOLOv5

Model	All	L	Weights(M)	FLOPs(G)	AP50	AP50:95	P	R	Infer(ms)	Total(ms)
YOLOv5s	\	224	6.72	16.3	61.0%	33.5%	84.5%	56.6%	8	54
YOLOv5m	\	308	20.06	50.3	64.3%	33.5%	77.9%	60.1%	13	61
GhostYOLOv5		326	4.44	9.7	62.8%	33.7%	78.5%	59%	11	60
GhostYOLOv5	√	362	2.69	6.5	60.5%	32.9%	80.3%	58.4%	12	60.9

The performance comparison of the models is shown in Table 1. All experiments in the table were run on a laboratory server. Images are resized to 960 × 960 to fit the network input. "All" indicates whether all convolution modules and bottleneck blocks in the network are replaced, and L is the number of network layers. Ghost clearly compresses the parameter count and floating-point operations of the model: it reduces the parameter count of YOLOv5s from 6.72M to 4.44M and the floating-point operations from 16.3G to 9.7G while maintaining precision, and the AP50 and AP50:95 of Ghost YOLOv5s are both slightly higher than those of YOLOv5s. However, inference speed on the GPU does not improve in line with the parameter count and floating-point operations: both the inference time and the total time are larger than those of YOLOv5s. Considering that the computational bottleneck of the GPU is memory access bandwidth, only the backbone network is replaced in order to reduce the number of network layers and improve inference speed. The model inference speed is then tested on the CPU; the results are shown in Table 2.

TABLE 2 Comparison of model CPU inference time

Model	Weights(M)	FLOPs(G)	AP50	AP50:95	P	R	Infer(ms)	Total(ms)
YOLOv5s	6.72	16.3	61.0%	33.5%	84.5%	56.6%	510	546
GhostYOLOv5	4.44	9.7	62.8%	33.7%	78.5%	59%	440	499

Inference with Ghost YOLOv5s on the CPU is significantly faster than with YOLOv5s, and the total processing time is also shorter, which shows that Ghost is effective for lightweighting the network model and demonstrates its performance in network compression.

TABLE 3 Impact of the depth multiplier and width multiplier on inference time

Model	Depth	Width	Weights(M)	FLOPs(G)	AP50	AP50:95	Infer(ms)	Total(ms)
GhostYOLOv5	0.33	0.50	4.44	9.7	62.8%	33.7%	11	60
GhostYOLOv5	0.15	0.35	2.22	5.1	63.0%	34.8%	12	60.9

YOLOv5 implements four models of different sizes by adjusting the width multiplier (width_multiple) and the depth multiplier (depth_multiple). Because the Ghost module brings a large increase in network depth, the depth multiplier is changed to offset that increase. The width multiplier was adjusted to 0.15 and the depth multiplier was adjusted to 0.35, reducing the number of network layers to 212. The inference speed and total processing speed tested on the server GPU both improve to a certain extent, while the precision of the model is not reduced. This shows that changing the width multiplier and the depth multiplier is an effective way to control the width and depth of the model.

TABLE 4 TensorRT inference performance comparison

Model	Pruning	Distillation	Weights(M)	FLOPs(G)	AP50	AP50:95	Infer(ms)	Total(ms)
YOLOv5s			6.72	16.3	61.1%	33.5%	70.38	121.82
GhostYOLOv5			2.22	5.1	63.0%	34.8%	61	109.77
GhostYOLOv5	√		1.62	3.0	61.6%	32.2%	40.5	90.7
GhostYOLOv5		√	2.22	5.1	63.0%	32.3%	59.4	108.66
GhostYOLOv5	√	√	0.89	1.8	57.3%	27.7%	30.2	84.5

After the Ghost-lightweighted YOLOv5s undergoes distillation, pruning and fine-tuning, the parameter count and floating-point operations of the model are greatly reduced. Some loss of accuracy is inevitable, but it is within an acceptable range. The inference time for one 1000 × 1000 image on TX2 is only 30.2 ms, and the total time including image reading and post-processing is only 84.5 ms, which demonstrates the advantage of GhostYOLOv5. Analysis of Table 4 shows that distillation does little to improve the accuracy of the model, but a model pruned with the same pruning strategy after distillation can be made more lightweight. A possible explanation is that distillation concentrates the weight distribution of the model, making important weights more important and unimportant weights less important, so that a sparser matrix can be obtained in sparsity training.
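The size of these reductions can be worked out directly from the first and last rows of Table 4. The sketch below only divides the reported figures; the dictionary keys are hypothetical names.

```python
# Values copied from Table 4: baseline YOLOv5s and the distilled, pruned GhostYOLOv5.
baseline = {"weights_M": 6.72, "flops_G": 16.3, "infer_ms": 70.38, "total_ms": 121.82}
final = {"weights_M": 0.89, "flops_G": 1.8, "infer_ms": 30.2, "total_ms": 84.5}


def reduction_factors(before, after):
    """How many times smaller each quantity is after lightweighting."""
    return {key: before[key] / after[key] for key in before}


factors = reduction_factors(baseline, final)
for key, value in factors.items():
    print(f"{key}: {value:.2f}x")
```

The parameter count and floating-point operations shrink by a larger factor than the inference time, and the total time shrinks least because it also includes image reading and post-processing.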

Fig. 4 shows a key part of the detection result on a 10000 × 10000 test image. The confidence information is hidden so as not to occlude small targets. The two models finally obtained in the experiment are a complex model that is only pruned and a simple model that is distilled and then pruned. The two models differ greatly in parameter count and floating-point operations, and they target images of different complexity. For the complex model, the single-image inference time of GhostYOLOv5s on TX2 is 40 ms, and the total processing time per image, including model loading, image reading and post-processing, is about 92 ms. For the simple model, the single-image inference time of GhostYOLOv5s on TX2 is 30.2 ms, and the total processing time per image, including model loading, image reading and post-processing, is about 84.5 ms.

TABLE 5 Comparison of detection performance of different models in complex scenes

Model	Weights(M)	FLOPs(G)	AP50	AP50:95	Miss	Fake	F1	Infer(s)	Total(s)
Complex	1.62	3.0	69%	42.3%	35.6%	13.5%	0.738	14.44	32.2
Simple	0.89	1.8	56.1%	27.3%	49.5%	22.9%	0.61	10.9	30.5

TABLE 6 Comparison of detection performance of different models in simple scenes

Model	Weights(M)	FLOPs(G)	AP50	AP50:95	Miss	Fake	F1	Infer(s)	Total(s)
Complex	1.62	3.0	81.6%	50%	27.9%	13%	0.789	14.44	32.2
Simple	0.89	1.8	55.1%	24.6%	46.3%	36.7%	0.581	10.9	30.5
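The F1 values in Tables 5 and 6 are consistent with reading Miss as 1 - recall and Fake as 1 - precision. That reading is an assumption made for this sketch, since the text does not define the two columns; the check below recomputes F1 from the reported rates.

```python
def f1_from_rates(miss_rate, false_alarm_rate):
    """F1 score, assuming recall = 1 - miss_rate and precision = 1 - false_alarm_rate."""
    recall = 1.0 - miss_rate
    precision = 1.0 - false_alarm_rate
    return 2 * precision * recall / (precision + recall)


# (Miss, Fake) pairs copied from Tables 5 and 6.
rows = {
    "complex scene, complex model": (0.356, 0.135),
    "complex scene, simple model": (0.495, 0.229),
    "simple scene, complex model": (0.279, 0.130),
    "simple scene, simple model": (0.463, 0.367),
}
for name, (miss, fake) in rows.items():
    print(f"{name}: F1 = {f1_from_rates(miss, fake):.3f}")
```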

Figs. 5 and 6 show detection results of the complex model with the confidence not hidden. The images show river channels, where detection near the shore is highly complex, yet the detection confidence remains at a high level.

The complex model mainly targets complex scenes, and the simple model mainly targets simple scenes. Figs. 7 and 8 show the detection effect of the simple model on simple scenes. The simple model performs well in open-sea areas. Therefore, selecting the appropriate model for images of different complexity optimizes the inference speed.

In conclusion, the SAR image ship detection method based on lightweight deep learning deploys the obtained lightweight YOLOv5s model on the embedded device NVIDIA Jetson TX2, completes ship detection tasks on large-size SAR images, and can effectively detect ships in both simple and complex scenes of SAR images.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A SAR image ship detection method based on lightweight deep learning is characterized by comprising the following steps:

2. The method according to claim 1, wherein step S1 is specifically:

s102, amplifying 1000 8-bit JPG images SAR-train-int;

3. The method according to claim 1, wherein step S2 is specifically:

4. The method according to claim 1, wherein step S3 is specifically:

5. The method according to claim 1, wherein in step S4, the deployment of the TensorRT inference optimizer includes a Build phase and a Deployment phase, specifically:

6. The method according to claim 1, wherein step S5 is specifically:

7. The method according to claim 6, wherein in step S502, the center point coordinates b_x, b_y and the width and height b_w, b_h of each predicted bounding box in the whole feature map are obtained from the 5 predicted values of each bounding box as follows:

b_x = σ(t_x) + c_x

b_y = σ(t_y) + c_y

8. The method of claim 7, wherein the coordinate offset and the confidence are limited to the range 0-1; when a ground-truth box is in the grid cell, Pr(object) is 1, otherwise Pr(object) is 0; and the probability Pr(class_i | object) that the grid cell belongs to a certain class, given that it contains an object, is expressed as

wherein,

9. The method according to claim 1, wherein step S6 is specifically:

10. A SAR image ship detection system based on lightweight deep learning is characterized by comprising: