CN116645608A - Remote sensing target detection based on Yolox-Tiny biased feature fusion network - Google Patents

Remote sensing target detection based on Yolox-Tiny biased feature fusion network

Info

Publication number
CN116645608A
Authority
CN
China
Prior art keywords
frame
feature fusion
fusion network
detection
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310622397.7A
Other languages
Chinese (zh)
Inventor
胡昭华
李昱辉
王长富
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202310622397.7A priority Critical patent/CN116645608A/en
Publication of CN116645608A publication Critical patent/CN116645608A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection method based on a YOLOX-Tiny biased feature fusion network, which comprises the following steps: S1, dividing the remote sensing data set DIOR into a test set and a training set in a certain proportion; S2, resizing all pictures in the training set and the test set to a uniform size; S3, introducing a multi-scale feature fusion network and deformable convolution on the basis of YOLOX-Tiny to build a biased feature fusion network, feeding the training set into the biased feature fusion network, and training with the SIoU loss function; S4, inputting the test set into the biased feature fusion network for performance testing. The method improves the model's ability to predict targets with large deformation, resolves the direction mismatch between the ground-truth box and the prediction box during training, accelerates model convergence, and further improves the performance of the detection model.

Description

Remote sensing target detection based on Yolox-Tiny biased feature fusion network
Technical Field
The invention relates to the field of computer vision and target detection, in particular to remote sensing target detection based on a YOLOX-Tiny biased feature fusion network.
Background
Target detection is a fundamental and challenging task in computer vision and has been one of its most active research topics over the last decades. The task is defined as follows: the computer automatically identifies the categories of all objects in a picture or video frame and draws a bounding box around each object to mark its location. Target detection is widely used in environment monitoring, video surveillance, traffic management, remote sensing imagery and other fields, and has therefore drawn attention from both academia and industry.
In recent years, the continuous development of remote sensing optical technology has greatly increased both the quantity and the quality of remote sensing images. Using remote sensing images for tasks such as target detection, image segmentation and image classification brings great convenience to fields such as environment monitoring, traffic management and power line inspection. Remote sensing image target detection is characterized by a high proportion of small targets, complex backgrounds, high inter-class similarity together with intra-class diversity, and large variations of target scale across images, so the detection accuracy of current target detection algorithms applied to remote sensing images is poor.
With the rapid development of deep learning, target detection algorithms based on convolutional neural networks (Convolutional Neural Network, CNN) have gradually been applied to remote sensing image target detection. These algorithms can be divided into two-stage and single-stage detectors. Two-stage detectors mainly include R-CNN (Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation [C]// Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2014: 580-587) and Fast R-CNN (Girshick R. Fast R-CNN [C]// Proceedings of 2015 IEEE International Conference on Computer Vision. Santiago: IEEE, 2015: 1440-1448); they first generate candidate regions with a region proposal network and then classify those regions, achieving higher detection accuracy but slower detection speed. Single-stage detectors mainly include YOLO (Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection [C]// Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016: 779-788) and SSD (Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector [C]// Proceedings of European Conference on Computer Vision. Cham: Springer, 2016: 21-37); they do not generate candidate regions but directly regress the position and class of the target, giving faster detection speed at lower detection accuracy. Such general-purpose detectors are difficult to apply directly to remote sensing images and require targeted design and optimization. Zhang et al., based on Faster R-CNN, up-sample each candidate region extracted in the first stage by deconvolution to enlarge the feature map and improve small target detection in remote sensing images (Zhang W, Wang S H, Thachan S, et al. Deconv R-CNN for small object detection on remote sensing images [C]// Proceedings of the 2018 IEEE International Geoscience and Remote Sensing Symposium. Valencia, Spain: IEEE, 2018: 2483-2486). Yang et al. increase the detection accuracy of small target ships by increasing the number and scale of shallow feature pyramid levels and enhance small-target feature expression with densely connected structures (Yang X, Sun H, Sun X, et al. Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network [J]. IEEE Access, 2018, 6: 50839-50849. DOI: 10.1109/ACCESS.2018.2869884). Zhu et al., based on YOLOv5, replace the original prediction head with a self-attention Transformer module, improving dense object detection in remote sensing imagery (Zhu X K, Wang X, Zhao Q, et al. TPH-YOLOv5: Improved YOLOv5 based on Transformer prediction head for object detection on drone-captured scenarios [C]// Proceedings of the 2021 IEEE Conference on Computer Vision Workshops. Montreal, BC, 2022: 2778-2788). These research methods improve the algorithms in a targeted manner, but the resulting algorithms are not very robust.
Disclosure of Invention
The invention aims to: provide remote sensing target detection based on a YOLOX-Tiny biased feature fusion network that improves robustness in remote sensing image target detection while retaining real-time detection.
The technical scheme is as follows: the remote sensing target detection method comprises the following steps:
S1, dividing the remote sensing data set DIOR into a test set and a training set in a certain proportion;
S2, resizing all pictures in the training set and the test set to a uniform size;
S3, introducing a multi-scale feature fusion network and deformable convolution on the basis of YOLOX-Tiny to build a biased feature fusion network, feeding the training set into the biased feature fusion network, and training with the SIoU loss function;
S4, inputting the test set into the biased feature fusion network for performance testing.
Further, in step S3, in the multi-scale feature fusion network structure, an extra edge is added between the P3 input node and the output node, and the convolution structures with a convolution kernel size of 3 adopt a depthwise separable convolution structure.
Further, in step S3, the CBS structure in the prediction end of the original YOLOX-Tiny model is replaced by a deformable convolution. The deformable convolution applies a parallel conventional convolution for feature extraction to obtain a new feature map with the same height and width as the input feature map and 2N channels, where N is the number of sampling points of the convolution kernel; the new feature map contains the x-axis and y-axis offsets ΔP_n.
The feature value at each shifted coordinate is computed from the feature values of the original input feature map by bilinear interpolation, completing the adaptive sampling of spatial features. The deformable convolution output feature map Y is expressed as:

$$Y(P_0)=\sum_{P_n\in R} w(P_n)\cdot X(P_0+P_n+\Delta P_n)$$

where w is the convolution kernel weight, X is the input feature map, P_n is a sampling position of the convolution kernel, P_0 is an arbitrary position, and R is the set of convolution kernel sampling positions.
Further, in step S3, the SIoU loss function consists of a shape loss, an IoU loss, a distance loss, and the angle loss contained in the distance loss;
the angle loss is defined as follows:

$$\Lambda = 1-2\sin^{2}\!\left(\arcsin\frac{c_h}{\sigma}-\frac{\pi}{4}\right)$$

where c_h is the height difference between the center points of the ground-truth box and the prediction box, expressed as:

$$c_h=\max\!\left(b^{gt}_{cy},\,b_{cy}\right)-\min\!\left(b^{gt}_{cy},\,b_{cy}\right)$$

and σ is the distance between the center points of the ground-truth box and the prediction box, expressed as:

$$\sigma=\sqrt{\left(b^{gt}_{cx}-b_{cx}\right)^{2}+\left(b^{gt}_{cy}-b_{cy}\right)^{2}}$$

where (b^{gt}_{cx}, b^{gt}_{cy}) are the center coordinates of the ground-truth box and (b_{cx}, b_{cy}) are the center coordinates of the prediction box;
when the angle α between the prediction box and the ground-truth box is 0° or 90°, the angle loss is 0;
during training, if the angle α between the prediction box B and the ground-truth box B^{GT} is smaller than 45°, the prediction box is moved in the direction that reduces α; otherwise it is moved in the direction that reduces β, where β + α = 90°;
the distance loss is defined as follows:

$$\Delta=\sum_{t=x,y}\left(1-e^{-\gamma\rho_{t}}\right),\qquad \rho_{x}=\left(\frac{b^{gt}_{cx}-b_{cx}}{c_{w}}\right)^{2},\quad \rho_{y}=\left(\frac{b^{gt}_{cy}-b_{cy}}{c_{h}}\right)^{2},\quad \gamma=2-\Lambda$$

where c_w and c_h are the width and height of the minimum enclosing box of the prediction box B and the ground-truth box B^{GT};
the shape loss is defined as follows:

$$\Omega=\sum_{t=w,h}\left(1-e^{-\omega_{t}}\right)^{\theta},\qquad \omega_{w}=\frac{|w-w^{gt}|}{\max\left(w,w^{gt}\right)},\quad \omega_{h}=\frac{|h-h^{gt}|}{\max\left(h,h^{gt}\right)}$$

where (w, h) are the width and height of the prediction box, (w^{gt}, h^{gt}) are the width and height of the ground-truth box, and θ is the shape weight;
the final SIoU bounding-box regression loss function is defined as follows:

$$L_{box}=1-IoU+\frac{\Delta+\Omega}{2}$$

where IoU = S_1/S_2, S_1 is the area of the intersection of the prediction box B and the ground-truth box B^{GT}, and S_2 is the area of their union.
Further, in step S4, the detection performance of the biased feature fusion network on the test set is measured by detection precision, detection speed and model complexity, wherein the detection precision comprises the precision P, the recall R, the average precision AP and the mean average precision mAP, and the detection speed index is the number of image frames detected per second (FPS).
Further, the detection precision comprises the precision P, the recall R, the average precision AP and the mean average precision mAP, calculated as follows:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad AP=\int_{0}^{1}P(R)\,dR,\qquad mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}$$

where TP is the number of positive samples correctly predicted by the model; FP is the number of samples incorrectly predicted as positive; FN is the number of positive samples predicted as negative; N is the number of target categories; and AP_i is the average precision of the i-th category.
Further, the detection speed index is the number of image frames detected per second (FPS), calculated as follows:

$$FPS=\frac{N}{T}$$

where N is the number of pictures and T is the detection time;
the model complexity is measured by the parameter count Params, calculated as follows:

$$Params=C_{o}\times\left(k_{w}\times k_{h}\times C_{i}+1\right)$$

where C_o is the number of output channels, C_i is the number of input channels, and k_w and k_h are the width and height of the convolution kernel.
Compared with the prior art, the invention has the following remarkable effects:
1. The multi-scale feature fusion network is improved into a multi-scale biased texture feature pyramid network (Multi Biased Texture Feature Pyramid Networks, MBTFPN). The structure adds fusion of the shallow P2_in branch so that fusion nodes can propagate strong shallow localization information and edge features, and it adds an extra edge between the P3_in, P4_in, P5_in input nodes and the P3_out, P4_out, P5_out output nodes, preventing the loss of feature information at the prediction end. This structure effectively improves the model's detection of targets at different scales, especially small targets;
2. The ordinary convolution at the prediction end is replaced with deformable convolution, so that the sampling points of the conventional convolution kernel are no longer confined to a rectangular region but can cover an arbitrary region, which effectively handles shape changes of the target and improves the model's ability to predict targets with large deformation;
3. The loss function is optimized to SIoU, which moves the bounding box in the correct direction during training, resolves the direction mismatch between the ground-truth box and the prediction box, accelerates model convergence, and yields a further performance improvement of the detection model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a network structure diagram of a remote sensing target detection algorithm according to the present invention;
FIG. 3 is a block diagram of MBTFPN of the invention;
FIG. 4 is a schematic diagram of a deformable convolution of the present invention;
FIG. 5 (a) is a schematic diagram of the calculation of the angle loss;
FIG. 5 (b) is a schematic diagram of the calculation of the distance loss;
FIG. 6 is a radar chart of mAP50 on the DIOR data set for the present invention and the original target detection algorithm.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
To balance the detection accuracy and detection speed of the algorithm on remote sensing images, the invention provides a lightweight remote sensing target detection algorithm based on YOLOX-Tiny. The algorithm first improves the path aggregation network PANet into MBTFPN, raising the model's ability to detect small targets in remote sensing images; second, part of the convolutions in MBTFPN are replaced with depthwise separable convolutions to reduce the number of model parameters and speed up detection; then, deformable convolution is introduced at the prediction end of the model to improve the capture of features of targets with different scales or deformations in remote sensing images; finally, the loss function is optimized to further improve detection performance. While maintaining highly real-time detection, the proposed detection model achieves good performance.
By improving the multi-scale feature fusion network and the prediction end, the invention achieves accurate detection of remote sensing targets of different scales and can be applied to fields such as environment monitoring, traffic management and power line inspection. As shown in FIG. 1, the detailed implementation steps of the invention are as follows:
step 1, firstly, a remote sensing data set DIOR is divided into a test set and a training set according to a ratio of 5:5. The dataset DIOR is a large-scale, public remote sensing target detection dataset that collectively contains 23463 images, 192472 example samples, each 800 pixels by 800 pixels in size, for a total of 20 categories. The data set has targets with rich scale changes, higher similarity among classes, higher diversity in classes and images with larger imaging difference, and can represent most application scenes in remote sensing target detection tasks. In order to ensure that the distribution of the training verification data and the test data are similar, 5862 remote sensing images are adopted in the training set, 5863 remote sensing images are adopted in the verification set, and the remaining 11738 remote sensing images are adopted in the test set, so that the detection result has better explanatory power and contrast.
Step 2: all pictures in the training set and the test set are resized to a uniform 640×640, making the comparison fairer. A minimal preprocessing sketch is given below.
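By way of illustration only, the following sketch shows such uniform resizing; it assumes plain bilinear resizing (the embodiment does not state whether letterbox padding is used), and the file path is hypothetical.

```python
from PIL import Image

def resize_to_uniform(path: str, size: int = 640) -> Image.Image:
    """Resize a picture to the uniform 640x640 input size used in the experiments."""
    img = Image.open(path).convert("RGB")            # DIOR images are 800x800 RGB
    return img.resize((size, size), Image.BILINEAR)  # plain bilinear resize (assumption)

# usage with a hypothetical path
# resized = resize_to_uniform("DIOR/JPEGImages/00001.jpg")
```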
And 3, introducing MBTFPN and deformable convolution on the basis of YOLOX-Tiny, and sending the training set into a network for training. The specific network structure diagram of the improved YOLOX-Tiny remote sensing target detection algorithm is shown in fig. 2. The step 3 specifically comprises the following steps:
step 31, improving the multi-scale feature fusion network structure MBTFPN. The low-level features extracted from the shallow network near the input end in the backbone network have high feature map resolution and contain more textures, details and spatial information; and the advanced features extracted by the deep network far from the input end have stronger semantic information. The multi-scale feature fusion network can effectively fuse semantic information and detail information extracted by the backbone network, and the detection capability of targets with different sizes is remarkably improved. The original YOLOX-Tiny model multi-scale feature fusion network adopts a PANet structure, and the structure can not well solve the problems of large scale span, rich small target duty ratio and the like of a remote sensing image target. The invention provides a novel multi-scale feature fusion network MBTFPN, which is shown in figure 3, under the influence of a BTFPN structure, and the structure increases the fusion of a P2-in shallow network, so that a fusion node can transmit shallow strong positioning information and edge features, and further the target detection effect is improved. Meanwhile, the structure is connected with the input nodes P3_in, P4_in and P5_in and the output nodes P3_out, P4_out and P5_out, so that the detection performance is prevented from being influenced by the prediction end due to the loss of the characteristic information. And finally, changing the convolution structure with the convolution kernel size of 3 into a depth separable convolution structure, and reducing the model parameter number. MBTFPN makes full use of detail information in the shallow network, and improves the detection capability of the model on multi-scale targets. Meanwhile, the structure can prevent the loss of characteristic information in the P3 out prediction end and improve the detection precision of the model on the target.
Step 32: introduce deformable convolution. The prediction end of the original YOLOX-Tiny model uses two CBS structures to predict from the features extracted by the multi-scale feature fusion network: the feature map is divided into parts matching the convolution kernel and then convolved, and the position of each part on the feature map is fixed, so for targets with complex deformation a certain deviation remains between the prediction box and the ground-truth box. Deformable convolution learns offsets through an additional convolution layer, so that the sampling points of the conventional convolution kernel are no longer a rectangular region but an arbitrary region, effectively handling shape changes of the target and improving the network's ability to predict remote sensing targets.
The implementation of deformable convolution differs from that of conventional convolution. For a conventional 3×3 convolution, the sampling range of the convolution kernel R = {(-1, -1), (0, -1), ..., (0, 1), (1, 1)} is given first, and the sampled values are then weighted and summed. For each position P_0, the conventional convolution output feature map Y is computed as in formula (1):

$$Y(P_0)=\sum_{P_n\in R} w(P_n)\cdot X(P_0+P_n) \tag{1}$$

where w is the convolution kernel weight, X is the input feature map, and P_n is a sampling position of the convolution kernel.
The deformable convolution applies a parallel conventional convolution for feature extraction to obtain a new feature map with the same height and width as the input feature map and 2N channels, as shown in the upper half of FIG. 4, where N is the number of convolution kernel sampling points; the new feature map contains the x-axis and y-axis offsets ΔP_n, which are learned end to end by gradient back-propagation. Once the offsets are learned, the sampling positions of the deformable convolution kernel adapt to the image content, accommodating geometric variations such as the shapes and sizes of different objects. A shifted position, however, is only a coordinate and carries no real feature value, so the feature value at the shifted coordinate is computed by bilinear interpolation from the feature values of the original input feature map, completing the adaptive sampling of spatial features, as shown in the lower half of FIG. 4. The calculation is given in formula (2):

$$Y(P_0)=\sum_{P_n\in R} w(P_n)\cdot X(P_0+P_n+\Delta P_n) \tag{2}$$

where R denotes the sampling range of the convolution kernel, and the offsets ΔP_n together form the complete offset field for the position P_0.
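A minimal PyTorch sketch of such a deformable block is given below, using torchvision.ops.DeformConv2d for the bilinear sampling of formula (2); the module name, the zero initialization of the offset branch and the channel sizes in the usage example are illustrative assumptions, not the exact prediction-end layers of the invention.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        n = k * k  # number of sampling points N of the kernel
        # parallel conventional convolution predicting a 2N-channel offset map (x and y offsets per point)
        self.offset_conv = nn.Conv2d(in_ch, 2 * n, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)  # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        # the deformable convolution samples the input at the shifted positions via bilinear interpolation
        self.deform_conv = DeformConv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)  # (B, 2N, H, W), learned end to end
        return self.act(self.bn(self.deform_conv(x, offsets)))

# usage on a feature map coming out of the fusion network (channel count assumed)
feat = torch.randn(1, 96, 80, 80)
print(DeformableBlock(96, 96)(feat).shape)  # torch.Size([1, 96, 80, 80])
```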
Step 33: optimize the loss function to SIoU. A loss function measures the distance between the information predicted by the neural network and the expected information (the label); the closer the prediction is to the expectation, the smaller the loss value. The loss function of the original YOLOX-Tiny model consists of three parts: bounding-box regression loss, target confidence loss and classification confidence loss, where the bounding-box regression loss uses the IoU loss. Because remote sensing images contain a large number of small targets, the prediction box and the ground-truth box are often in a containment relationship; using the IoU loss as the bounding-box regression loss then cannot accurately reflect how well the two boxes coincide, leading to poor localization. Loss functions such as GIoU, DIoU and CIoU improve on the IoU loss for different shortcomings between the prediction box and the ground-truth box. However, none of them considers the direction of the mismatch between the ground-truth box and the prediction box, which can cause the prediction box to 'wander' during training, slowing convergence and reducing efficiency. The invention therefore replaces the IoU loss with the SIoU loss, so that the bounding box moves in the correct direction, convergence is accelerated and detection performance is improved. The SIoU loss consists of four parts: shape loss, IoU loss, distance loss, and the angle loss contained in the distance loss.
The angle loss is defined as shown in formula (3):

$$\Lambda = 1-2\sin^{2}\!\left(\arcsin\frac{c_h}{\sigma}-\frac{\pi}{4}\right) \tag{3}$$

where c_h is the height difference between the center points of the ground-truth box and the prediction box, as shown in formula (4), and σ is the distance between the center points of the ground-truth box and the prediction box, as shown in formula (5):

$$c_h=\max\!\left(b^{gt}_{cy},\,b_{cy}\right)-\min\!\left(b^{gt}_{cy},\,b_{cy}\right) \tag{4}$$

$$\sigma=\sqrt{\left(b^{gt}_{cx}-b_{cx}\right)^{2}+\left(b^{gt}_{cy}-b_{cy}\right)^{2}} \tag{5}$$

where (b^{gt}_{cx}, b^{gt}_{cy}) are the center coordinates of the ground-truth box and (b_{cx}, b_{cy}) are the center coordinates of the prediction box.
The calculation of the angle loss is illustrated in FIG. 5 (a) and the calculation of the distance loss in FIG. 5 (b).
When the angle α between the prediction box B and the ground-truth box B^{GT} is 0° or 90°, the angle loss is 0. During training, if α is smaller than 45° the prediction box is moved in the direction that reduces α; otherwise it is moved in the direction that reduces β, so that the angle loss is minimized, with β + α = 90°.
The distance loss is defined as shown in formula (6):

$$\Delta=\sum_{t=x,y}\left(1-e^{-\gamma\rho_{t}}\right),\qquad \rho_{x}=\left(\frac{b^{gt}_{cx}-b_{cx}}{c_{w}}\right)^{2},\quad \rho_{y}=\left(\frac{b^{gt}_{cy}-b_{cy}}{c_{h}}\right)^{2},\quad \gamma=2-\Lambda \tag{6}$$

where c_w and c_h are the width and height of the minimum enclosing box of the prediction box B and the ground-truth box B^{GT}.
When α approaches 0°, the contribution of the distance loss is greatly reduced; as α gradually approaches 45°, its contribution grows, so the distance loss becomes harder to reduce as the angle increases. γ can therefore be regarded as a dynamic, angle-dependent weight on the distance loss.
The shape loss is defined as shown in formula (7):

$$\Omega=\sum_{t=w,h}\left(1-e^{-\omega_{t}}\right)^{\theta},\qquad \omega_{w}=\frac{|w-w^{gt}|}{\max\left(w,w^{gt}\right)},\quad \omega_{h}=\frac{|h-h^{gt}|}{\max\left(h,h^{gt}\right)} \tag{7}$$

where (w, h) and (w^{gt}, h^{gt}) are the width and height of the prediction box and of the ground-truth box, respectively; θ is the shape weight, which controls how much attention is paid to the shape. To avoid paying excessive attention to the shape loss and thereby reducing the movement of the prediction box, the range of θ is limited to [2, 6].
In summary, the final SIoU bounding-box regression loss function is defined as shown in formula (8):

$$L_{box}=1-IoU+\frac{\Delta+\Omega}{2} \tag{8}$$

where IoU = S_1/S_2, S_1 is the area of the intersection of the prediction box B and the ground-truth box B^{GT}, and S_2 is the area of their union.
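As a non-limiting illustration, a compact PyTorch sketch of the SIoU bounding-box regression loss of formulas (3)-(8) follows; the function name, the (x1, y1, x2, y2) box format and the default θ = 4 are assumptions made for the example.

```python
import math
import torch

def siou_loss(pred: torch.Tensor, target: torch.Tensor,
              theta: float = 4.0, eps: float = 1e-7) -> torch.Tensor:
    """SIoU regression loss for boxes given as (x1, y1, x2, y2), shape (B, 4)."""
    # IoU term
    ix1, iy1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    ix2, iy2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # centers, widths and heights of prediction box B and ground-truth box B^GT
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wg, hg = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]

    # minimum enclosing box of B and B^GT
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # angle loss, formula (3): Lambda = 1 - 2 sin^2(arcsin(c_h / sigma) - pi / 4)
    sigma = torch.sqrt((tcx - pcx) ** 2 + (tcy - pcy) ** 2) + eps
    sin_alpha = ((tcy - pcy).abs() / sigma).clamp(max=1.0)
    angle = 1 - 2 * torch.sin(torch.arcsin(sin_alpha) - math.pi / 4) ** 2

    # distance loss, formula (6), with dynamic weight gamma = 2 - Lambda
    gamma = 2 - angle
    rho_x = ((tcx - pcx) / (cw + eps)) ** 2
    rho_y = ((tcy - pcy) / (ch + eps)) ** 2
    dist = (1 - torch.exp(-gamma * rho_x)) + (1 - torch.exp(-gamma * rho_y))

    # shape loss, formula (7)
    omega_w = (w - wg).abs() / (torch.max(w, wg) + eps)
    omega_h = (h - hg).abs() / (torch.max(h, hg) + eps)
    shape = (1 - torch.exp(-omega_w)) ** theta + (1 - torch.exp(-omega_h)) ** theta

    # formula (8)
    return 1 - iou + (dist + shape) / 2

# usage: one prediction box against one ground-truth box
print(siou_loss(torch.tensor([[10., 10., 50., 60.]]), torch.tensor([[12., 8., 55., 58.]])))
```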
Step 34: the training set is fed into the improved algorithm for training. The network model is implemented in Python with the PyTorch framework, and training is completed on a GPU server configured with an AMD EPYC 7543 processor, 30 GB of memory and an NVIDIA RTX A5000 graphics card; the software environment is Ubuntu 18.04, Python 3.8 and PyTorch 1.9.0. For a fair comparison, unified training parameters are used in the experiments: the input image size is 640×640, the number of training rounds is 300, the training batch size is 16, the optimization algorithm is Adam with momentum and decay coefficients set to 0.9 and 0 respectively, the learning rate is 0.001, the learning rate decays with a cosine annealing strategy down to a minimum of 1% of the initial rate, and the IoU threshold is 0.5. Training-set preprocessing uses the Mosaic data enhancement strategy: four pictures are randomly scaled and cropped and then stitched onto one picture to obtain a new picture containing multiple targets as training data. Mixup data enhancement then randomly selects two of the Mosaic-enhanced images and mixes them in a certain proportion to generate a new image. The Mosaic strategy enriches the training data set; in particular, the random scaling adds many small targets, making the network more robust. However, excessive Mosaic enhancement produces many inaccurate annotation boxes because of the random cropping, and the images generated by the two enhancement strategies drift away from the distribution of the original images, which degrades detection performance. This embodiment therefore applies the enhancement strategy only during the first 210 training rounds, with a 50% probability of Mosaic enhancement and, after Mosaic, a 50% probability of Mixup enhancement; data enhancement is turned off during the last 90 rounds, introducing images from the real distribution and improving the generalization ability of the model.
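A minimal sketch of this training configuration is given below; the model constructor and data loader are hypothetical placeholders, and mapping 'momentum 0.9, decay 0' onto Adam's first beta and weight decay is an assumption.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

# hypothetical stand-in for the improved YOLOX-Tiny biased feature fusion network
model = torch.nn.Conv2d(3, 16, 3, padding=1)

base_lr, total_epochs, no_aug_epochs = 1e-3, 300, 90
optimizer = Adam(model.parameters(), lr=base_lr, betas=(0.9, 0.999), weight_decay=0.0)
scheduler = CosineAnnealingLR(optimizer, T_max=total_epochs, eta_min=base_lr * 0.01)

for epoch in range(total_epochs):
    use_mosaic = epoch < total_epochs - no_aug_epochs  # Mosaic/Mixup only in the first 210 rounds
    # for images, targets in train_loader:             # hypothetical loader: batch size 16, 640x640 inputs
    #     loss = compute_loss(model(images), targets)  # SIoU box loss + confidence + classification losses
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```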
Step 4: the test set is input into the network for performance testing and the experimental results are analyzed. The invention uses detection precision and detection speed to measure the detection performance of the biased feature fusion network on the test set. Detection precision includes the precision (P), the recall (R), the average precision (average precision, AP) and the mean average precision (mean average precision, mAP); this embodiment uses mAP50 to measure detection accuracy. The formulas are shown in (9)-(12):

$$P=\frac{TP}{TP+FP} \tag{9}$$

$$R=\frac{TP}{TP+FN} \tag{10}$$

$$AP=\int_{0}^{1}P(R)\,dR \tag{11}$$

$$mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i} \tag{12}$$

where TP is the number of positive samples correctly predicted by the model; FP is the number of samples incorrectly predicted as positive; FN is the number of positive samples predicted as negative; N is the number of target categories; and AP_i is the average precision of the i-th category.
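The short sketch below, with made-up counts, illustrates formulas (9), (10) and (12); the AP integral of formula (11) is computed in practice from the precision-recall curve and is not reproduced here.

```python
from typing import List, Tuple

def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision P = TP / (TP + FP); recall R = TP / (TP + FN)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def mean_average_precision(ap_per_class: List[float]) -> float:
    """mAP is the mean of the per-class average precisions AP_i."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=90, fp=10, fn=20)       # illustrative counts
print(round(p, 3), round(r, 3))                    # 0.9 0.818
print(mean_average_precision([0.73, 0.65, 0.81]))  # about 0.73 (illustrative per-class APs)
```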
The detection speed index is the number of image frames detected per second (frames per second, FPS), calculated as shown in formula (13):

$$FPS=\frac{N}{T} \tag{13}$$

where N is the number of pictures and T is the detection time.
The model complexity is measured by the number of parameters (Parameters, Params), calculated as shown in formula (14):

$$Params=C_{o}\times\left(k_{w}\times k_{h}\times C_{i}+1\right) \tag{14}$$

where C_o is the number of output channels, C_i is the number of input channels, and k_w, k_h are the width and height of the convolution kernel.
The algorithm complexity index is the number of floating-point operations (FLOPs); GFLOPs denotes billions of floating-point operations.
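An illustrative sketch of measuring FPS and the parameter count in PyTorch follows; the stand-in network and the number of timed images are assumptions.

```python
import time
import torch

def count_params(model: torch.nn.Module) -> int:
    """Total number of learnable parameters (Params)."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def measure_fps(model: torch.nn.Module, n_images: int = 100, size: int = 640) -> float:
    """FPS = N / T over n_images single-image forward passes."""
    model.eval()
    x = torch.randn(1, 3, size, size)
    start = time.time()
    for _ in range(n_images):
        model(x)
    return n_images / (time.time() - start)

net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.SiLU())  # stand-in network
print(count_params(net))        # 448
print(round(measure_fps(net)))  # hardware dependent
```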
As can be seen from FIG. 6, compared with the original algorithm, the improved algorithm achieves gains of varying degrees on every class of the DIOR data set, demonstrating the effectiveness of the improvements to multi-scale feature fusion and the prediction end in raising the detection performance of the model.
Table 1 shows the performance evaluation comparison of the proposed detection algorithm with the original detection algorithm on DIOR.
Table 1 DIOR data set detection performance comparison
As can be seen from Table 1, the improved network achieves average precisions of AP_S = 12.8%, AP_M = 39.0% and AP_L = 69.8% on the test set for small, medium and large targets respectively, and its mAP50 reaches 73.68%. Compared with YOLOX-Tiny, the proposed algorithm improves AP_S, AP_M and AP_L by 1.5%, 3.7% and 7.1% respectively and mAP50 by 3.89%, verifying the effectiveness of the algorithm of the invention.
Table 2 shows the performance evaluation comparison of the proposed detection algorithm with other target detection algorithms on DIOR.
Table 2 Performance evaluation of the present invention and other target detection algorithms on the DIOR data set
As can be seen from Table 2, the detection algorithm proposed by the invention ranks first on the mAP50 index and remains competitive against the other target detection algorithms.
In conclusion, based on YOLOX-Tiny, the invention improves and optimizes the multi-scale feature fusion network structure and the prediction end, and achieves good detection results on remote sensing images characterized by a high proportion of small targets, complex backgrounds, high inter-class similarity together with intra-class diversity, and large variations of target scale across images.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (7)

1. Remote sensing target detection based on a YOLOX-Tiny biased feature fusion network, characterized by comprising the following steps:
S1, dividing the remote sensing data set DIOR into a test set and a training set in a certain proportion;
S2, resizing all pictures in the training set and the test set to a uniform size;
S3, introducing a multi-scale feature fusion network and deformable convolution on the basis of YOLOX-Tiny to build a biased feature fusion network, feeding the training set into the biased feature fusion network, and training with the SIoU loss function;
S4, inputting the test set into the biased feature fusion network for performance testing.
2. The remote sensing target detection based on the YOLOX-Tiny biased feature fusion network according to claim 1, wherein in step S3, an extra edge is added between the P3 input node and the output node in the multi-scale feature fusion network structure, and the convolution structures with a convolution kernel size of 3 adopt a depthwise separable convolution structure.
3. The remote sensing target detection method based on the YOLOX-Tiny biased feature fusion network according to claim 1, wherein in step S3, the CBS structure in the prediction end of the original YOLOX-Tiny model is replaced by a deformable convolution; the deformable convolution applies a parallel conventional convolution for feature extraction to obtain a new feature map with the same height and width as the input feature map and 2N channels, where N is the number of convolution kernel sampling points, and the new feature map contains the x-axis and y-axis offsets ΔP_n;
the feature value at each shifted coordinate is computed from the feature values of the original input feature map by bilinear interpolation, completing the adaptive sampling of spatial features, and the deformable convolution output feature map Y is expressed as:

$$Y(P_0)=\sum_{P_n\in R} w(P_n)\cdot X(P_0+P_n+\Delta P_n)$$

where w is the convolution kernel weight, X is the input feature map, P_n is a sampling position of the convolution kernel, P_0 is an arbitrary position, and R is the set of convolution kernel sampling positions.
4. The YOLOX-Tiny biased feature fusion network-based remote sensing target detection of claim 1, wherein in step S3, the SIoU loss function consists of a shape loss, an IoU loss, a distance loss, and the angle loss contained in the distance loss;
the angle loss is defined as follows:

$$\Lambda = 1-2\sin^{2}\!\left(\arcsin\frac{c_h}{\sigma}-\frac{\pi}{4}\right)$$

where c_h is the height difference between the center points of the ground-truth box and the prediction box, expressed as:

$$c_h=\max\!\left(b^{gt}_{cy},\,b_{cy}\right)-\min\!\left(b^{gt}_{cy},\,b_{cy}\right)$$

and σ is the distance between the center points of the ground-truth box and the prediction box, expressed as:

$$\sigma=\sqrt{\left(b^{gt}_{cx}-b_{cx}\right)^{2}+\left(b^{gt}_{cy}-b_{cy}\right)^{2}}$$

where (b^{gt}_{cx}, b^{gt}_{cy}) are the center coordinates of the ground-truth box and (b_{cx}, b_{cy}) are the center coordinates of the prediction box;
when the angle α between the prediction box and the ground-truth box is 0° or 90°, the angle loss is 0;
during training, if the angle α between the prediction box B and the ground-truth box B^{GT} is smaller than 45°, the prediction box is moved in the direction that reduces α; otherwise it is moved in the direction that reduces β, where β + α = 90°;
the distance loss is defined as follows:

$$\Delta=\sum_{t=x,y}\left(1-e^{-\gamma\rho_{t}}\right),\qquad \rho_{x}=\left(\frac{b^{gt}_{cx}-b_{cx}}{c_{w}}\right)^{2},\quad \rho_{y}=\left(\frac{b^{gt}_{cy}-b_{cy}}{c_{h}}\right)^{2},\quad \gamma=2-\Lambda$$

where c_w and c_h are the width and height of the minimum enclosing box of the prediction box B and the ground-truth box B^{GT};
the shape loss is defined as follows:

$$\Omega=\sum_{t=w,h}\left(1-e^{-\omega_{t}}\right)^{\theta},\qquad \omega_{w}=\frac{|w-w^{gt}|}{\max\left(w,w^{gt}\right)},\quad \omega_{h}=\frac{|h-h^{gt}|}{\max\left(h,h^{gt}\right)}$$

where (w, h) are the width and height of the prediction box, (w^{gt}, h^{gt}) are the width and height of the ground-truth box, and θ is the shape weight;
the final SIoU bounding-box regression loss function is defined as follows:

$$L_{box}=1-IoU+\frac{\Delta+\Omega}{2}$$

where IoU = S_1/S_2, S_1 is the area of the intersection of the prediction box B and the ground-truth box B^{GT}, and S_2 is the area of their union.
5. The remote sensing target detection based on the YOLOX-Tiny biased feature fusion network according to claim 1, wherein in step S4, detection precision, detection speed and model complexity are used to measure the detection performance of the biased feature fusion network on the test set; the detection precision comprises the precision P, the recall R, the average precision AP and the mean average precision mAP, and the detection speed index is the number of image frames detected per second (FPS).
6. The remote sensing target detection method based on the YOLOX-Tiny biased feature fusion network according to claim 5, wherein the detection precision comprises the precision P, the recall R, the average precision AP and the mean average precision mAP, calculated as follows:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad AP=\int_{0}^{1}P(R)\,dR,\qquad mAP=\frac{1}{N}\sum_{i=1}^{N}AP_{i}$$

wherein TP is the number of positive samples correctly predicted by the model; FP is the number of samples incorrectly predicted as positive; FN is the number of positive samples predicted as negative; N is the number of target categories; and AP_i is the average precision of the i-th category.
7. The remote sensing target detection based on the YOLOX-Tiny biased feature fusion network according to claim 5, wherein the detection speed index is the number of image frames detected per second (FPS), calculated as follows:

$$FPS=\frac{N}{T}$$

wherein N is the number of pictures and T is the detection time;
the model complexity is measured by the parameter count Params, calculated as follows:

$$Params=C_{o}\times\left(k_{w}\times k_{h}\times C_{i}+1\right)$$

wherein C_o is the number of output channels, C_i is the number of input channels, and k_w, k_h are the width and height of the convolution kernel.
CN202310622397.7A 2023-05-30 2023-05-30 Remote sensing target detection based on Yolox-Tiny biased feature fusion network Pending CN116645608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310622397.7A CN116645608A (en) 2023-05-30 2023-05-30 Remote sensing target detection based on Yolox-Tiny biased feature fusion network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310622397.7A CN116645608A (en) 2023-05-30 2023-05-30 Remote sensing target detection based on Yolox-Tiny biased feature fusion network

Publications (1)

Publication Number Publication Date
CN116645608A true CN116645608A (en) 2023-08-25

Family

ID=87618374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310622397.7A Pending CN116645608A (en) 2023-05-30 2023-05-30 Remote sensing target detection based on Yolox-Tiny biased feature fusion network

Country Status (1)

Country Link
CN (1) CN116645608A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117611791A (en) * 2023-10-20 2024-02-27 哈尔滨工业大学 Method for detecting flying target based on feature separation deformable convolution
CN118230079A (en) * 2024-05-27 2024-06-21 中国科学院西安光学精密机械研究所 Detection method for remote sensing small target based on improved YOLO
CN118230079B (en) * 2024-05-27 2024-08-30 中国科学院西安光学精密机械研究所 Detection method for remote sensing small target based on improved YOLO

Similar Documents

Publication Publication Date Title
CN112101434B (en) Infrared image weak and small target detection method based on improved YOLO v3
CN109977918B (en) Target detection positioning optimization method based on unsupervised domain adaptation
CN116645608A (en) Remote sensing target detection based on Yolox-Tiny biased feature fusion network
CN109101897A (en) Object detection method, system and the relevant device of underwater robot
CN111783772A (en) Grabbing detection method based on RP-ResNet network
CN112418108B (en) Remote sensing image multi-class target detection method based on sample reweighing
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
CN113177592B (en) Image segmentation method and device, computer equipment and storage medium
Zhang et al. A novel tracking method based on improved FAST corner detection and pyramid LK optical flow
CN112288084B (en) Deep learning target detection network compression method based on feature map channel importance
CN110910375A (en) Detection model training method, device, equipment and medium based on semi-supervised learning
CN115100136B (en) Workpiece category and pose estimation method based on YOLOv-tiny model
Kang et al. Yolo-6d+: single shot 6d pose estimation using privileged silhouette information
CN117372604A (en) 3D face model generation method, device, equipment and readable storage medium
CN116205918B (en) Multi-mode fusion semiconductor detection method, device and medium based on graph convolution
CN117197456A (en) HE dyeing-oriented pathological image cell nucleus simultaneous segmentation classification method
CN113643370B (en) NCC algorithm-based image positioning method and device
CN113158806B (en) OTD (optical time Domain _ Logistic) -based SAR (synthetic Aperture Radar) data ocean target detection method
An et al. Segmentation method of magnetic tile surface defects based on deep learning
CN116264016A (en) Lightweight real-time face detection and head posture estimation method and system
CN112529095B (en) Single-stage target detection method based on convolution region re-registration
CN117036918B (en) Infrared target detection method based on domain adaptation
CN117333435B (en) Thyroid nodule boundary definition detection method, thyroid nodule boundary definition detection system, electronic equipment and medium
CN113780319B (en) Closed loop detection method and device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination