CN110287927A

CN110287927A - Based on the multiple dimensioned remote sensing image object detection method with context study of depth

Info

Publication number: CN110287927A
Application number: CN201910583811.1A
Authority: CN
Inventors: 张向荣; 唐旭; 王少娜; 陈璞花; 古晶; 马文萍; 马晶晶; 侯彪
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-07-01
Filing date: 2019-07-01
Publication date: 2019-09-27
Anticipated expiration: 2039-07-01
Also published as: CN110287927B

Abstract

The invention discloses a kind of based on the multiple dimensioned remote sensing image object detection method with context study of depth, and mainly solution prior art characteristic amalgamation mode is rough, does not consider the utilization of contextual feature information, leads to the problem that detection accuracy is low.Implementation step are as follows: training sample and test sample are obtained in remote sensing image target detection data set；The RetinaNet detection model of multiple dimensioned and contextual feature enhancing is constructed, setting target classification task and goal position returns the loss function of task entirety；Training sample is input in constructed detection model and is trained, obtains trained detection model；Test sample is inputted in trained detection model, prediction output target category, objective degrees of confidence and target position.The present invention improves the ability to express of feature, improves the mean accuracy of remote sensing image target detection, can be used for obtaining the position of interested target and target in a width remote sensing image.

Description

Remote sensing image target detection method based on depth multi-scale and context learning

Technical Field

The invention belongs to the technical field of remote sensing images, and particularly relates to a target detection method for a remote sensing image, which can be used for obtaining an interested target in one remote sensing image and the position of the target.

Background

Remote sensing image target detection is one of important research contents in the field of remote sensing, and is widely applied to the fields of homeland planning, disaster monitoring, military reconnaissance and the like. The purpose of remote sensing image target detection is to judge whether an interested target exists in a remote sensing image and determine the position of the target.

The traditional remote sensing image target detection methods comprise a template matching-based method, a knowledge-based method and a detection object-based method, and the methods rely on a large amount of characteristic engineering to realize the detection of the target in the remote sensing image to a great extent. However, for the complicated and changeable remote sensing image background environment, the target scale difference is obvious, and the like, the adaptability of the methods is not strong. In recent years, a method based on deep learning is widely adopted for remote sensing image target detection. The deep convolutional neural network does not need to design features manually on the aspect of target detection, the remote sensing image data is subjected to feature extraction automatically, and performance exceeds that of a traditional algorithm. The RetinaNet (local for detect Object detection) model has the advantages of no need of generating a candidate region, high target detection speed, high precision and the like. However, the RetinaNet model still has limitations. Because the network architecture adopted by RetinaNet is a feature pyramid network, the feature pyramid network adds and fuses the feature graph of the current layer and the adjacent higher-level feature graph to obtain a feature graph for detecting the target. In this case, the feature fusion mode is rough, and the more effective utilization of the high-level feature map and the utilization of the context information are omitted, which restricts the improvement of the target detection precision of the remote sensing image.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a remote sensing image target detection method based on depth multi-scale and context learning so as to improve the target detection precision in a remote sensing image.

The technical scheme of the invention is as follows: and a multi-scale feature enhancement module and a context feature enhancement module are introduced into the RetinaNet detection model to construct a RetinaNet detection model with multi-scale and context feature enhancement by fully considering the more effective feature map fusion mode and the problem of how to utilize global context feature information. Firstly, feature maps of a plurality of levels are obtained from a backbone network and a feature pyramid network of a RetinaNet detection model, then a multi-scale feature enhancement module is introduced, semantic information of each relatively high-level feature map is guided to adjacent low-level feature maps for the feature maps of the plurality of levels, the semantic information of each relatively low-level feature map is enriched, then a context feature enhancement module is introduced to the pyramid feature map fused with the multi-scale enhancement to obtain global context features of a remote sensing image scene, finally the enhanced pyramid feature map is used in the detection model, and multi-task type determination and target position positioning are achieved through multi-task learning. The concrete implementation steps comprise:

1. a remote sensing image target detection method based on depth multi-scale and context learning is characterized by comprising the following steps:

(1) taking 75% of the remote sensing image target detection data set as a training sample, and taking the remaining 25% as a test sample;

(2) constructing a multiscale and context feature enhanced RetinaNet detection model:

(2a) obtaining 3 convolution characteristic graphs C3, C4 and C5 from a backbone network ResNet-101 of a RetinaNet detection model;

(2b) obtaining 4 pyramid feature maps P3, P4, P5 and P6 from a feature pyramid network of a RetinaNet detection model;

(2c) constructing a multi-scale feature enhancement module consisting of 7 feature maps;

(2d) taking the 3 convolution feature maps C3, C4, C5 and the fourth pyramid feature map P6 as the input of the multi-scale feature enhancement module to obtain 3 pyramid feature maps F3, F4 and F5 after fusion multi-scale enhancement;

(2e) constructing a context feature enhancement module consisting of 5 feature graphs;

(2f) taking the 3 fused multi-scale enhanced pyramid feature maps F3, F4 and F5 as the input of the context feature enhancement module to obtain 3 fused multi-scale context feature enhanced pyramid feature maps G3, G4 and G5;

(3) setting an integral loss function L of a target classification and target position regression task in a multiscale and context feature enhanced RetinaNet detection model:

(3a) setting the existing Focal local function as a Loss function of a target classification task in a multiscale and context feature enhanced RetinaNet detection model, and using L_clsRepresents;

(3b) setting the existing Smooth L1Loss function as a Loss function of a target position regression task in a multiscale and context feature enhanced RetinaNet detection model, and using L_regRepresents:

(3c) loss function L of task classified by target_clsAnd the loss function L of the target position regression task_regSetting the overall loss function L of the multi-scale context feature enhanced RetinaNet detection model as follows:

L＝L({p_i},{t_i})，

wherein,loss function for the target detection task and the target position regression task as a whole, N_clsRepresents the total number of positive sample anchor boxes, p, in the target classification task_iRepresenting the probability that the ith anchor box is the predicted target,representing the probability that the ith anchor box is a true target,a loss function of a target classification task in a multi-scale and context feature enhanced RetinaNet detection model, lambda represents a balance weight parameter between the target classification task and a target position regression task, N_regRepresents the total number of positive sample anchor boxes in the target location regression task,indicates the offset, t, of the ith anchor frame relative to the true target frame_iIndicating the offset of the ith anchor box relative to the predicted target bounding box,i represents the index of an anchor frame, the value range of the index is from 1 to M, and M is the total number of the anchor frames;

(4) training a multiscale and context feature enhanced RetinaNet detection model constructed in the step (2):

(4a) setting the learning rate to be 0.00001, setting Adam by an optimizer, setting the number of training steps to be 2000, setting the number of training rounds to be 100, and using classification model parameters obtained by backbone network ResNet-101 pre-training on an ImageNet data set as initialization parameters of a RetinaNet detection model with multi-scale and context feature enhancement;

(4b) inputting the training samples obtained in the step (1) into a multiscale and context feature enhanced RetinaNet detection model, optimizing the overall loss function L in the step (3c) by using an optimizer Adam, updating weight parameters, and obtaining the multiscale and context feature enhanced RetinaNet detection model containing the weight parameters when the number of training rounds reaches 100;

(5) and inputting the test sample into a multiscale and context feature enhanced RetinaNet detection model containing weight parameters, and predicting and outputting the position of a target boundary box, the target category and the confidence score of the target in the test sample.

Compared with the prior art, the invention has the following advantages:

firstly, in the prior art, a multi-scale feature enhancement module is introduced, which considers the semantic information of the high-level feature map to be efficiently utilized and guides the high-level feature map and the low-level feature map to be fused, so that the low-level feature map has rich semantic information on the premise of keeping the resolution unchanged, the expression of the low-level feature map is enhanced, and the classification confidence of the target is improved.

Secondly, the invention considers the utilization of the global context feature information, introduces a context feature enhancement module, effectively utilizes the complex characteristic of the remote sensing image scene, establishes the relation between the current position and other positions from the feature level, and obtains the global context feature of the remote sensing image scene, thereby improving the target detection precision.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is an image of simulation results of a baseball field test using the present invention and a reference method;

FIG. 3 is a simulation result image of a bridge being tested using the present invention and the baseline method;

FIG. 4 is an image of a simulation result of an aircraft being tested using the present invention and a baseline method.

Detailed Description

The embodiments and effects of the present invention will be described in further detail below with reference to the accompanying drawings.

Referring to fig. 1, the implementation steps of this embodiment are as follows:

step 1, obtaining a training sample and a testing sample.

The method comprises the steps of obtaining an open remote sensing image target detection data set NWPU VHR-10-v2, wherein the data set comprises 1172 remote sensing images with the size of 400 x 400 pixels and corresponding labeled target category and target position data on the remote sensing images, in the embodiment, 75% of data in the remote sensing image target detection data set are used as training samples, the rest 25% of data are used as test samples, namely 879 sample images in the remote sensing image target detection data set are used as the training samples, and the rest 293 images are used as the test samples.

And 2, constructing a multi-scale and context feature enhanced RetinaNet detection model.

2.1) obtaining 3 convolution characteristic graphs C3, C4 and C5 from a backbone network of a RetinaNet detection model:

the backbone network of the RetinaNet detection model comprises ResNet-50, ResNet-101 and ResNet-152, in the embodiment, the backbone network ResNet-101 is used, namely 3 convolution characteristic graphs C3, C4 and C5 are obtained from the backbone network ResNet-101 of the RetinaNet detection model;

2.2) obtaining 4 pyramid feature maps P3, P4, P5 and P6 from the feature pyramid network of the RetinaNet detection model;

2.3) constructing a multi-scale feature enhancement module consisting of 7 feature maps:

2.3.1) constructing 2 feature maps, wherein the first is a high-level feature map T1 and the second is a low-level feature map T2;

2.3.2) takes 2 branch operations in parallel on the first high-level feature map T1:

sequentially passing the first branch through a global average pooling layer, a dimension conversion layer, a 1 × 1 convolutional layer with a first step length of 1 and a first up-sampling layer to obtain a low-level feature map T3 containing global context information;

the second branch passes through a second 1 × 1 convolutional layer with the step size of 1 and a second up-sampling layer in sequence to obtain an up-sampled low-level characteristic diagram T4;

2.3.3) inputting the second low-level feature map T2 into the 3 × 3 convolutional layer with the step length of 1, and outputting to obtain a low-level feature map T5 after channel conversion;

2.3.4) inputting the low-level feature map T3 containing the global context information and the low-level feature map T5 after channel transformation into a fusion multiplication layer to obtain a fusion multiplied low-level feature map T6;

2.3.5) inputting the fused multiplied low-level feature map T6 and the up-sampled low-level feature map T4 into a fused addition layer to obtain a multi-scale enhanced feature map T7;

2.4) taking the 3 convolution feature maps C3, C4, C5 and the fourth pyramid feature map P6 as the input of the multi-scale feature enhancement module to obtain 3 fused multi-scale enhanced pyramid feature maps F3, F4 and F5:

2.4.1) inputting the second convolution feature map C4 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting the first convolution feature map C3 as a low-level feature map T1 in the multi-scale enhancement module, and outputting to obtain a multi-scale enhanced first feature map E3;

2.4.2) adding and fusing the multi-scale enhanced first feature map E3 and the first pyramid feature map P3 to obtain a fused multi-scale enhanced first pyramid feature map F3;

2.4.3) inputting the third convolution feature map C5 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting the second convolution feature map C4 as a low-level feature map T1 in the multi-scale feature enhancement module, and outputting to obtain a second feature map E4 after multi-scale enhancement;

2.4.4) adding and fusing the multi-scale enhanced second feature map E4 and the second pyramid feature map P4 to obtain a fused multi-scale enhanced second pyramid feature map F4;

2.4.5) inputting the fourth pyramid feature map P6 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting the third convolution feature map C5 as a low-level feature map T1 in the multi-scale feature enhancement module, and outputting to obtain a multi-scale enhanced third feature map E5;

2.4.6) adding and fusing the multi-scale enhanced third feature map E5 and the third pyramid feature map P5 to obtain a fused multi-scale enhanced third pyramid feature map F5;

2.5) constructing a context feature enhancement module consisting of 5 feature maps:

2.5.1) constructing a fused multi-scale enhanced pyramid feature map S1, and sequentially passing the pyramid feature map through a 1 × 1 convolution layer with a first step size of 1 and a softmax layer to obtain an activated pyramid feature map S2;

2.5.2) inputting the activated pyramid feature map S2 and the fused multi-scale enhanced pyramid feature map S1 into a first fusion multiplication layer to obtain a pyramid feature map S3 after fusion multiplication;

2.5.3) sequentially passing the pyramid feature map S3 after fusion multiplication through a 1 × 1 convolution layer with a second step size of 1, a modified linear unit layer and a 1 × 1 convolution layer with a third step size of 1 to obtain a modified and fused pyramid feature map S4;

2.5.4) inputting the modified fused pyramid feature map S4 and the fused multi-scale enhanced pyramid feature map S1 into a second fusion multiplication layer to obtain a fused context feature enhanced pyramid feature map S5;

2.6) using the 3 pyramid feature maps F3, F4 and F5 after the fusion multi-scale enhancement as the input of the context feature enhancement module to obtain 3 pyramid feature maps G3, G4 and G5 after the fusion multi-scale context feature enhancement:

2.6.1) inputting the fused multi-scale enhanced first pyramid feature map F3 as a feature map S1 of a context feature enhancement module to obtain a fused context feature enhanced first pyramid feature map G3;

2.6.2) inputting the fused multi-scale enhanced second pyramid feature map F4 as a feature map S1 of a context feature enhancement module to obtain a fused context feature enhanced second pyramid feature map G4;

2.6.3) inputting the fused multi-scale enhanced third pyramid feature map F5 as the feature map S1 of the context feature enhancement module to obtain a fused context feature enhanced third pyramid feature map G5.

And 3, setting an integral loss function L of a target classification and target position regression task in the constructed multi-scale and context feature enhanced RetinaNet detection model.

3.1) setting the existing Focal local function as a Loss function of a target classification task in a multiscale and context feature enhanced RetinaNet detection model, and using L_clsExpressed as:

L_cls＝FL(p_i)，

wherein, FL (p)_i)＝-α(1-p_i)^γ×log(p_i) Representing the focus loss function, α representing the equilibrium parameter for positive and negative samples, gamma representing the concentration parameter, p_iThe probability that the ith anchor frame is a prediction target is represented, i represents the index of the anchor frame, the value range of i is from 1 to M, and M is the total number of the anchor frames;

in this example, α is set to 0.25, γ is set to 2.0;

3.2) setting the existing Smooth L1Loss function as a Loss function of a target position regression task in a multiscale and context feature enhanced RetinaNet detection model, and using L_regExpressed as:

L_reg＝Smooth_L1(x)，

wherein, Smooth_L1(x) Representing a smoothed L1 squared loss function, represents the offset t of the ith anchor frame relative to the predicted target frame_iOffset of ith anchor frame relative to real target frameA difference of (d);

3.3) loss function L of the task of classification by target_clsAnd the loss function L of the target position regression task_regSetting the overall loss function L of the multi-scale context feature enhanced RetinaNet detection model as follows:

L＝L({p_i},{t_i})，

wherein,loss function for the target detection task and the target position regression task as a whole, N_clsRepresenting the total number of positive sample anchor boxes in the target classification task,representing the probability that the ith anchor box is a true target,a loss function of a target classification task in a multi-scale and context feature enhanced RetinaNet detection model, lambda represents a balance weight parameter between the target classification task and a target position regression task, N_regRepresents the total number of positive sample anchor boxes in the target location regression task,indicates the offset, t, of the ith anchor frame relative to the true target frame_iIndicating the offset of the ith anchor box relative to the predicted target bounding box,a loss function of a target position regression task in a multiscale and context feature enhanced RetinaNet detection model;

in this embodiment, λ is 1.

And 4, training the multiscale and context feature enhanced RetinaNet detection model constructed in the step 2.

4.1) setting training parameters:

in this embodiment, the learning rate is set to 0.00001, Adam is used by the optimizer, the number of training steps is set to 2000, the number of training rounds is set to 100, and classification model parameters obtained by using backbone network ResNet-101 pre-training are used on the ImageNet data set as initialization parameters of a multiscale and context feature enhanced retannet detection model;

4.2) inputting the training samples in the step 1 into a multi-scale and context feature enhanced RetinaNet detection model, optimizing the overall loss function L in the step 3 by using an optimizer Adam, updating the weight parameters, and obtaining the multi-scale and context feature enhanced RetinaNet detection model containing the weight parameters when the number of training rounds reaches 100.

And 5, inputting the test sample in the step 1 into a multiscale and context feature enhanced RetinaNet detection model containing weight parameters, and predicting and outputting the position of a target boundary frame, the target type and the confidence score of the target in the test sample image.

The effect of the invention can be further illustrated by the following simulation experiment:

simulation conditions and contents

The simulation adopts a public NWPU VHR-10-v2 data set widely applied to performance evaluation of a remote sensing image target detection algorithm to train and test a RetinaNet detection model with multi-scale and context feature enhancement, and the adopted benchmark method is the RetinaNet detection model.

Let the NWPU VHR-10-v2 dataset include 10 object classes, respectively: airplanes, ships, oil storage tanks, baseball fields, basketball fields, tennis courts, playgrounds, ports, vehicles, and bridges.

The processor used for simulation isXeon(R)CPU E5-2630v4@2.20GHz×40，The memory is 64.00GB, the GPU is 8G GeForce GTX1080, the simulation platform is an Ubuntu16.04 operating system, a Keras deep learning framework is used, and Python language is adopted for realization.

Second, simulation content

Simulation 1: the detection simulation of the baseball field using the present invention and the existing reference method has the result shown in fig. 2, and as can be seen from fig. 2, the classification confidence score of the baseball field of the reference method is 0.929, as shown in fig. 2(a), the classification confidence score of the baseball field of the present invention reaches 1.000, as shown in fig. 2(b), compared with the reference method, the classification performance of the baseball field of the present invention is relatively obviously improved.

Simulation 2: the bridge detection simulation is carried out by using the method and the existing benchmark method, the result is shown in fig. 3, the classification confidence scores of 2 bridges in the benchmark method are respectively 0.660 and 0.850, as shown in fig. 3(a), the classification confidence scores of 2 bridges in the method respectively reach 0.974 and 0.927, as shown in fig. 3(b), compared with the benchmark method, the method has obvious improvement on the classification confidence scores of the bridges, and the method is mainly characterized in that the expression of the context characteristics is enhanced by introducing a context characteristic enhancement module due to strong dependence of the bridges on context information of a scene.

Simulation 3: the results of the detection simulation of 5 airplanes by using the present invention and the existing benchmark method are shown in fig. 4, and it can be seen from fig. 4 that the classification confidence scores of 5 airplanes in the benchmark method are all 1.000, as shown in fig. 4(a), and the classification confidence scores of 5 airplanes in the present invention are all 1.000, as shown in fig. 4(b), which indicates that the benchmark method and the present invention have good performance for airplane classification.

Third, comparing and analyzing simulation experiment results

To verify the effectiveness of the present invention, 3 existing methods were set up, of which: the existing method 1 is a RetinaNet detection model; the existing method 2 is a remote sensing image target detection model with rotation insensitivity and context enhancement; the existing method 3 is a remote sensing image target detection model with multi-model decision fusion.

The mean average precision is used as an evaluation index when all target types are detected, the average precision is used as an evaluation index when a single-type target is detected, the target on the NWPU VHR-10-v2 test data set is subjected to detection simulation by using the method and 3 existing methods, and the numerical results of the detected evaluation indexes are compared, as shown in Table 1.

TABLE 1 comparison of evaluation index values measured by the present invention and 3 conventional methods

In table 1, the comparison of the evaluation index numerical results detected by the present invention and 3 existing methods, the results of the average precision of the multi-target detection and the average precision of each category are both decimal numbers, and bold represents the highest average precision of the detection of the category target in the above four methods.

According to table 1, the following 3 conclusions are obtained in comparison of the evaluation index numerical results detected by the invention and 3 existing methods:

1) the average precision of the mean value of the existing method 1 is 0.9150, the average precision of the mean value of the invention is 0.9551, and the average precision of the mean value of the invention is improved by 0.0401 compared with the average precision of the mean value of the existing method 1;

2) the average accuracy of 6 types of targets is higher than that of 8 types of targets in the prior method 1, particularly for bridges and basketball courts, the average accuracy is obviously improved, mainly because the bridges and the basketball courts have stronger dependence on context information, the introduced context characteristic enhancement module enhances the expression of context characteristics, the average accuracy of ship detection is also improved, and mainly because the scale change of ships is large, the introduced multi-scale characteristic enhancement module enhances the expression of multi-scale characteristics of the targets;

3) for the existing method 2 and the existing method 3, both belong to two-step target detection models, and the invention belongs to a single-step target detection model, generally, the average precision of the mean value of the two-step target detection model is higher than that of the single-step target detection model, and the comparison of the evaluation index numerical results of the detection shows that the average precision of the mean value of the invention is higher than that of the existing method 2 and the existing method 3.

In summary, the invention introduces a multi-scale feature enhancement module on the basis of the existing RetinaNet detection model, guides semantic information on a high-level feature map to a low-level feature map, enriches the semantic information of the low-level feature map, further introduces a context feature enhancement module, and finally applies the RetinaNet detection model introduced with the multi-scale and context feature enhancement module to target detection, outputs a detection result, and improves the precision of remote sensing image target detection.

The foregoing description is only an example of the present invention and is not intended to limit the invention, so that it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims

L＝L({p_i},{t_i})，

2. The method of claim 1, wherein (2c) constructs a multi-scale feature enhancement module consisting of 7 feature maps, which is implemented as follows:

(2c1) constructing 2 feature maps, wherein the first is a high-level feature map T1 and the second is a low-level feature map T2;

(2c2) take 2 branch operations side by side on the first high level feature graph:

sequentially passing the first branch through a global average pooling layer, a dimension conversion layer, a first 1 × 1 convolutional layer with the step length of 1 and a first up-sampling layer to obtain a low-level feature map T3 containing global context information;

(2c3) inputting the second low-level feature map T2 into the 3 × 3 convolutional layer with step length of 1, and outputting to obtain a low-level feature map T5 after channel conversion;

(2c4) inputting the low-level feature map T3 containing global context information and the channel-transformed low-level feature map T5 into a fusion multiplication layer to obtain a fusion-multiplied low-level feature map T6,

(2c5) and inputting the fused and multiplied low-level feature map T6 and the up-sampled low-level feature map T4 into a fused addition layer to obtain a multi-scale enhanced feature map T7.

3. The method of claim 1 or 2, wherein (2d) 3 convolution feature maps C3, C4, C5 and a fourth pyramid feature map P6 are used as input to the multi-scale feature enhancement module to obtain 3 fused multi-scale enhanced pyramid feature maps F3, F4 and F5, which are implemented as follows:

(2d1) inputting the second convolution feature map C4 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting the first convolution feature map C3 as a low-level feature map T1 in the multi-scale enhancement module, and outputting to obtain a multi-scale enhanced first feature map E3;

(2d2) adding and fusing the multi-scale enhanced first feature map E3 and the first pyramid feature map P3 to obtain a fused multi-scale enhanced first pyramid feature map F3;

(2d3) inputting a third convolution feature map C5 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting a second convolution feature map C4 as a low-level feature map T1 in the multi-scale feature enhancement module, and outputting to obtain a second feature map E4 after multi-scale enhancement;

(2d4) adding and fusing the multi-scale enhanced second feature map E4 and the second pyramid feature map P4 to obtain a fused multi-scale enhanced second pyramid feature map F4;

(2d5) inputting the fourth pyramid feature map P6 as a high-level feature map T2 in the multi-scale feature enhancement module, inputting the third convolution feature map C5 as a low-level feature map T1 in the multi-scale feature enhancement module, and outputting to obtain a multi-scale enhanced third feature map E5;

(2d6) and adding and fusing the multi-scale enhanced third feature map E5 and the third pyramid feature map P5 to obtain a fused multi-scale enhanced third pyramid feature map F5.

4. The method of claim 1, wherein (2e) constructs a context feature enhancement module consisting of 5 feature maps, which is implemented as follows:

(2e1) constructing a fused multi-scale enhanced pyramid feature map S1, and sequentially passing the pyramid feature map through a 1 × 1 convolution layer with a first step size of 1 and a softmax layer to obtain an activated pyramid feature map S2;

(2e2) inputting the activated pyramid feature map S2 and the fused multi-scale enhanced pyramid feature map S1 into a first fusion multiplication layer to obtain a pyramid feature map S3 after fusion multiplication;

(2e3) sequentially passing the fused and multiplied pyramid feature map S3 through a second 1 × 1 convolution layer with the step size of 1, a modified linear unit layer and a third 1 × 1 convolution layer with the step size of 1 to obtain a modified and fused pyramid feature map S4;

(2e4) inputting the modified fused pyramid feature map S4 and the fused multi-scale enhanced pyramid feature map S1 into a second fusion multiplication layer to obtain a fused context feature enhanced pyramid feature map S5.

5. The method of claim 1, wherein 3 fused multi-scale enhanced pyramid feature maps F3, F4, and F5 are used as input of the context feature enhancement module in (2F), resulting in 3 fused multi-scale context feature enhanced pyramid feature maps G3, G4, and G5, which are implemented as follows:

(2f1) inputting the fused multi-scale enhanced first pyramid feature map F3 as a feature map S1 of a context feature enhancement module to obtain a fused context feature enhanced first pyramid feature map G3;

(2f2) inputting the second pyramid feature map F4 subjected to fusion multi-scale enhancement as a feature map S1 of a context feature enhancement module to obtain a second pyramid feature map G4 subjected to fusion context feature enhancement;

(2f3) and inputting the third pyramid feature map F5 subjected to fusion multi-scale enhancement as a feature map S1 of a context feature enhancement module to obtain a third pyramid feature map G5 subjected to fusion context feature enhancement.

6. The method according to claim 1, wherein (3a) the existing Focal local function is set as a Loss function L of a target classification task in a multiscale and context feature enhanced RetinaNet detection model_clsIt is expressed as follows:

L_cls＝FL(p_i)，

wherein, FL (p)_i)＝-α(1-p_i)^γ×log(p_i) Representing the focus loss function, α representing the balance parameters of the positive and negative samplesγ denotes concentration parameter, p_iRepresenting the probability that the ith anchor box is the predicted target.

7. The method according to claim 1, wherein (3b) the existing Smooth L1Loss function is set as the Loss function L of the target position regression task in the multiscale and context feature enhanced RetinaNet detection model_regIt is expressed as follows:

L_reg＝Smooth_L1(x)，

wherein, Smooth_L1(x) Representing a smoothed L1 squared loss function,represents the offset t of the ith anchor frame relative to the predicted target frame_iOffset of ith anchor frame relative to real target frameThe difference of (a).