CN114648736B - Robust engineering vehicle identification method and system based on target detection - Google Patents

Robust engineering vehicle identification method and system based on target detection Download PDF

Info

Publication number
CN114648736B
CN114648736B (application CN202210538060.3A)
Authority
CN
China
Prior art keywords
attention
feature map
correction
module
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210538060.3A
Other languages
Chinese (zh)
Other versions
CN114648736A (en)
Inventor
Wang Zhongyuan (王中元)
Li Yunhao (李云浩)
Chen Shijie (陈世杰)
Shao Zhenfeng (邵振峰)
He Zheng (何政)
Deng Lianbing (邓练兵)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210538060.3A priority Critical patent/CN114648736B/en
Publication of CN114648736A publication Critical patent/CN114648736A/en
Application granted granted Critical
Publication of CN114648736B publication Critical patent/CN114648736B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/25 Fusion techniques
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention discloses a robust engineering vehicle identification method and system based on target detection. A multi-scale feature extraction network extracts a feature map from the video frame image to be identified, and a feature enhancement network based on an attention mechanism performs feature enhancement to obtain an enhanced feature map. The enhanced feature map is then input into a position detection network to predict target positions, low-quality position predictions are filtered out through post-processing, regions of interest are extracted from the feature map according to the predicted target positions, and a cascaded position correction and category prediction network followed by post-processing yields the final target positions and types. The method and system can accurately detect the position and type of engineering vehicles in power grid monitoring videos with complex scene changes, thereby providing an automated means of monitoring external-damage behavior, reducing labor cost and ensuring the safety of the power grid system.

Description

Robust engineering vehicle identification method and system based on target detection
Technical Field
The invention belongs to the technical field of computer vision, relates to a computer aided engineering vehicle detection method and system, and particularly relates to a robust engineering vehicle identification method and system based on target detection.
Technical background
The transmission of electric power depends on a safe and robust power grid, yet outdoor transmission lines face many potential safety hazards. External damage caused by construction vehicles is the most common human factor, for example an excavator damaging a high-voltage pole during illegal construction, or a crane touching a high-voltage wire through improper operation. To avoid such accidents, the relevant departments place many monitoring cameras along power grid transmission lines, but relying on manual monitoring alone is costly and prone to oversight. If target detection technology can be applied to video monitoring so that engineering vehicles in the monitored scene are automatically analyzed and identified by an algorithm, labor cost is reduced, false alarms and missed alarms caused by human negligence are avoided, and a means is provided to give warnings before an accident occurs, assist in handling it when it occurs, and conveniently obtain evidence afterwards, thereby ensuring stable operation of the power grid.
Object detection is a popular direction in computer vision and digital image processing. In recent years, anchor-free ideas that detect targets from key points or center points have emerged in this field. CornerNet first treated target detection as a key-point detection and grouping task: it locates a pair of key points, the upper-left and lower-right corners of the rectangular region where a target lies, via heat maps, and then pairs the key points to obtain the target bounding box. ExtremeNet detects five key points per target, the topmost, bottommost, leftmost and rightmost extreme points plus the center point, and groups the four extreme points through the center point to obtain the position prediction. CenterNet simplifies the target detection task further: it predicts the geometric center point of the target bounding box with key-point detection and then regresses the box size, thereby localizing the target. CenterNet2 integrates CenterNet into a two-stage framework trained with a probabilistic objective, combining the advantages of anchor-free single-stage algorithms and two-stage algorithms, accelerating the inference of the two-stage algorithm and greatly improving detection precision.
In power grid transmission line monitoring videos captured in real scenes, the scene or background of an image is very complex and involves varied weather, illumination and other conditions; limited by the diverse software and hardware parameters of different cameras, the captured images differ in viewing angle, sharpness, resolution and other aspects, and the scales of different target instances in an image also vary greatly. In addition, extremely small objects, occluded objects and multiple overlapping or densely distributed objects are very troublesome, and engineering vehicles are highly variable in form. How to overcome these problems, detect and identify engineering vehicles with high precision, and obtain a robust detection and identification system is a difficult problem faced by the prior art.
Disclosure of Invention
In order to solve the technical problems, the invention provides a robust engineering vehicle identification method and system based on target detection by combining a target detection algorithm based on a central domain.
The method adopts the technical scheme that: a robust engineering vehicle identification method based on target detection comprises the following steps:
step 1: extracting a feature map from a video frame image to be identified by adopting a multi-scale feature extraction network, and performing feature enhancement by using a feature enhancement network based on an attention mechanism to obtain an enhanced feature map;
step 2: inputting the enhanced feature map obtained in the step 1 into a position detection network so as to predict the target position of the engineering vehicle, filtering low-quality position prediction through post-processing, then extracting an interested area in the feature map according to the predicted target position, and then inputting a cascaded position correction and category prediction network and performing post-processing to obtain the final target position and the type of the engineering vehicle.
The technical scheme adopted by the system of the invention is as follows: a robust engineering vehicle identification system based on target detection comprises the following modules:
the module 1 is used for extracting a feature map from a video frame image to be identified by adopting a multi-scale feature extraction network, and performing feature enhancement by using a feature enhancement network based on an attention mechanism to obtain an enhanced feature map;
and the module 2 is used for inputting the enhanced feature map acquired by the module 1 into a position detection network so as to predict the target position of the engineering vehicle, filtering low-quality position prediction through post-processing, then extracting an interested region in the feature map according to the predicted target position, and then inputting the cascaded position correction and category prediction network and performing post-processing to obtain the final target position and type of the engineering vehicle.
Compared with the existing detection method, the method has the following advantages and positive effects:
(1) The feature extraction operation in the invention is divided into two parts, extraction and enhancement; the enhancement operation makes the extracted features more accurate and effective and provides effective reference information for subsequent modules. In addition, using the cascaded position correction and category prediction module to adjust the predicted categories and positions multiple times yields higher detection precision.
(2) The invention applies an anchor-free detection algorithm, so no anchor-related parameters need to be tuned, and the extra computation and inference time caused by densely tiling anchor boxes over the image are avoided.
(3) The method performs better on extremely small targets and on deformed, complex targets.
Drawings
FIG. 1: a method flowchart of an embodiment of the invention.
FIG. 2: the multi-scale feature extraction network structure chart provided by the embodiment of the invention.
FIG. 3: the feature enhancement network structure diagram of the embodiment of the invention.
FIG. 4: the position detection network structure diagram of the embodiment of the invention.
FIG. 5: the position correction and category prediction network structure diagram of the embodiment of the invention.
Detailed Description
In order to facilitate the understanding and implementation of the present invention by those of ordinary skill in the art, the present invention is further described in detail below with reference to the accompanying drawings and implementation examples. It is to be understood that the implementation examples described herein are only for the purpose of illustration and explanation and are not to be construed as limiting the present invention.
Referring to fig. 1, the robust engineering vehicle identification method based on target detection provided by the invention comprises the following steps:
step 1: extracting a feature map from a video frame image to be identified by adopting a multi-scale feature extraction network, and performing feature enhancement by using a feature enhancement network based on an attention mechanism to obtain an enhanced feature map;
in this embodiment, the specific implementation of step 1 includes the following substeps:
step 1.1: input the video frame image to be identified into the multi-scale feature extraction network, which operates over a set of scales, so as to obtain a group of original feature maps of different scales (the image, scale and feature-map symbols are given as formula images in the original); Ch is the number of channels, and in this example Ch has a value of 256;
step 1.2: input the original feature maps scale by scale into the attention-mechanism-based feature enhancement network, enhancing the response of the regions of interest in the feature maps, so as to obtain a group of enhanced feature maps.
And 2, step: inputting the enhanced feature map obtained in the step 1 into a position detection network so as to predict the target position of the engineering vehicle, filtering low-quality position prediction through post-processing, then extracting an interested region in the feature map according to the predicted target position, and then inputting a cascaded position correction and category prediction network and performing post-processing to obtain the final target position and type of the engineering vehicle.
In this embodiment, the specific implementation of step 2 includes the following substeps:
step 2.1: input the enhanced feature maps one by one into the position detection network, which is improved from CenterNet and has a high recall rate and strong, reliable foreground/background classification capability, thereby obtaining the position prediction maps and key-point thermodynamic diagrams corresponding to the input image. Each position of the position prediction map carries a position prediction value representing the distances from the target center point to the four sides of the circumscribed rectangle, and each position of the key-point thermodynamic diagram indicates the confidence that an engineering vehicle target key point exists at the corresponding position;
step 2.2: extract the K positions with peak responses from the key-point thermodynamic diagram as the predicted central-domain key points of engineering vehicle targets (K is a preset larger integer), and from these key points and the corresponding position prediction values compute K rectangular prediction regions that may contain an engineering vehicle, each described by the coordinates of its upper-left corner point together with its width and height (the symbols are given as formula images in the original). D-IoU NMS is then used to filter out part of the low-quality prediction results, leaving no more than M rectangular prediction regions, where M is a preset value; in this embodiment, the value of M during training and inference is 4000 and 1000, respectively;
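The decoding and filtering just described in step 2.2 can be sketched as follows in PyTorch-style code. This is only a sketch: the function names, the tensor layouts, and the left/top/right/bottom ordering of the four side distances are assumptions made for illustration and are not taken from the patent.

```python
import torch

def dists_to_boxes(centers, dists):
    """Convert predicted central-domain key points and side distances into boxes.
    centers: (K, 2) key-point coordinates (cx, cy) taken from heat-map peaks.
    dists:   (K, 4) predicted distances (left, top, right, bottom) to the box sides.
    Returns boxes of shape (K, 4) in (x1, y1, x2, y2) form."""
    cx, cy = centers.unbind(dim=1)
    l, t, r, b = dists.unbind(dim=1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=1)

def pairwise_diou(a, b, eps=1e-9):
    """Distance-IoU between every box in a (N, 4) and every box in b (M, 4)."""
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    iou = inter / (area_a[:, None] + area_b[None, :] - inter + eps)
    ctr_a = (a[:, :2] + a[:, 2:]) / 2
    ctr_b = (b[:, :2] + b[:, 2:]) / 2
    ctr_dist = ((ctr_a[:, None] - ctr_b[None, :]) ** 2).sum(-1)
    enc_lt = torch.min(a[:, None, :2], b[None, :, :2])
    enc_rb = torch.max(a[:, None, 2:], b[None, :, 2:])
    diag = ((enc_rb - enc_lt) ** 2).sum(-1) + eps
    return iou - ctr_dist / diag   # IoU minus the normalized center-distance penalty

def diou_nms(boxes, scores, diou_thr=0.6, max_keep=1000):
    """Greedy NMS that suppresses boxes by Distance-IoU instead of plain IoU."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0 and len(keep) < max_keep:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        overlap = pairwise_diou(boxes[i].unsqueeze(0), boxes[rest]).squeeze(0)
        order = rest[overlap <= diou_thr]
    return torch.tensor(keep, dtype=torch.long)
```

The `max_keep` argument plays the role of the preset value M (4000 during training and 1000 during inference in this embodiment).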
step 2.3: cutting and pooling the enhanced feature map obtained in the step 1 according to the rectangular prediction area calculated in the step 2.2 to obtain a feature map only containing a rectangular region of interest with uniform size; in this embodiment, the size of the cut enhanced feature map is 14 × 14;
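The crop-and-pool operation of step 2.3 can be approximated with torchvision's roi_align, assuming the enhanced feature map is laid out as a standard (N, C, H, W) tensor. The 14 x 14 output size follows this embodiment, while the feature-map stride used for spatial_scale is an illustrative assumption, not a value from the patent.

```python
import torch
from torchvision.ops import roi_align

def crop_and_pool(enhanced_feat, boxes, output_size=14, stride=8):
    """enhanced_feat: (1, C, H, W) enhanced feature map of one image.
    boxes: (M, 4) rectangular prediction regions in image coordinates (x1, y1, x2, y2).
    Returns (M, C, 14, 14) region-of-interest features of uniform size."""
    batch_idx = torch.zeros(len(boxes), 1, device=boxes.device, dtype=boxes.dtype)
    rois = torch.cat([batch_idx, boxes], dim=1)   # (M, 5): [batch_index, x1, y1, x2, y2]
    return roi_align(enhanced_feat, rois, output_size=output_size,
                     spatial_scale=1.0 / stride, aligned=True)
```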
step 2.4: input the feature maps obtained in step 2.3 into the cascaded position correction and class prediction network to obtain a localization-accuracy-aware classification confidence and a position correction value (the symbols are given as formula images in the original). Here N_cls represents the number of engineering vehicle categories, and the correction value corrects the coordinates of the upper-left corner point and the width and height of a rectangular region. In this embodiment, the localization-accuracy-aware classification confidence is the average value obtained over three corrections.
In step 2.4, the position correction value predicted during the first correction of the cascaded position correction and class prediction network is applied to the rectangular prediction regions obtained in step 2.2; the second correction applies the second predicted position correction value to the rectangular prediction regions after the first correction, and the third correction applies the third predicted position correction value to the rectangular prediction regions after the second correction. D-IoU NMS is then used on the result of the last correction to filter out redundant rectangular prediction regions, and regions below a confidence threshold are suppressed, giving the final detection results of all engineering vehicles in the input image, namely the types of the engineering vehicles and the positions of the rectangular regions where they are located. In this embodiment, the maximum number of rectangular prediction regions remaining after D-IoU NMS filtering is set to 2000 and 256 in the training and inference stages, respectively; the confidence threshold is 0.6.
Referring to fig. 2, the multi-scale feature extraction network adopted in the present embodiment is composed of a deep convolutional neural network and a multi-scale feature fusion layer; the last 6 layers of output of the deep convolutional neural network are convolved to generate feature maps C2, C3, C4, C5, C6 and C7 with the same channel number; c7 and C6 output P7 and P6 after passing through the multi-scale feature fusion layer; the P5 is obtained by splicing and convolution-fusing the result of the up-sampling of the P6 and the output of the attention gate module, wherein the attention gate module uses C5 and P6 as input to generate a feature map after the attention is exerted; the P4 is obtained by splicing and convolution-fusing the result of the up-sampling of the P5 and the output of the attention gate module, wherein the attention gate module uses C4 and P5 as input to generate a feature map after the attention is exerted; the P3 is obtained by splicing and convolution-fusing the result of the up-sampling of the P4 and the output of the attention gate module, wherein the attention gate module uses C3 and P4 as input to generate a feature map after the attention is exerted; the P2 is obtained by splicing the result of the upsampling of the P3 and the output of the attention gate module and carrying out convolution fusion, wherein the attention gate module uses C2 and P3 as inputs to generate a feature map after the attention is exerted.
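One top-down fusion step of the kind described above (upsample the higher-level map, gate the lateral map with an attention gate, concatenate and fuse by convolution) might look roughly like the following sketch. The internals of AttentionGate are assumed from the common attention-gate formulation in the literature, since the patent does not spell them out, and channel counts and kernel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Gates a lateral feature map C_i with an attention mask derived from C_i and
    the higher-level map P_{i+1} (a common attention-gate formulation; assumed here)."""
    def __init__(self, channels=256):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels, 1)
        self.phi = nn.Conv2d(channels, channels, 1)
        self.psi = nn.Conv2d(channels, 1, 1)

    def forward(self, c_i, p_up):
        # p_up is P_{i+1} resized to the spatial size of C_i
        attn = torch.sigmoid(self.psi(F.relu(self.theta(c_i) + self.phi(p_up))))
        return c_i * attn            # feature map after attention is applied

class TopDownFusion(nn.Module):
    """One P_i = conv(concat(upsample(P_{i+1}), AttentionGate(C_i, P_{i+1}))) step."""
    def __init__(self, channels=256):
        super().__init__()
        self.gate = AttentionGate(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, c_i, p_next):
        p_up = F.interpolate(p_next, size=c_i.shape[-2:], mode="nearest")
        gated = self.gate(c_i, p_up)
        return self.fuse(torch.cat([p_up, gated], dim=1))
```

Under this sketch, P5 would be produced as TopDownFusion()(C5, P6), P4 as TopDownFusion()(C4, P5), and so on down to P2; the bottom-up path of the feature enhancement network mirrors this with down-sampling instead of up-sampling.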
Referring to fig. 3, the feature enhancement network adopted in the present embodiment includes an attention gate module, a cross-layer fusion scale perception attention module, a spatial perception self-attention module, and a task perception channel attention module;
in the feature enhancement network of the embodiment, P2, P3, P4, P5, P6 and P7 output by the multi-scale feature extraction network are used as input to obtain intermediate results a2, A3, a4, a5, A6 and a 7; wherein, A2 and A7 are directly obtained from P2 and P7; a3 is obtained by splicing the down-sampling result of A2 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P3 and a lower layer feature map A2 as input to generate a feature map after applying attention; a4 is obtained by splicing the down-sampling result of A3 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P4 and a lower layer feature map A3 as input to generate a feature map after attention is exerted; a5 is obtained by splicing the down-sampling result of A4 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P5 and a lower layer feature map A4 as input to generate a feature map after attention is exerted; a6 is obtained by splicing the down-sampling result of A5 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P6 and a lower layer feature map A5 as input to generate a feature map after attention is exerted; intermediate results A2, A3, A4, A5, A6 and A7 pass through a cross-layer fusion scale perception attention module, a spatial perception self-attention module and a task perception channel attention module which are connected in series to obtain final output feature maps F2, F3, F4, F5, F6 and F7.
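The serial cross-layer-fusion scale-aware, spatial-aware and task-aware attention chain applied to A2 through A7 is reminiscent of dynamic-head style attention. The sketch below is a heavily reduced stand-in intended only to show how the three attentions are applied in series per level: a real spatial-aware module would typically use deformable convolution, and a real task-aware module a dynamic activation, neither of which is reproduced here.

```python
import torch
import torch.nn as nn

class SimpleEnhanceChain(nn.Module):
    """Much-simplified stand-in for the serial scale-aware / spatial-aware /
    task-aware attention chain described in the patent."""
    def __init__(self, channels=256):
        super().__init__()
        self.scale_fc = nn.Linear(channels, 1)                        # scale-aware weight per level
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)    # spatial-aware (simplified)
        self.task = nn.Sequential(                                    # task-aware channel attention
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, feats):               # feats: list of intermediate maps A2 .. A7
        out = []
        for f in feats:
            w = torch.sigmoid(self.scale_fc(f.mean(dim=(2, 3)))).view(-1, 1, 1, 1)
            f = f * w                       # scale-aware re-weighting from a global descriptor
            s = self.spatial(f)             # spatial-aware transformation
            out.append(s * self.task(s) + f)  # task-aware channel attention plus a skip connection
        return out                          # final output feature maps F2 .. F7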
Referring to fig. 4, in the position detection network adopted in this embodiment, the input data passes sequentially through a deformable convolution block, a dynamic rectified linear unit, a convolution block, a deformable convolution block and a dynamic rectified linear unit; the output then takes two paths, one through a convolution block and a Scale layer to obtain the position prediction map, and the other through a convolution block to obtain the key-point thermodynamic diagram.
The position detection network of the present embodiment generates a position prediction map and a key point thermodynamic map by using an enhanced feature map output by a feature enhancement network as an input. The position prediction graph and the key point thermodynamic diagrams have the same size as the input enhanced feature graph, the former gives a predicted value of the position of the engineering vehicle target corresponding to each key point position in the input feature graph, and the latter predicts the confidence coefficient of the key point of the engineering vehicle target at each position in the input feature graph. The positions of the key points of the engineering vehicle can be obtained through the peak value response points on the key point thermodynamic diagram, and the corresponding position prediction values given by the position prediction diagram are combined, so that the positions of the rectangular prediction areas suspected of containing the engineering vehicle targets are obtained.
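A rough sketch of such a head is given below. Plain convolutions and ReLU stand in for the deformable convolution blocks and dynamic ReLU units of the patent (which would require deformable convolution with learned offsets), and the exponential decoding of side distances through a learnable Scale parameter follows common FCOS-style practice as an assumption rather than the patent text.

```python
import torch
import torch.nn as nn

class PositionDetectionHead(nn.Module):
    """Sketch of the head: a shared trunk, then two branches producing a 4-channel
    position prediction map (distances to the four box sides) and a 1-channel
    key-point heat map."""
    def __init__(self, channels=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.loc_conv = nn.Conv2d(channels, 4, 3, padding=1)   # position prediction map
        self.scale = nn.Parameter(torch.ones(1))                # per-level Scale layer
        self.hm_conv = nn.Conv2d(channels, 1, 3, padding=1)     # key-point heat map

    def forward(self, x):
        x = self.trunk(x)
        loc = torch.exp(self.scale * self.loc_conv(x))           # positive side distances
        hm = torch.sigmoid(self.hm_conv(x))                      # key-point confidence
        return loc, hm
```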
Referring to fig. 5, the location correction and category prediction network adopted in this embodiment includes a first correction module, a second correction module, and a third correction module, which are connected in sequence, where the first correction module, the second correction module, and the third correction module are all composed of a clipping and pooling layer, a location sensing category prediction branch, and a location correction branch;
the position correction and category prediction network of the embodiment inputs a rectangular prediction area output by the position prediction network and an enhanced feature map output by a feature enhancement network; when first correction is carried out in a first correction module, cutting out sub-regions of an enhanced feature map through a rectangular prediction region, pooling the sub-regions into feature maps with uniform sizes, inputting a positioning perception type prediction branch and a position correction branch to obtain a positioning perception classification confidence coefficient and a position correction value, and applying the position correction value to the rectangular prediction region to obtain a new corrected rectangular prediction region; the second correction is carried out in the second correction module, the third correction is carried out in the third correction module, and the first correction is carried out in the first correction module; the final confidence of the location-aware classification is an average of the confidences obtained by the three corrections, and the final detection result of the location correction and class prediction network of the embodiment is the result after the last correction.
The multi-scale feature extraction network of the embodiment is a well-trained multi-scale feature extraction network; the training process comprises the following steps:
(1) constructing a self-supervision training positive sample pair;
aiming at unlabeled self-supervised training data, in each round of iterative training N pictures are randomly drawn without replacement as a group, and each image in the group is subjected twice to random data augmentation operations, including cropping, color change, occlusion and rotation, so as to form a pair of positive sample images, thereby obtaining a set of positive sample image pairs (the symbols are given as formula images in the original); that is, each image is randomly augmented twice to obtain an image pair;
(2) the obtained positive sample image pairs are input into the multi-scale feature extraction network to obtain a group of positive sample feature map pairs, and the feature map pairs are then input one by one into a characterization mapping layer composed of a fully connected layer and a rectified linear unit to obtain a group of positive sample characterization pairs; here S represents the number of feature maps of different scales in the multi-scale feature extraction network;
(3) self-supervision training multi-scale feature extraction network;
for each scale, the pairwise similarities of all positive sample characterization pairs are calculated and then averaged; the N characterization pairs yield 2N characterization vectors.
The loss over pairwise similarities of the positive sample characterization pairs is calculated for each scale separately (the formula is given as an image in the original): an indicator function takes 0 when k = i and 1 otherwise, the similarity term is the cosine similarity of a characterization pair, and a temperature-like hyper-parameter scales the similarities.
The final contrastive self-supervised pretext-task loss function combines these per-scale losses (also given as a formula image in the original).
Optimizing this contrastive loss function yields the well-trained multi-scale feature extraction network.
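The ingredients described above (an indicator excluding k = i, cosine similarity over 2N characterization vectors, a temperature hyper-parameter) match the standard NT-Xent contrastive loss. A sketch under that assumption is given below, since the exact formula is only reproduced as an image in the original; the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """z_a, z_b: (N, D) characterizations of the two augmented views of N images.
    Standard NT-Xent form, assumed to correspond to the per-scale loss described above."""
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)       # 2N characterization vectors
    sim = z @ z.t() / temperature                              # pairwise cosine similarities
    n = z_a.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                 # indicator: exclude k == i
    # positives: view i in one branch pairs with view i in the other branch
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```

The per-scale losses computed this way would then be combined over the S feature-map scales to form the pretext-task objective.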
In this embodiment, the multi-scale feature extraction network, the feature enhancement network, the position detection network and the cascaded position correction and category prediction network together form an overall network that is trained as a whole; the training process is as follows:
(1) constructing a training data set:
a series of random data augmentation operations is performed on the labeled training data to obtain an augmented training data set;
(2) training an integral network:
the training target of the overall network of this embodiment is given as a formula image in the original; it combines a foreground-class training target and a background training target, where C_k represents an engineering vehicle category and a dedicated symbol represents the background class; one term corresponds to the position detection network producing a foreground rectangular region, and the other to it producing a background rectangular region;
the loss function of the position detection network of this embodiment is given as a formula image in the original, where l_loc is the Distance-IoU loss and l_hm is the binary focal loss; a subscript (also rendered as an image) denotes the ground-truth label value corresponding to a predicted value; box represents the rectangular-region prediction obtained by the position detection network, and hm represents the key-point thermodynamic diagram predicted by the position detection network;
the loss function of the cascaded position correction and class prediction network of this embodiment is given as a formula image in the original, where K represents the number of corrections; l_cls is the softmax cross-entropy loss and l_reg is the smooth-L1 loss; a subscript (also rendered as an image) denotes the ground-truth label value corresponding to a predicted value; cls denotes the joint prediction of localization accuracy and class produced by the localization-aware class prediction branch, and delta denotes the position correction value predicted by the position correction branch.
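The loss terms named here are all standard building blocks. The sketch below shows how they could be assembled, assuming matched prediction/ground-truth box pairs and using torchvision's distance_box_iou_loss (available in recent torchvision versions) for the Distance-IoU term; the relative weighting of the terms is an assumption, not reproduced from the patent.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import distance_box_iou_loss

def position_detection_loss(box_pred, box_gt, hm_logits, hm_gt, alpha=0.25, gamma=2.0):
    """l_loc (Distance-IoU) on matched box pairs plus l_hm (binary focal) on the heat map.
    box_pred, box_gt: (P, 4) matched boxes; hm_logits, hm_gt: (B, 1, H, W)."""
    l_loc = distance_box_iou_loss(box_pred, box_gt, reduction="mean")
    p = torch.sigmoid(hm_logits)
    l_hm = -(alpha * (1 - p) ** gamma * hm_gt * torch.log(p + 1e-9)
             + (1 - alpha) * p ** gamma * (1 - hm_gt) * torch.log(1 - p + 1e-9)).mean()
    return l_loc + l_hm

def correction_stage_loss(cls_logits, cls_targets, delta_pred, delta_gt):
    """Per-stage term of the cascaded loss: softmax cross-entropy on the localization-aware
    class prediction plus smooth-L1 on the position correction, summed over the K stages."""
    return F.cross_entropy(cls_logits, cls_targets) + F.smooth_l1_loss(delta_pred, delta_gt)
```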
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A robust engineering vehicle identification method based on target detection is characterized by comprising the following steps:
step 1: extracting a feature map from a video frame image to be identified by adopting a multi-scale feature extraction network, and performing feature enhancement by using a feature enhancement network based on an attention mechanism to obtain an enhanced feature map;
the multi-scale feature extraction network consists of a deep convolutional neural network and a multi-scale feature fusion layer; the last 6 layers of output of the deep convolutional neural network are convolved to generate feature maps C2, C3, C4, C5, C6 and C7 with the same channel number; c7 and C6 output P7 and P6 after passing through the multi-scale feature fusion layer; the P5 is obtained by splicing and convolution-fusing the result of the up-sampling of the P6 and the output of the attention gate module, wherein the attention gate module uses C5 and P6 as input to generate a feature map after the attention is exerted; the P4 is obtained by splicing and convolution-fusing the result of the up-sampling of the P5 and the output of the attention gate module, wherein the attention gate module uses C4 and P5 as input to generate a feature map after the attention is exerted; the P3 is obtained by splicing and convolution-fusing the result of the up-sampling of the P4 and the output of the attention gate module, wherein the attention gate module uses C3 and P4 as input to generate a feature map after the attention is exerted; the P2 is obtained by splicing the result of the up-sampling of P3 and the output of an attention gate module and then carrying out convolution fusion, wherein the attention gate module uses C2 and P3 as input to generate a characteristic diagram after applying attention;
the feature enhancement network comprises an attention gate module, a cross-layer fusion scale perception attention module, a space perception self-attention module and a task perception channel attention module;
the feature enhancement network takes P2, P3, P4, P5, P6 and P7 output by the multi-scale feature extraction network as input to obtain intermediate results A2, A3, A4, A5, A6 and A7; wherein A2 and A7 are directly derived from P2 and P7; a3 is obtained by splicing the down-sampling result of A2 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P3 and a lower layer feature map A2 as input to generate a feature map after applying attention; a4 is obtained by splicing the down-sampling result of A3 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P4 and a lower layer feature map A3 as input to generate a feature map after attention is exerted; a5 is obtained by splicing the down-sampling result of A4 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P5 and a lower layer feature map A4 as input to generate a feature map after attention is exerted; a6 is obtained by splicing the down-sampling result of A5 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P6 and a lower layer feature map A5 as input to generate a feature map after attention is exerted; intermediate results A2, A3, A4, A5, A6 and A7 pass through a cross-layer fusion scale perception attention module, a spatial perception self-attention module and a task perception channel attention module which are connected in series to obtain final output characteristic diagrams F2, F3, F4, F5, F6 and F7;
step 2: inputting the enhanced feature map obtained in the step 1 into a position detection network so as to predict the target position of the engineering vehicle, filtering low-quality position prediction through post-processing, then extracting an interested area in the feature map according to the predicted target position, and then inputting a cascaded position correction and category prediction network and performing post-processing to obtain the final target position and the type of the engineering vehicle.
2. The robust engineering vehicle identification method based on object detection as claimed in claim 1, wherein: in the position detection network in step 2, the input data passes sequentially through a deformable convolution block, a dynamic rectified linear unit, a convolution block, a deformable convolution block and a dynamic rectified linear unit; the output then takes two paths, one through a convolution block and a Scale layer to obtain the position prediction map, and the other through a convolution block to obtain the key-point thermodynamic diagram.
3. The robust engineering vehicle identification method based on object detection as claimed in claim 1, wherein: the position correction and category prediction network in the step 2 comprises a first correction module, a second correction module and a third correction module which are sequentially connected, wherein the first correction module, the second correction module and the third correction module are respectively composed of a cutting and pooling layer, a positioning perception category prediction branch and a position correction branch;
the position correction and category prediction network takes as input the rectangular prediction regions output by the position detection network and the enhanced feature map output by the feature enhancement network; in the first correction, performed in the first correction module, sub-regions of the enhanced feature map are cut out according to the rectangular prediction regions and pooled into feature maps of uniform size, which are input into the localization-aware class prediction branch and the position correction branch to obtain a localization-aware classification confidence and a position correction value, and the position correction value is applied to the rectangular prediction regions to obtain new rectangular prediction regions after the first correction; the second correction, performed in the second correction module, and the third correction, performed in the third correction module, follow the same principle as the first correction, except that their input is the rectangular prediction regions after the previous correction together with the enhanced feature map output by the feature enhancement network; the final localization-aware classification confidence is the average value of the confidences obtained by the three corrections, and the final detection result of the position correction and class prediction network is the result after the last correction.
4. The robust engineering vehicle identification method based on object detection according to any one of claims 1-3, characterized in that the detailed implementation of step 1 comprises the following sub-steps:
step 1.1: inputting the video frame image to be identified into the multi-scale feature extraction network, which operates over a set of scales, so as to obtain a group of original feature maps of different scales (the symbols are given as formula images in the original), Ch being the number of channels;
step 1.2: inputting the original feature maps scale by scale into the attention-mechanism-based feature enhancement network, enhancing the response of the regions of interest in the feature maps, and obtaining a group of enhanced feature maps.
5. The robust engineering vehicle identification method based on object detection as claimed in claim 4, wherein the step 2 is implemented by the following sub-steps:
step 2.1: inputting the enhanced feature maps one by one into the position detection network to obtain the key-point thermodynamic diagrams and position prediction maps (the symbols are given as formula images in the original);
step 2.2: extracting the K positions with peak responses from the key-point thermodynamic diagram as the predicted central-domain key points of engineering vehicle targets, computing from these key points and the corresponding position prediction values K rectangular prediction regions that may contain an engineering vehicle, each described by the coordinates of its upper-left corner point together with its width and height, and using D-IoU NMS to filter out part of the low-quality prediction results to obtain no more than M rectangular prediction regions, where M is a preset value;
Step 2.3: cutting and pooling the enhanced feature map obtained in the step 1 according to the rectangular prediction area calculated in the step 2.2 to obtain a feature map with uniform size and containing only the features in the rectangular region of interest;
step 2.4: inputting the feature maps obtained in step 2.3 into the cascaded position correction and class prediction network to obtain a localization-accuracy-aware classification confidence and a position correction value (the symbols are given as formula images in the original), N_cls representing the number of engineering vehicle categories and the correction value correcting the coordinates of the upper-left corner point and the width and height of a rectangular region;
the position correction value predicted during the first correction of the cascaded position correction and class prediction network in step 2.4 is applied to the rectangular prediction regions obtained in step 2.2; the second correction applies the second predicted position correction value to the rectangular prediction regions after the first correction, and the third correction applies the third predicted position correction value to the rectangular prediction regions after the second correction; D-IoU NMS is then used on the result of the last correction to filter out redundant rectangular prediction regions, and regions below a confidence threshold are suppressed, giving the final detection results of all engineering vehicles in the input image, namely the types of the engineering vehicles and the positions of the rectangular regions where they are located.
6. A robust engineering vehicle identification system based on target detection is characterized by comprising the following modules:
the module 1 is used for extracting a feature map from a video frame image to be identified by adopting a multi-scale feature extraction network, and performing feature enhancement by using a feature enhancement network based on an attention mechanism to obtain an enhanced feature map;
the multi-scale feature extraction network consists of a deep convolutional neural network and a multi-scale feature fusion layer; the last 6 layers of output of the deep convolutional neural network are convolved to generate feature maps C2, C3, C4, C5, C6 and C7 with the same channel number; c7 and C6 output P7 and P6 after passing through the multi-scale feature fusion layer; the P5 is obtained by splicing and convolution-fusing the result of the up-sampling of the P6 and the output of the attention gate module, wherein the attention gate module uses C5 and P6 as input to generate a feature map after the attention is exerted; the P4 is obtained by splicing and convolution-fusing the result of the up-sampling of the P5 and the output of the attention gate module, wherein the attention gate module uses C4 and P5 as input to generate a feature map after the attention is exerted; the P3 is obtained by splicing and convolution-fusing the result of the up-sampling of the P4 and the output of the attention gate module, wherein the attention gate module uses C3 and P4 as input to generate a feature map after the attention is exerted; the P2 is obtained by splicing and convolution-fusing the result of the up-sampling of the P3 and the output of the attention gate module, wherein the attention gate module uses C2 and P3 as input to generate a feature map after the attention is exerted;
the feature enhancement network comprises an attention gate module, a cross-layer fusion scale perception attention module, a space perception self-attention module and a task perception channel attention module;
the feature enhancement network takes P2, P3, P4, P5, P6 and P7 output by the multi-scale feature extraction network as input to obtain intermediate results A2, A3, A4, A5, A6 and A7; wherein A2 and A7 are directly derived from P2 and P7; a3 is obtained by splicing the down-sampling result of A2 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P3 and a lower layer feature map A2 as input to generate a feature map after attention is exerted; a4 is obtained by splicing the down-sampling result of A3 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P4 and a lower layer feature map A3 as input to generate a feature map after applying attention; a5 is obtained by splicing the down-sampling result of A4 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P5 and a lower layer feature map A4 as input to generate a feature map after attention is exerted; a6 is obtained by splicing the down-sampling result of A5 and the output of an attention gate module and carrying out convolution fusion, wherein the attention gate module uses P6 and a lower layer feature map A5 as input to generate a feature map after attention is exerted; intermediate results A2, A3, A4, A5, A6 and A7 pass through a cross-layer fusion scale perception attention module, a spatial perception self-attention module and a task perception channel attention module which are connected in series to obtain final output feature maps F2, F3, F4, F5, F6 and F7;
and the module 2 is used for inputting the enhanced feature map acquired by the module 1 into a position detection network so as to predict the target position of the engineering vehicle, filtering low-quality position prediction through post-processing, then extracting an interested region in the feature map according to the predicted target position, and then inputting the cascaded position correction and category prediction network and performing post-processing to obtain the final target position and type of the engineering vehicle.
CN202210538060.3A 2022-05-18 2022-05-18 Robust engineering vehicle identification method and system based on target detection Active CN114648736B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210538060.3A CN114648736B (en) 2022-05-18 2022-05-18 Robust engineering vehicle identification method and system based on target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210538060.3A CN114648736B (en) 2022-05-18 2022-05-18 Robust engineering vehicle identification method and system based on target detection

Publications (2)

Publication Number Publication Date
CN114648736A (en) 2022-06-21
CN114648736B (en) 2022-08-16 (granted)

Family

ID=81997214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210538060.3A Active CN114648736B (en) 2022-05-18 2022-05-18 Robust engineering vehicle identification method and system based on target detection

Country Status (1)

Country Link
CN (1) CN114648736B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117173423A * 2023-08-09 2023-12-05 Shandong University of Finance and Economics Method, system, equipment and medium for detecting small image target

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2780595A1 (en) * 2011-06-22 2012-12-22 Roman Palenychka Method and multi-scale attention system for spatiotemporal change determination and object detection
US20220351535A1 (en) * 2019-12-20 2022-11-03 Intel Corporation Light Weight Multi-Branch and Multi-Scale Person Re-Identification
CN111401201B * 2020-03-10 2023-06-20 Nanjing University of Information Science and Technology Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN114332780A * 2021-11-30 2022-04-12 Wuxi Data Lake Information Technology Co., Ltd. Traffic man-vehicle non-target detection method for small target
CN114332620A * 2021-12-30 2022-04-12 Hangzhou Dianzi University Airborne image vehicle target identification method based on feature fusion and attention mechanism

Also Published As

Publication number Publication date
CN114648736A (en) 2022-06-21

Similar Documents

Publication Publication Date Title
Kalfarisi et al. Crack detection and segmentation using deep learning with 3D reality mesh model for quantitative assessment and integrated visualization
CN110232380B (en) Fire night scene restoration method based on Mask R-CNN neural network
US10037604B2 (en) Multi-cue object detection and analysis
CN110781756A (en) Urban road extraction method and device based on remote sensing image
CN110346699B (en) Insulator discharge information extraction method and device based on ultraviolet image processing technology
CN113344475B (en) Transformer bushing defect identification method and system based on sequence modal decomposition
CN112633231A (en) Fire disaster identification method and device
CN112766137B (en) Dynamic scene foreign matter intrusion detection method based on deep learning
CN113723377A (en) Traffic sign detection method based on LD-SSD network
CN113111727A (en) Method for detecting rotating target in remote sensing scene based on feature alignment
CN113920097A (en) Power equipment state detection method and system based on multi-source image
Tao et al. Automatic smoky vehicle detection from traffic surveillance video based on vehicle rear detection and multi‐feature fusion
CN115830004A (en) Surface defect detection method, device, computer equipment and storage medium
CN114648736B (en) Robust engineering vehicle identification method and system based on target detection
CN108509826B (en) Road identification method and system for remote sensing image
CN111881984A (en) Target detection method and device based on deep learning
Shit et al. An encoder‐decoder based CNN architecture using end to end dehaze and detection network for proper image visualization and detection
Leach et al. Data augmentation for improving deep learning models in building inspections or postdisaster evaluation
CN106778822B (en) Image straight line detection method based on funnel transformation
CN113192057A (en) Target detection method, system, device and storage medium
CN115083008A (en) Moving object detection method, device, equipment and storage medium
CN115346206B (en) License plate detection method based on improved super-resolution deep convolution feature recognition
CN115984378A (en) Track foreign matter detection method, device, equipment and medium
CN116385477A (en) Tower image registration method based on image segmentation
CN113505860B (en) Screening method and device for blind area detection training set, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant