CN113627269B - Pest target detection method based on decoupling classification and regression feature optimal layer technology - Google Patents

Pest target detection method based on decoupling classification and regression feature optimal layer technology Download PDF

Info

Publication number
CN113627269B
CN113627269B CN202110804036.5A
Authority
CN
China
Prior art keywords
network
feature
convolution
layer
convolutions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110804036.5A
Other languages
Chinese (zh)
Other versions
CN113627269A (en)
Inventor
宋良图
陈天娇
王儒敬
谢成军
张洁
杜健铭
李瑞
陈红波
胡海瀛
刘海云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Institutes of Physical Science of CAS
Original Assignee
Hefei Institutes of Physical Science of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Institutes of Physical Science of CAS filed Critical Hefei Institutes of Physical Science of CAS
Priority to CN202110804036.5A priority Critical patent/CN113627269B/en
Publication of CN113627269A publication Critical patent/CN113627269A/en
Application granted granted Critical
Publication of CN113627269B publication Critical patent/CN113627269B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

Compared with the prior art, the invention overcomes the defect of a low pest recognition rate caused by large size differences among insect bodies in pest-killing lamps. The invention comprises the following steps: acquiring a training sample set; constructing a pest target detection network; training the pest target detection network; acquiring a pest image sample to be detected; and detecting and locating pest targets. Relative to the common single-feature-layer setting, the invention assigns the classification and regression tasks to different feature layers and obtains the final detection results separately, so that insect bodies of differing sizes in the pest-killing-lamp environment are detected in a differentiated manner, the detection and recognition rate in that environment is improved, and practical application requirements are met.

Description

Pest target detection method based on decoupling classification and regression feature optimal layer technology
Technical Field
The invention relates to the technical field of pest image detection, and in particular to a pest target detection method based on a decoupled classification and regression feature optimal-layer technique.
Background
With the rapid development of technologies such as the Internet of Things, cloud computing, mobile Internet and intelligent terminals, big data has rapidly entered the public eye. Agricultural big data is now driving agricultural production toward precision and intelligence, and data is gradually becoming an emerging production factor in modern agriculture.
Big data on crop pests is regional, seasonal, diverse and periodic, with wide-ranging sources, varied types and a complex structure. To meet the demand for remote, real-time pest monitoring with automatic acquisition of pest-condition images, forecasting lamps integrate an automatic pest-information acquisition system that includes a high-definition camera; at set time intervals the lamp supplements light, automatically photographs, stores and remotely transmits the insect bodies it captures, yielding high-resolution images that are clear and free of background clutter. The targets are mainly pests such as rice leaf rollers, armyworms, cotton bollworms, Athetis lepigone, Prodenia litura, dark black chafers (Holotrichia parallela), copper-green chafers (Anomala corpulenta) and mole crickets. Although these pests are small relative to the whole picture, large size differences remain among them; the difference between rice planthoppers and mole crickets, for example, is obvious. After the corresponding regions of interest are obtained, optimal feature layers for classification and regression must therefore be sought for target regions with such large size differences.
Existing methods use the same layer of features, or a combination of different layers, as the basis for both classification and regression; that is, the same features serve different tasks, even though classification favors higher-level semantic information while regression favors lower-level position information. In practice the insect bodies collected in a pest-killing lamp are uncertain and differ greatly in size, so the undifferentiated classification-and-regression technique of the prior art yields a low recognition rate and a high error rate for pests with large size differences, and can hardly satisfy practical application in the pest-killing-lamp environment.
Disclosure of Invention
The invention aims to overcome the defect of the prior art that large size differences among insect bodies in a pest-killing lamp lead to a low pest recognition rate, and provides a pest target detection method based on a decoupled classification and regression feature optimal-layer technique to solve this problem.
In order to achieve the above object, the technical scheme of the present invention is as follows:
a pest target detection method based on decoupling classification and regression characteristic optimal layer technology comprises the following steps:
11) Acquisition of a training sample set: acquire pest image samples and preprocess them to form a training sample set;
12) Construction of the pest target detection network: construct a pest target detection network based on a basic feature representation network, a feature pyramid network and a target region extraction network;
13) Training of the pest target detection network: train the pest target detection network based on the decoupled classification and regression feature optimal-layer technique using the training samples;
14) Acquisition of pest image samples to be detected: acquire a pest image sample to be detected and preprocess it;
15) Detection and localization of pest targets: input the preprocessed pest image sample into the trained pest target detection network and locate the pest positions in the pest image.
The construction of the pest target detection network comprises the following steps:
21) Set the first layer of the pest target detection network as the basic feature representation network, the second layer as the feature pyramid network, and the third layer as the target region extraction network;
22) Set the basic feature representation network as a residual network that mines the most representative image feature representation from the depth, width and receptive-field factors of its convolution blocks, acting as the feature extractor;
23) Set the feature pyramid network as a laterally connected hierarchical structure that passes semantic information from high-level features down to low-level features; in the feature pyramid network the feature extraction process is divided into two parts: a bottom-up process, which extracts features from the backbone network, and a top-down lateral-connection fusion process;
24) Set the first-stage network of the target region extraction network: set reference boxes of different scales for the different feature layers to locate and regress preliminary candidate target regions, namely reference boxes of size 16×16 for the first layer of the feature pyramid, 32×32 for the second layer, 64×64 for the third layer, 128×128 for the fourth layer, 256×256 for the fifth layer, and 512×512 for the sixth layer;
25) Set the second-stage network of the target region extraction network: for the preliminary candidate target regions, search for the optimal feature layers for classification and for regression separately, and locate and classify accordingly.
The training of the pest target detection network comprises the following steps:
31) Train the basic feature representation network: input a pest image I of width w from the training sample set into the basic feature representation network and extract features through it:
311) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
312) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2; the first convolution of the first block is a stride-2 downsampling convolution;
313) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
314) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
315) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
32) Train the feature pyramid network:
input the feature maps c1, c2, c3, c4 and c5 (which have different channel counts) into the feature pyramid network and perform channel normalization with 1×1×256 convolutions to obtain the lateral maps m1, m2, m3, m4 and m5 respectively, with M5 = m5; upsample M5 and add it to m4 to obtain the feature map M4, upsample M4 and add it to m3 to obtain M3, upsample M3 and add it to m2 to obtain M2, and upsample M2 and add it to m1 to obtain M1; the expressions are as follows:
M4=m4+upsampling(M5),
M3=m3+upsampling(M4),
M2=m2+upsampling(M3),
M1=m1+upsampling(M2);
in order to eliminate the aliasing effect of upsampling, P1, P2, P3, P4 and P5 are obtained by applying channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 respectively, and the feature map P6 is obtained by downsampling P5; the expressions are as follows:
P1 = conv3×3(M1),
P2 = conv3×3(M2),
P3 = conv3×3(M3),
P4 = conv3×3(M4),
P5 = conv3×3(M5),
P6 = downsampling(P5);
where w = 2·w1 and wi = 2·w(i+1) for i ∈ {1,2,3,4,5}, w being the input image width and wi the width of feature layer Pi;
33) Train the first-stage network of the target region extraction network:
for the different feature layers P1, P2, P3, P4, P5 and P6, the reference-box areas are set to 16×16, 32×32, 64×64, 128×128, 256×256 and 512×512 respectively, and each area is paired with 3 aspect ratios, namely 1:2, 1:1 and 2:1, giving 18 reference-box configurations in total for the first-stage network of the target region extraction network;
the P1 layer thus carries w1×h1×3 reference boxes in total; if a reference box has the highest IoU with some ground-truth box, or its IoU with any ground-truth box exceeds 0.7, it is set as a positive sample; if its IoU with every ground-truth box is below 0.3, it is set as a negative sample; the first-stage network of the target region extraction network is learned from these positive and negative samples;
the input of the first-stage network is the feature map of each feature layer; the network consists of a 3×3×256 convolution followed by a parallel 1×1×(3×4) regression branch and a 1×1×(3×2) classification branch, and its parameters are learned by back-propagating the loss between network outputs and ground truth; finally the preliminary target regions are extracted by the first-stage network of the target region extraction network;
34) Train the second-stage network of the target region extraction network:
a preliminary target region whose IoU with a ground-truth box exceeds 0.5 is set as a positive sample; one whose IoU with the ground-truth box is below 0.3 is set as a negative sample;
the classification and regression tasks are decoupled for the preliminary target regions: the optimal classification feature layer and the optimal localization feature layer are searched for separately, and classification and localization are carried out accordingly;
the feature layer is selected by
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; given the width w and height h of the preliminary target region and the common setting k0, the layer-k features would serve as the single optimal feature layer for second-stage classification and regression of the region;
the k−1 layer is instead selected for localization and the k+1 layer for classification of the preliminary target region; after the optimal feature layers are found, the region is classified on the optimal classification feature layer and then located on the optimal localization feature layer; network parameters are learned by back-propagating the loss between network outputs and ground truth, with smooth L1 loss for localization and softmax cross-entropy loss for classification.
The detection of pest objects comprises the following steps:
41) Input a pest image sample I of width w to be detected into the basic feature representation network:
411) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
412) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2;
413) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3;
414) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4;
415) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5;
42) Input the feature maps c1, c2, c3, c4 and c5 (with different channel counts) into the feature pyramid network and perform channel normalization with 1×1×256 convolutions to obtain m1, m2, m3, m4 and m5 respectively, with M5 = m5; upsample M5 and add it to m4 to obtain the feature map M4, and obtain M3, M2 and M1 in the same way; to eliminate the aliasing effect of upsampling, apply channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 to obtain P1, P2, P3, P4 and P5 respectively, and downsample P5 to obtain the feature map P6;
43) Input into the first-stage network of the target region extraction network:
set reference boxes of different scales for the different feature layers P1, P2, P3, P4, P5 and P6 and input them into the trained first-stage network of the target region extraction network, which applies a 3×3×256 convolution and then extracts the preliminary target regions through the parallel 1×1×(3×4) regression branch and 1×1×(3×2) classification branch;
44) Input into the second-stage network of the target region extraction network:
the classification and regression tasks are decoupled for the preliminary target regions, and the optimal classification feature layer and the optimal localization feature layer are searched for separately for classification and localization,
according to the feature-layer selection rule
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; the k−1 layer is selected for localization and the k+1 layer for classification; the preliminary target region is then classified on the optimal classification feature layer and located on the optimal localization feature layer to obtain the final detection target.
Advantageous effects
Compared with the prior art, the pest target detection method based on the decoupled classification and regression feature optimal-layer technique assigns the classification and regression tasks to different feature layers, relative to the common single-feature-layer setting, and obtains the final detection results separately. Insect bodies of differing sizes in the pest-killing-lamp environment are thus detected in a differentiated manner, the detection and recognition rate in that environment is improved, and the requirements of practical application are met.
Drawings
FIG. 1 is the process flow diagram of the present invention;
FIG. 2 illustrates low-level features in pest image detection;
FIG. 3 illustrates high-level features in pest image detection.
Detailed Description
For a further understanding of the invention, its structural features and the advantages it achieves, preferred embodiments are described below in conjunction with the accompanying drawings:
as shown in fig. 1, the pest target detection method based on the decoupling classification and regression characteristic optimal layer technology of the invention comprises the following steps:
first, acquiring a training sample set: and acquiring pest image samples and preprocessing to form a training sample set.
Secondly, constructing a pest target detection network: and constructing a pest target detection network based on the basic feature representation network, the feature pyramid network and the target area extraction network.
The construction of the pest target detection network comprises the following steps:
(1) Set the first layer of the pest target detection network as the basic feature representation network, the second layer as the feature pyramid network, and the third layer as the target region extraction network;
(2) Set the basic feature representation network as a residual network that mines the most representative image feature representation from the depth, width and receptive-field factors of its convolution blocks, acting as the feature extractor;
(3) Set the feature pyramid network as a laterally connected hierarchical structure that passes semantic information from high-level features down to low-level features; in the feature pyramid network the feature extraction process is divided into two parts: a bottom-up process, which extracts features from the backbone network, and a top-down lateral-connection fusion process;
(4) Set the first-stage network of the target region extraction network: set reference boxes of different scales for the different feature layers to locate and regress preliminary candidate target regions, namely reference boxes of size 16×16 for the first layer of the feature pyramid, 32×32 for the second layer, 64×64 for the third layer, 128×128 for the fourth layer, 256×256 for the fifth layer, and 512×512 for the sixth layer;
(5) Set the second-stage network of the target region extraction network: for the preliminary candidate target regions, search for the optimal feature layers for classification and for regression separately, and locate and classify accordingly.
Third, train the pest target detection network: train the pest target detection network based on the decoupled classification and regression feature optimal-layer technique using the training samples.
Because the object detection task comprises the two sub-tasks of classification and localization, the features the two tasks require differ: classification needs high-level semantic information, while localization needs low-level position information, and as shown in fig. 2 and fig. 3, high-level semantic information and low-level position information also present themselves differently. The feature layers most favorable to the two tasks are therefore not the same, so the most favorable features must be sought separately in the pyramid layers for each task and the results finally combined.
The training of the pest target detection network comprises the following steps:
(1) Train the basic feature representation network:
input a pest image I of width w from the training sample set into the basic feature representation network and extract features through it (a minimal code sketch of this stage follows the list):
a1) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
a2) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2; the first convolution of the first block is a stride-2 downsampling convolution;
a3) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
a4) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
a5) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1.
(2) Training the feature pyramid network: because c1, c2, c3, c4 and c5 are progressively downscaled by pooled downsampling and strided convolutions, their semantic information becomes richer while their spatial position information diminishes; the feature pyramid network therefore fuses the features.
Inputting feature graphs c1, c2, c3, c4 and c5 with different channels into a feature pyramid network respectively, carrying out channel normalization through convolution of 1 x 256 to obtain M1, M2, M3, M4 and M5 respectively, carrying out up-sampling and M4 addition on the feature graphs M5 to obtain a feature graph M4, carrying out up-sampling and M3 addition on the feature graph M4 to obtain a feature graph M3, carrying out up-sampling and M2 addition on the feature graph M3 to obtain a feature graph M2, and carrying out up-sampling and M1 addition on the feature graph M2 to obtain a feature graph M1; the expression is as follows:
M4=m4+upsampling(M5),
M3=m3+upsampling(M4),
M2=m2+upsampling(M3),
M1=m1+upsampling(M2);
in order to eliminate the aliasing effect of upsampling, P1, P2, P3, P4 and P5 are obtained by applying channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 respectively, and the feature map P6 is obtained by downsampling P5; the expressions are as follows:
P1 = conv3×3(M1),
P2 = conv3×3(M2),
P3 = conv3×3(M3),
P4 = conv3×3(M4),
P5 = conv3×3(M5),
P6 = downsampling(P5);
where w = 2·w1 and wi = 2·w(i+1) for i ∈ {1,2,3,4,5}, w being the input image width and wi the width of feature layer Pi.
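A minimal sketch of this lateral-connection fusion follows, assuming 256-channel outputs throughout and nearest-neighbor upsampling (the interpolation mode is not stated in the text); the class name `FeaturePyramid` is illustrative, and its input-channel defaults correspond to c1–c5 of the backbone described above. The stride-2 max pooling at the end implements the downsampling of P5 into P6:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeaturePyramid(nn.Module):
    """Top-down fusion of c1..c5 into P1..P6 as described above."""
    def __init__(self, in_channels=(64, 256, 512, 1024, 2048), channels=256):
        super().__init__()
        # 1x1x256 lateral convolutions: channel normalization of c1..c5 into m1..m5
        self.lateral = nn.ModuleList([nn.Conv2d(c, channels, 1) for c in in_channels])
        # channel-preserving 3x3 convolutions: suppress upsampling aliasing -> P1..P5
        self.smooth = nn.ModuleList([nn.Conv2d(channels, channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):            # feats = (c1, c2, c3, c4, c5)
        m = [lat(c) for lat, c in zip(self.lateral, feats)]
        M = list(m)                      # M5 = m5
        for i in (3, 2, 1, 0):           # Mi = mi + upsampling(M(i+1))
            M[i] = m[i] + F.interpolate(M[i + 1], size=m[i].shape[-2:], mode="nearest")
        P = [sm(x) for sm, x in zip(self.smooth, M)]
        P.append(F.max_pool2d(P[-1], kernel_size=1, stride=2))  # P6 = downsampling(P5)
        return P                         # [P1, P2, P3, P4, P5, P6]
```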
(3) Train the first-stage network of the target region extraction network:
for the different feature layers P1, P2, P3, P4, P5 and P6, the reference-box areas are set to 16×16, 32×32, 64×64, 128×128, 256×256 and 512×512 respectively, and each area is paired with 3 aspect ratios, namely 1:2, 1:1 and 2:1, giving 18 reference-box configurations in total for the first-stage network of the target region extraction network;
the P1 layer thus carries w1×h1×3 reference boxes in total; if a reference box has the highest IoU with some ground-truth box, or its IoU with any ground-truth box exceeds 0.7, it is set as a positive sample; if its IoU with every ground-truth box is below 0.3, it is set as a negative sample; the first-stage network of the target region extraction network is learned from these positive and negative samples;
the input of the first-stage network is the feature map of each feature layer; the network consists of a 3×3×256 convolution followed by a parallel 1×1×(3×4) regression branch and a 1×1×(3×2) classification branch, and its parameters are learned by back-propagating the loss between network outputs and ground truth; finally the preliminary target regions are extracted by the first-stage network of the target region extraction network.
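A sketch of this first-stage head under the stated settings (3 reference boxes per location, 2 classes, 4 box offsets) is given below; reference-box generation and the IoU-based positive/negative sampling described above are elided, and `FirstStageHead` is an illustrative name:

```python
import torch.nn as nn
import torch.nn.functional as F

class FirstStageHead(nn.Module):
    """Head shared across P1..P6: a 3x3x256 convolution followed by parallel
    1x1x(3x2) classification and 1x1x(3x4) regression branches."""
    def __init__(self, channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.cls = nn.Conv2d(channels, num_anchors * 2, 1)  # object / background scores
        self.reg = nn.Conv2d(channels, num_anchors * 4, 1)  # (dx, dy, dw, dh) offsets

    def forward(self, p):
        t = F.relu(self.conv(p))
        return self.cls(t), self.reg(t)
```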
(4) Training the second-stage network of the target region extraction network: because classification and localization are different tasks, the features most beneficial to each also differ, and those features may be scattered across different layers. Object detection comprises the two tasks of classification and localization, yet current detection methods, although many of them search for an optimal layer or a combination of layers, still use the same layer or the same features for both tasks.
A preliminary target region whose IoU with a ground-truth box exceeds 0.5 is set as a positive sample; one whose IoU with the ground-truth box is below 0.3 is set as a negative sample.
The classification and regression tasks are decoupled for the preliminary target regions: the optimal classification feature layer and the optimal localization feature layer are searched for separately, and classification and localization are carried out accordingly.
The feature layer is selected by
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; given the width w and height h of the preliminary target region and the common setting k0, the layer-k features would serve as the single optimal feature layer for second-stage classification and regression of the region.
Because classification needs features with richer semantic information while localization needs features with richer spatial position information, the k−1 layer is instead selected for localization and the k+1 layer for classification of the preliminary target region; alternatively, the optimal layer for each task can be found in other ways, for example as the layer with the minimum loss for that task. After the optimal feature layers are found, the candidate region is classified on the optimal classification feature layer and then located on the optimal localization feature layer; network parameters are learned by back-propagating the loss between network outputs and ground truth, with smooth L1 loss for localization and softmax cross-entropy loss for classification.
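The layer-selection rule and the decoupled choice of layers can be written as the short function below; the clamping of k−1 and k+1 to the valid pyramid range [1, 6] is an assumption the text does not spell out. As an example, a 64×64 proposal gives k = ⌊4 + log2(64/224)⌋ = 2, so layer 1 is used for localization and layer 3 for classification.

```python
import math

def select_levels(w, h, k0=4, k_min=1, k_max=6):
    """k = floor(k0 + log2(sqrt(w*h) / 224)) for a proposal of width w and height h;
    decoupled choice: layer k-1 for localization, layer k+1 for classification."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    loc_k = max(k_min, min(k_max, k - 1))  # lower layer: richer spatial position information
    cls_k = max(k_min, min(k_max, k + 1))  # higher layer: richer semantic information
    return loc_k, cls_k

# Second-stage losses as described in the text (PyTorch equivalents, assumed):
#   localization:   F.smooth_l1_loss(pred_boxes, target_boxes)
#   classification: F.cross_entropy(pred_logits, target_labels)  # softmax cross entropy
```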
Fourth, acquire a pest image sample to be detected: acquire the pest image sample to be detected and preprocess it.
Fifth, detect and locate pest targets: input the preprocessed pest image sample into the trained pest target detection network and locate the pest positions in the pest image.
The detection of pest objects comprises the following steps:
(1) Input a pest image sample I of width w to be detected into the basic feature representation network:
b1) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
b2) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2;
b3) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3;
b4) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4;
b5) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5.
(2) Input the feature maps c1, c2, c3, c4 and c5 (with different channel counts) into the feature pyramid network and perform channel normalization with 1×1×256 convolutions to obtain m1, m2, m3, m4 and m5 respectively, with M5 = m5; upsample M5 and add it to m4 to obtain the feature map M4, and obtain M3, M2 and M1 in the same way; to eliminate the aliasing effect of upsampling, apply channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 to obtain P1, P2, P3, P4 and P5 respectively, and downsample P5 to obtain the feature map P6.
(3) Input into the first-stage network of the target region extraction network:
for the different feature layers P1, P2, P3, P4, P5 and P6, set reference boxes of different scales and input them into the trained first-stage network of the target region extraction network, which applies a 3×3×256 convolution and then extracts the preliminary target regions through the parallel 1×1×(3×4) regression branch and 1×1×(3×2) classification branch.
(4) Input into the second-stage network of the target region extraction network:
the classification and regression tasks are decoupled for the preliminary target regions, and the optimal classification feature layer and the optimal localization feature layer are searched for separately for classification and localization,
according to the feature-layer selection rule
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; the k−1 layer is used for localization and the k+1 layer for classification; the preliminary target region is then classified on the optimal classification feature layer and located on the optimal localization feature layer to obtain the final detection target.
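At inference time the stages chain as in the hypothetical sketch below, reusing the illustrative modules from the earlier sketches; decoding of the first-stage outputs into preliminary target regions and the per-proposal second stage (feature pooling on the decoupled layers) are elided:

```python
import torch

def detect(image, backbone, fpn, head):
    """Hypothetical end-to-end forward pass over one preprocessed image tensor (CHW)."""
    with torch.no_grad():
        feats = backbone(image.unsqueeze(0))      # c1..c5
        pyramid = fpn(feats)                      # P1..P6
        first_stage = [head(p) for p in pyramid]  # per-layer (cls, reg) maps
    # ...decode `first_stage` into preliminary target regions, then for each region:
    #    loc_k, cls_k = select_levels(w, h)
    #    classify on pyramid[cls_k - 1], locate on pyramid[loc_k - 1]  (1-indexed layers)
    return first_stage
```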
The foregoing has shown and described the basic principles, principal features and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and descriptions merely illustrate its principles, and various changes and improvements may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. A pest target detection method based on decoupled classification and regression feature optimal-layer technology, characterized by comprising the following steps:
11) Acquisition of a training sample set: acquiring pest image samples and preprocessing them to form a training sample set;
12) Construction of the pest target detection network: constructing a pest target detection network based on a basic feature representation network, a feature pyramid network and a target region extraction network;
13) Training of the pest target detection network: training the pest target detection network based on the decoupled classification and regression feature optimal-layer technique using the training samples;
the training of the pest target detection network comprises the following steps:
131) Training the basic feature representation network: inputting a pest image I of width w from the training sample set into the basic feature representation network and extracting features through it:
1311) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
1312) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2; the first convolution of the first block is a stride-2 downsampling convolution;
1313) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
1314) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
1315) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5; the first convolution of the first block is a stride-2 downsampling convolution, and the remaining convolutions use stride 1;
132) Training the feature pyramid network:
inputting the feature maps c1, c2, c3, c4 and c5 (which have different channel counts) into the feature pyramid network and performing channel normalization with 1×1×256 convolutions to obtain the lateral maps m1, m2, m3, m4 and m5 respectively, with M5 = m5; upsampling M5 and adding it to m4 to obtain the feature map M4, upsampling M4 and adding it to m3 to obtain M3, upsampling M3 and adding it to m2 to obtain M2, and upsampling M2 and adding it to m1 to obtain M1; the expressions are as follows:
M4=m4+upsampling(M5),
M3=m3+upsampling(M4),
M2=m2+upsampling(M3),
M1=m1+upsampling(M2);
in order to eliminate the aliasing effect of upsampling, P1, P2, P3, P4 and P5 are obtained by applying channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 respectively, and the feature map P6 is obtained by downsampling P5; the expressions are as follows:
P1 = conv3×3(M1),
P2 = conv3×3(M2),
P3 = conv3×3(M3),
P4 = conv3×3(M4),
P5 = conv3×3(M5),
P6 = downsampling(P5);
where w = 2·w1 and wi = 2·w(i+1) for i ∈ {1,2,3,4,5}, w being the input image width and wi the width of feature layer Pi;
133) Training the first-stage network of the target region extraction network:
for the different feature layers P1, P2, P3, P4, P5 and P6, the reference-box areas are set to 16×16, 32×32, 64×64, 128×128, 256×256 and 512×512 respectively, and each area is paired with 3 aspect ratios, namely 1:2, 1:1 and 2:1, giving 18 reference-box configurations in total for the first-stage network of the target region extraction network;
the P1 layer thus carries w1×h1×3 reference boxes in total; if a reference box has the highest IoU with some ground-truth box, or its IoU with any ground-truth box exceeds 0.7, it is set as a positive sample; if its IoU with every ground-truth box is below 0.3, it is set as a negative sample; the first-stage network of the target region extraction network is learned from these positive and negative samples;
the input of the first-stage network is the feature map of each feature layer; the network consists of a 3×3×256 convolution followed by a parallel 1×1×(3×4) regression branch and a 1×1×(3×2) classification branch, and its parameters are learned by back-propagating the loss between network outputs and ground truth; finally the preliminary target regions are extracted by the first-stage network of the target region extraction network;
134) Training the second-stage network of the target region extraction network:
a preliminary target region whose IoU with a ground-truth box exceeds 0.5 is set as a positive sample; one whose IoU with the ground-truth box is below 0.3 is set as a negative sample;
the classification and regression tasks are decoupled for the preliminary target regions: the optimal classification feature layer and the optimal localization feature layer are searched for separately, and classification and localization are carried out accordingly;
the feature layer is selected by
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; given the width w and height h of the preliminary target region and the common setting k0, the layer-k features would serve as the single optimal feature layer for second-stage classification and regression of the region;
the k−1 layer is instead selected for localization and the k+1 layer for classification of the preliminary target region; after the optimal feature layers are found, the region is classified on the optimal classification feature layer and located on the optimal localization feature layer; network parameters are learned by back-propagating the loss between network outputs and ground truth, with smooth L1 loss for localization and softmax cross-entropy loss for classification;
14) Acquisition of pest image samples to be detected: acquiring a pest image sample to be detected and preprocessing it;
15) Detection and localization of pest targets: inputting the preprocessed pest image sample into the trained pest target detection network and locating the pest positions in the pest image.
2. The pest target detection method based on decoupled classification and regression feature optimal-layer technology according to claim 1, wherein the construction of the pest target detection network comprises the following steps:
21) Setting the first layer of the pest target detection network as the basic feature representation network, the second layer as the feature pyramid network, and the third layer as the target region extraction network;
22) Setting the basic feature representation network as a residual network that mines the most representative image feature representation from the depth, width and receptive-field factors of its convolution blocks, acting as the feature extractor;
23) Setting the feature pyramid network as a laterally connected hierarchical structure that passes semantic information from high-level features down to low-level features; in the feature pyramid network the feature extraction process is divided into two parts: a bottom-up process, which extracts features from the backbone network, and a top-down lateral-connection fusion process;
24) Setting the first-stage network of the target region extraction network: setting reference boxes of different scales for the different feature layers to locate and regress preliminary candidate target regions, namely reference boxes of size 16×16 for the first layer of the feature pyramid, 32×32 for the second layer, 64×64 for the third layer, 128×128 for the fourth layer, 256×256 for the fifth layer, and 512×512 for the sixth layer;
25) Setting the second-stage network of the target region extraction network: for the preliminary candidate target regions, searching for the optimal feature layers for classification and for regression separately, and locating and classifying accordingly.
3. The pest target detection method based on decoupled classification and regression feature optimal-layer technology according to claim 1, wherein the detection of pest objects comprises the following steps:
31) Inputting a pest image sample I of width w to be detected into the basic feature representation network:
311) Via conv1: one 7×7×64 convolution with stride 2, followed by batch normalization and a nonlinear activation function; the output is denoted c1;
312) Via conv2_x: first 3×3 max pooling with stride 2, then 3 convolution blocks, each containing a 1×1×64, a 3×3×64 and a 1×1×256 convolution; the output is denoted c2;
313) Via conv3_x: 4 convolution blocks, each containing a 1×1×128, a 3×3×128 and a 1×1×512 convolution; the output is denoted c3;
314) Via conv4_x: 23 convolution blocks, each containing a 1×1×256, a 3×3×256 and a 1×1×1024 convolution; the output is denoted c4;
315) Via conv5_x: 3 convolution blocks, each containing a 1×1×512, a 3×3×512 and a 1×1×2048 convolution; the output is denoted c5;
32) Inputting the feature maps c1, c2, c3, c4 and c5 (with different channel counts) into the feature pyramid network and performing channel normalization with 1×1×256 convolutions to obtain m1, m2, m3, m4 and m5 respectively, with M5 = m5; upsampling M5 and adding it to m4 to obtain the feature map M4, and obtaining M3, M2 and M1 in the same way; to eliminate the aliasing effect of upsampling, applying channel-preserving 3×3 convolutions to M1, M2, M3, M4 and M5 to obtain P1, P2, P3, P4 and P5 respectively, and downsampling P5 to obtain the feature map P6;
33) Inputting into the first-stage network of the target region extraction network:
setting reference boxes of different scales for the different feature layers P1, P2, P3, P4, P5 and P6 and inputting them into the trained first-stage network of the target region extraction network, which applies a 3×3×256 convolution and then extracts the preliminary target regions through the parallel 1×1×(3×4) regression branch and 1×1×(3×2) classification branch;
34) Inputting into the second-stage network of the target region extraction network:
the classification and regression tasks are decoupled for the preliminary target regions, and the optimal classification feature layer and the optimal localization feature layer are searched for separately for classification and localization,
according to the feature-layer selection rule
k = ⌊k0 + log2(√(w·h)/224)⌋,
where k0 is commonly set to 4; the k−1 layer is used for localization and the k+1 layer for classification; the preliminary target region is then classified on the optimal classification feature layer and located on the optimal localization feature layer to obtain the final detection target.
CN202110804036.5A 2021-07-16 2021-07-16 Pest target detection method based on decoupling classification and regression feature optimal layer technology Active CN113627269B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110804036.5A CN113627269B (en) 2021-07-16 2021-07-16 Pest target detection method based on decoupling classification and regression feature optimal layer technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110804036.5A CN113627269B (en) 2021-07-16 2021-07-16 Pest target detection method based on decoupling classification and regression feature optimal layer technology

Publications (2)

Publication Number Publication Date
CN113627269A CN113627269A (en) 2021-11-09
CN113627269B true CN113627269B (en) 2023-04-28

Family

ID=78379903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110804036.5A Active CN113627269B (en) 2021-07-16 2021-07-16 Pest target detection method based on decoupling classification and regression feature optimal layer technology

Country Status (1)

Country Link
CN (1) CN113627269B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10713794B1 (en) * 2017-03-16 2020-07-14 Facebook, Inc. Method and system for using machine-learning for object instance segmentation
CN111444865A (en) * 2020-03-31 2020-07-24 盐城禅图智能科技有限公司 Multi-scale target detection method based on gradual refinement
CN111738174A (en) * 2020-06-25 2020-10-02 中国科学院自动化研究所 Human body example analysis method and system based on depth decoupling
CN112183450A (en) * 2020-10-15 2021-01-05 成都思晗科技股份有限公司 Multi-target tracking method
CN112651404A (en) * 2020-12-22 2021-04-13 山东师范大学 Green fruit efficient segmentation method and system based on anchor-frame-free detector
CN112560876A (en) * 2021-02-23 2021-03-26 中国科学院自动化研究所 Single-stage small sample target detection method for decoupling measurement

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Shin SJ et al. "Hierarchical Multi-Label Object Detection Framework for Remote Sensing Images." MDPI, 2020, pp. 1-9. *
李梦溪. "Image Semantic Segmentation Based on Feature Fusion and Hard Example Mining." China Master's Theses Full-text Database, Information Science and Technology, 2018(12): I138-1734. *
陈天娇 et al. "Intelligent Identification System of Diseases and Pests Based on Deep Learning." China Plant Protection (中国植保导刊), 2019(4): 26-34. *

Also Published As

Publication number Publication date
CN113627269A (en) 2021-11-09

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109255334B (en) Remote sensing image ground feature classification method based on deep learning semantic segmentation network
AU2019101133A4 (en) Fast vehicle detection using augmented dataset based on RetinaNet
CN108399362B (en) Rapid pedestrian detection method and device
CN110148120B (en) Intelligent disease identification method and system based on CNN and transfer learning
CN108734208B (en) Multi-source heterogeneous data fusion system based on multi-mode deep migration learning mechanism
CN111222396B (en) All-weather multispectral pedestrian detection method
Komorowski et al. Minkloc++: lidar and monocular image fusion for place recognition
CN110796009A (en) Method and system for detecting marine vessel based on multi-scale convolution neural network model
CN108830254B (en) Fine-grained vehicle type detection and identification method based on data balance strategy and intensive attention network
CN106815576B (en) Target tracking method based on continuous space-time confidence map and semi-supervised extreme learning machine
CN111723829A (en) Full-convolution target detection method based on attention mask fusion
CN114581456B (en) Multi-image segmentation model construction method, image detection method and device
CN114283162A (en) Real scene image segmentation method based on contrast self-supervision learning
CN110969182A (en) Convolutional neural network construction method and system based on farmland image
CN115953630A (en) Cross-domain small sample image classification method based on global-local knowledge distillation
CN112651381A (en) Method and device for identifying livestock in video image based on convolutional neural network
CN116883650A (en) Image-level weak supervision semantic segmentation method based on attention and local stitching
CN116740418A (en) Target detection method based on graph reconstruction network
CN117649610B (en) YOLOv-based pest detection method and YOLOv-based pest detection system
CN117830788A (en) Image target detection method for multi-source information fusion
CN113627269B (en) Pest target detection method based on decoupling classification and regression feature optimal layer technology
CN114937239A (en) Pedestrian multi-target tracking identification method and tracking identification device
CN115988260A (en) Image processing method and device and electronic equipment
Leipnitz et al. The effect of image resolution in the human presence detection: A case study on real-world image data

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant