CN112270366B - Micro target detection method based on self-adaptive multi-feature fusion


Info

Publication number: CN112270366B
Application number: CN202011204130.9A
Authority: CN (China)
Prior art keywords: layer, feature, network, fusion, convolution
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN112270366A
Inventors: 朱智勤, 张源川, 李嫄源, 冒睿睿, 李鹏华
Current Assignee: Chongqing University of Post and Telecommunications
Original Assignee: Chongqing University of Post and Telecommunications
Priority/Filing date: 2020-11-02
Application filed by Chongqing University of Post and Telecommunications
Publication of CN112270366A: 2021-01-26
Application granted; publication of CN112270366B: 2022-08-26

Classifications

    • G06F 18/253 — Pattern recognition; fusion techniques of extracted features
    • G06F 18/23213 — Non-hierarchical clustering using statistics or function optimisation with a fixed number of clusters, e.g. k-means clustering
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
    • G06T 3/4007 — Geometric image transformation; interpolation-based scaling, e.g. bilinear interpolation
    • G06V 2201/07 — Indexing scheme for image or video recognition or understanding; target detection

Abstract

The invention relates to a tiny target detection method based on self-adaptive multi-feature fusion, and belongs to the field of target detection. The extracted features are first fused through a conventional feature pyramid structure; on this basis, an additional path is designed for further feature fusion, and multi-scale fusion is then performed with an adaptive multi-feature fusion algorithm, so that the semantic information of tiny targets propagates through the multi-scale feature layers and both the semantic and texture information of tiny targets are enriched. Meanwhile, more reasonable prior-box parameters are obtained with a k-means algorithm, which accelerates network convergence and improves model accuracy. Finally, non-maximum suppression is applied to the detection results to screen out overlapping object boxes. The whole network updates its weights end to end until convergence. The resulting adaptive multi-feature fusion detection algorithm can effectively detect tiny targets.

Description

Micro target detection method based on self-adaptive multi-feature fusion
Technical Field
The invention belongs to the field of target detection, and relates to a micro target detection method based on self-adaptive multi-feature fusion.
Background
Although many deep-learning object detection algorithms have appeared in recent years and have greatly advanced the field, detecting tiny objects (smaller than 15 × 15 pixels) in images still leaves much room for improvement. Before deep learning became prevalent, targets of different scales were handled by building image pyramids of different resolutions from the original image and running a detector with a fixed input resolution on each pyramid level, so that small targets were detected at the bottom levels. However, for high-resolution images with complex backgrounds and small targets, the image pyramid incurs excessive computation and memory consumption. In recent years, deep-learning detection methods have achieved many results; for small-target detection, the main approaches are feature pyramids, super-resolution, and GAN-based enhancement. Super-resolution and GAN-based enhancement greatly increase computation and memory consumption when the input image is large, while the main shortcoming of the feature pyramid is the inconsistency among features of different scales. To address these problems, the invention provides a detection method aimed specifically at tiny targets: it adds a new path on top of the traditional feature pyramid to enhance the semantic information of small targets and fuses the multi-scale features along this new path, and is therefore called an Adaptive Multi-Feature Fusion Network. The invention also designs a Lightweight Multi-level Feature Extraction Network, which extracts multi-scale features simply and effectively. The whole model consists of a multi-level feature extraction network, a multi-feature fusion network, and a detection network. The multi-level feature extraction network extracts features from the input image to obtain high-level and low-level semantic features; top-down and bottom-up semantic information transfer is performed on the resulting multi-scale features, and an adaptive multi-feature fusion method adaptively fuses features of different scales, enriching the semantic information of tiny targets; multi-scale prediction is then performed by the detection network, with a k-means method generating objective proposal boxes for the classification and regression tasks. The disclosed feature fusion method can be applied directly to any detector that uses a feature pyramid structure, and yields better accuracy and robustness for detecting tiny targets in images.
Disclosure of Invention
In view of this, the present invention provides a method for detecting a small target based on adaptive multi-feature fusion.
In order to achieve the purpose, the invention provides the following technical scheme:
the method for detecting the tiny target based on the self-adaptive multi-feature fusion comprises the following steps:
1) extracting the high-level and low-level semantic information of tiny targets with the proposed lightweight multi-level feature extraction network, wherein the whole feature extraction network consists of five feature extraction modules, each composed of a [3 × 3,2] convolution network and three convolution blocks, and residual connections are used to increase the depth and feature extraction capability of the network;
2) passing the feature layers with downsampling rates of 8, 16, and 32 through a feature pyramid structure, using [1 × 1,1] convolution networks to handle dimensionality and a bilinear interpolation algorithm to handle scale, with concatenation along the channel dimension as the fusion mode, which increases the feature dimensionality;
3) adding an extra path on top of the feature pyramid structure to enrich the semantic and texture information of tiny targets, using [3 × 3,2] convolution networks to further extract features and adjust dimensionality, the fusion mode still being channel concatenation;
4) passing the twice-fused features through an adaptive multi-feature fusion network, wherein upsampling uses a bilinear interpolation algorithm and downsampling is done with [3 × 3,2] convolution networks and max pooling; meanwhile, [1 × 1,1] convolution networks perform dimension matching, a [1 × 1,1] convolution network with 3 output channels generates the required weight parameters, and the weights are finally multiplied onto the corresponding feature layers for fusion;
5) obtaining prior boxes with a k-means algorithm, clustering the target-box scales of the objects in the dataset to obtain k prior-box scales, which accelerates convergence of the model;
6) finally passing the fused features through separate [3 × 3,1] convolution networks to meet the detection output requirements, and screening the results with a non-maximum suppression algorithm; the whole network is trained in an end-to-end manner until the model converges.
In the step 1), a lightweight multi-level feature extraction network extracts the high-level and low-level semantic information of the input image. The network is composed of several feature extraction modules with the following structure:
a) each feature extraction module consists of a [3 × 3,2] convolution network and three convolution blocks, where 3 × 3 is the convolution kernel size and 2 is the stride, completing a downsampling step with rate 2;
b) each convolution block in the feature extraction module consists of a [1 × 1,1] convolution network and a [3 × 3,1] convolution network, with a residual connection (element-wise addition) to increase the nonlinear capacity and depth of the model;
c) the feature extraction network has five feature extraction modules in total, giving an overall downsampling rate of 32 (2^5); the feature maps output at downsampling rates 8, 16, and 32, i.e., by the third, fourth, and fifth feature extraction modules, are used for adaptive multi-feature fusion.
Optionally, in step 2), the feature layers with downsampling rates of 8, 16, and 32 are denoted p3, p4, and p5, respectively, and are passed through a feature pyramid structure to obtain multi-scale features. The specific steps are:
a) the p5 layer passes through a [1 × 1,1] convolution network, mainly for dimensionality reduction; the output dimensionality is adjusted to that of the p4 layer, and the output feature layer is denoted c5;
b) an upsampling layer using a bilinear interpolation algorithm (in this invention, upsampling defaults to bilinear interpolation unless stated otherwise) doubles the resolution, i.e., the downsampling rate of p5 after upsampling is 16; after the 1 × 1 convolution and the upsampling layer, the output dimensionality and downsampling rate match the p4 layer, so the feature maps of the c5 and p4 layers can be concatenated along the channel dimension to obtain a fused feature map, which then passes through a feature extraction module with 1 × 1 convolution to obtain feature layer c4;
c) similarly, c4 passes through an upsampling layer and is concatenated with the feature map of the p3 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c3.
Optionally, in step 3), a bottom-up path is added on top of the traditional feature pyramid to enrich the semantic information of tiny targets. The specific steps are:
a) the c3 layer passes through a [3 × 3,2] convolution network, which further extracts features and adjusts the output dimensionality to match the c4 layer; it is then concatenated with the feature map of the c4 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c4';
b) similarly, c4' passes through a [3 × 3,2] convolution network and is fused with the c5 layer, then passes through a feature extraction module to obtain feature layer c5'.
Optionally, in the step 4), the feature layers c3, c4', and c5' are obtained for subsequent detection. The specific steps are:
a) with the c5' layer as the fusion layer, the c4' layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network, and the c3 layer requires 4× downsampling, implemented by first downsampling 2× with max pooling and then applying a [3 × 3,2] convolution network; the c5' layer and the processed c4' and c3 layers then pass through the adaptive fusion network to obtain the fusion result F5 of the c5' layer;
b) with the c4' layer as the fusion layer, the c5' layer requires 2× upsampling and the c3 layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network; similarly, the c4' layer and the processed c5' and c3 layers pass through the adaptive fusion network to obtain the fusion result F4 of the c4' layer;
c) with the c3 layer as the fusion layer, the c5' layer requires 4× upsampling and the c4' layer requires 2× upsampling; similarly, the fusion result F3 of the c3 layer is obtained after the adaptive fusion network.
Optionally, in the step 5), the adaptive fusion network is built from several [1 × 1,1] convolution networks. With the c5' layer as the fusion layer, the c5' layer and the processed c4' and c3 layers each pass through a [1 × 1,1] convolution network for dimensionality reduction; the three convolved feature maps are concatenated along the channel dimension and passed through a [1 × 1,1] convolution network with 3 output channels; finally the c5' layer and the processed c4' and c3 layers are multiplied by the weight parameters produced by the adaptive fusion network and summed to obtain the fusion result F5. The same applies when the c4' or c3 layer is the fusion layer. This is expressed by equation (1):

$$F_{level} = \alpha_{level} \cdot x_{3 \to level} + \beta_{level} \cdot x_{4 \to level} + \gamma_{level} \cdot x_{5 \to level} \tag{1}$$

where level denotes the current fusion layer, $x_{n \to level}$ denotes the feature layer of downsampling level n after adjustment to the resolution of the fusion layer (the layer corresponding to level itself needs no adjustment), and $\alpha_{level}$, $\beta_{level}$, and $\gamma_{level}$ are weight parameters, with $\alpha_{level}$ given by equation (2):

$$\alpha_{level} = \frac{e^{\lambda_{\alpha,level}}}{e^{\lambda_{\alpha,level}} + e^{\lambda_{\beta,level}} + e^{\lambda_{\gamma,level}}} \tag{2}$$

where $\lambda_{\alpha,level}$, $\lambda_{\beta,level}$, and $\lambda_{\gamma,level}$ are the weights of the corresponding channels output by the [1 × 1,1] convolution network with 3 output channels; $\beta_{level}$ and $\gamma_{level}$ are defined in the same way, so the three weights sum to 1.
Optionally, in the step 6), after the adaptive multi-feature fusion network, the three fused feature layers F5, F4, and F3 are obtained for the subsequent detection network; before that, the prior-box parameters required by the detection network must be computed from the dataset. Prior-box parameters computed with the k-means algorithm are more reasonable than empirically set ones, which accelerates network convergence and gives the model better performance. The k-means assignment step is:

$$c^{(i)} = \arg\min_{j} \left\lVert x^{(i)} - \mu_j \right\rVert^2 \tag{3}$$

where $x^{(i)}$ is the scale of a target box in the dataset, i = 1, 2, 3, ..., m; j indexes the k prior-box scales to be obtained, with k = 9 by default and j = 1, 2, 3, ..., k; and $\mu_j$ is the cluster center, defined as:

$$\mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}} \tag{4}$$

Equations (3) and (4) are computed repeatedly until the algorithm converges.
Optionally, after the step 6), the method further comprises a step 7): after the prior boxes are obtained, the feature layers F5, F4, and F3 are input into the detection network, which consists of three [3 × 3,1] convolution networks performing dimension matching and dimensionality reduction to meet the detection output requirements; finally, non-maximum suppression is applied to the recognition results of the detection network to obtain the final detection result.
The invention has the following beneficial effects:
The invention provides a tiny target detection method based on adaptive multi-feature fusion. Traditional tiny target detection methods are generally based on image pyramids; with the development of deep learning, methods such as super-resolution and GAN-based enhancement have gradually made progress in the tiny-target field, but when the input image is very large with a complex background, or the number of objects to detect is large, these methods suffer from increased computation, memory overflow, and similar problems. The disclosed adaptive multi-feature fusion method improves recognition of tiny targets with almost no extra memory or time consumption, and the lightweight multi-level feature extraction network designed here extracts image features efficiently while reducing the parameter count and computation of the model.
The method first extracts features from the input image with the lightweight multi-level feature extraction network, so that the resulting feature maps contain both high-level and low-level semantic information; second, it performs path enhancement with a traditional feature pyramid structure and then adds a further path for enhancement, making the target feature information richer; next, adaptive multi-feature fusion makes the semantic information of tiny targets richer, greatly improving the recall and precision of the network model; then, a k-means algorithm computes the prior-box parameters required by the detection network from the target sizes in the dataset, accelerating convergence and improving the generalization of the model; finally, results are recognized by the detection network, and the network weights are updated end to end until convergence.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For a better understanding of the objects, aspects and advantages of the present invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a diagram of a lightweight multi-level feature extraction network architecture;
FIG. 2 is a diagram of an adaptive multi-feature fusion network architecture;
fig. 3 is an overall structure diagram of a network model.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are for illustrating the invention only and are not intended to limit it; to better illustrate the embodiments, some parts of the drawings may be omitted, enlarged, or reduced, and they do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Referring to fig. 1 to 3, a method for detecting a small target based on adaptive multi-feature fusion includes the following steps:
1) Extract the high-level and low-level semantic information of the input image with a lightweight multi-level feature extraction network. The network is composed of several feature extraction modules with the following structure (a code sketch of the module follows this list):
a) each feature extraction module consists of a [3 × 3,2] convolution network and three convolution blocks, where 3 × 3 is the convolution kernel size and 2 is the stride, completing a downsampling step with rate 2;
b) each convolution block in the feature extraction module consists of a [1 × 1,1] convolution network and a [3 × 3,1] convolution network, with a residual connection (element-wise addition) to increase the nonlinear capacity and depth of the model;
c) the feature extraction network has five feature extraction modules in total, giving an overall downsampling rate of 32 (2^5); the feature maps output at downsampling rates 8, 16, and 32 (corresponding to the third, fourth, and fifth feature extraction modules) are used for adaptive multi-feature fusion.
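The following is a minimal PyTorch sketch of one such feature extraction module and the five-module backbone; the channel progression and activation choices are illustrative assumptions not specified in the text above.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution block: [1x1,1] conv then [3x3,1] conv, with an
    element-wise residual addition back to the block input."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1, stride=1),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, kernel_size=3, stride=1, padding=1),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection

class FeatureExtractionModule(nn.Module):
    """[3x3, stride 2] downsampling convolution followed by three conv blocks."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.1),
        )
        self.blocks = nn.Sequential(*(ConvBlock(out_ch) for _ in range(3)))

    def forward(self, x):
        return self.blocks(self.down(x))

class Backbone(nn.Module):
    """Five stacked modules, overall downsampling rate 2^5 = 32; returns the
    outputs of the third, fourth, and fifth modules as p3, p4, p5."""
    def __init__(self):
        super().__init__()
        chs = [3, 32, 64, 128, 256, 512]  # assumed channel progression
        self.stages = nn.ModuleList(
            FeatureExtractionModule(chs[i], chs[i + 1]) for i in range(5)
        )

    def forward(self, x):
        outs = []
        for stage in self.stages:
            x = stage(x)
            outs.append(x)
        return outs[2], outs[3], outs[4]  # downsampling rates 8, 16, 32
```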
2) Denote the feature layers with downsampling rates of 8, 16, and 32 as p3, p4, and p5, respectively, and pass them through a feature pyramid structure to obtain multi-scale features (a sketch of this top-down path follows this list):
a) the p5 layer passes through a [1 × 1,1] convolution network, mainly for dimensionality reduction; the output dimensionality is adjusted to that of the p4 layer, and the output feature layer is denoted c5;
b) an upsampling layer using a bilinear interpolation algorithm (in this invention, upsampling defaults to bilinear interpolation unless stated otherwise) doubles the resolution, i.e., the downsampling rate of p5 after upsampling is 16; after the 1 × 1 convolution and upsampling, the output dimensionality and downsampling rate match the p4 layer, so the feature maps of the c5 and p4 layers can be concatenated along the channel dimension to obtain a fused feature map, which then passes through a feature extraction module with 1 × 1 convolution to obtain feature layer c4;
c) similarly, c4 passes through an upsampling layer and is concatenated with the feature map of the p3 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c3.
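A minimal sketch of this top-down path, assuming channel concatenation as the fusion operation and plain 1×1 convolutions standing in for the post-fusion feature extraction; channel counts follow the assumed backbone above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownPath(nn.Module):
    """Feature pyramid top-down path: p3/p4/p5 -> c3/c4/c5."""
    def __init__(self, ch3: int = 128, ch4: int = 256, ch5: int = 512):
        super().__init__()
        self.reduce5 = nn.Conv2d(ch5, ch4, kernel_size=1, stride=1)  # p5 -> c5
        self.fuse4 = nn.Conv2d(ch4 + ch4, ch4, kernel_size=1, stride=1)
        self.reduce4 = nn.Conv2d(ch4, ch3, kernel_size=1, stride=1)
        self.fuse3 = nn.Conv2d(ch3 + ch3, ch3, kernel_size=1, stride=1)

    def forward(self, p3, p4, p5):
        c5 = self.reduce5(p5)
        up5 = F.interpolate(c5, scale_factor=2, mode="bilinear", align_corners=False)
        c4 = self.fuse4(torch.cat([up5, p4], dim=1))  # channel concatenation
        up4 = F.interpolate(self.reduce4(c4), scale_factor=2,
                            mode="bilinear", align_corners=False)
        c3 = self.fuse3(torch.cat([up4, p3], dim=1))
        return c3, c4, c5
```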
3) On the basis of the traditional feature pyramid, add a bottom-up path to enrich the semantic information of tiny targets (a sketch follows this list):
a) the c3 layer passes through a [3 × 3,2] convolution network, which further extracts features and adjusts the output dimensionality to match the c4 layer; it is then concatenated with the feature map of the c4 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c4';
b) similarly, c4' passes through a [3 × 3,2] convolution network and is fused with the c5 layer, then passes through a feature extraction module to obtain feature layer c5'.
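A sketch of the added bottom-up path under the same assumptions (channel concatenation as the fusion mode, 1×1 convolutions as the post-fusion refinement); note that c5 here already has the reduced channel count produced by the top-down path above.

```python
import torch
import torch.nn as nn

class BottomUpPath(nn.Module):
    """Extra bottom-up path: c3/c4/c5 -> c3/c4'/c5'."""
    def __init__(self, ch3: int = 128, ch4: int = 256, ch5: int = 256):
        super().__init__()
        # [3x3, stride 2] convs downsample and adjust dimensionality
        self.down3 = nn.Conv2d(ch3, ch4, kernel_size=3, stride=2, padding=1)
        self.fuse4 = nn.Conv2d(ch4 + ch4, ch4, kernel_size=1, stride=1)
        self.down4 = nn.Conv2d(ch4, ch5, kernel_size=3, stride=2, padding=1)
        self.fuse5 = nn.Conv2d(ch5 + ch5, ch5, kernel_size=1, stride=1)

    def forward(self, c3, c4, c5):
        c4p = self.fuse4(torch.cat([self.down3(c3), c4], dim=1))   # c4'
        c5p = self.fuse5(torch.cat([self.down4(c4p), c5], dim=1))  # c5'
        return c3, c4p, c5p
```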
4) After the above operations, the feature layers c3, c4', and c5' are obtained for subsequent detection. Although the semantic information of the c3 and c4' layers has been enhanced by two passes along the pyramid paths, it is still not as rich as that of the c5' layer. Texture information provides accurate position information for a target, while the strength of semantic information helps decide whether a region is foreground or background and which class an object belongs to: a feature layer with a low downsampling rate has strong texture information but insufficient semantic information, whereas a feature layer with a high downsampling rate has rich semantic information but insufficient texture information. The invention therefore discloses an adaptive multi-feature fusion method that effectively enriches the missing information in feature layers of different downsampling rates. The specific steps are:
a) with the c5' layer as the fusion layer, the c4' layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network, and the c3 layer requires 4× downsampling, implemented by first downsampling 2× with max pooling and then applying a [3 × 3,2] convolution network; the c5' layer and the processed c4' and c3 layers then pass through the adaptive fusion network to obtain the fusion result F5 of the c5' layer;
b) with the c4' layer as the fusion layer, the c5' layer requires 2× upsampling and the c3 layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network; similarly, the c4' layer and the processed c5' and c3 layers pass through the adaptive fusion network to obtain the fusion result F4 of the c4' layer;
c) with the c3 layer as the fusion layer, the c5' layer requires 4× upsampling and the c4' layer requires 2× upsampling; similarly, the fusion result F3 of the c3 layer is obtained after the adaptive fusion network.
5) The adaptive fusion network is built from several [1 × 1,1] convolution networks. Taking the c5' layer as the fusion layer as an example, the c5' layer and the processed c4' and c3 layers each pass through a [1 × 1,1] convolution network for dimensionality reduction; the three convolved feature maps are concatenated along the channel dimension and passed through a [1 × 1,1] convolution network with 3 output channels; finally the c5' layer and the processed c4' and c3 layers are multiplied by the weight parameters produced by the adaptive fusion network and summed to obtain the fusion result F5. The same applies when the c4' or c3 layer is the fusion layer. This process is expressed by equation (1):

$$F_{level} = \alpha_{level} \cdot x_{3 \to level} + \beta_{level} \cdot x_{4 \to level} + \gamma_{level} \cdot x_{5 \to level} \tag{1}$$

where level denotes the current fusion layer, $x_{n \to level}$ denotes the feature layer of downsampling level n after adjustment to the resolution of the fusion layer (the layer corresponding to level itself needs no adjustment), and $\alpha_{level}$, $\beta_{level}$, and $\gamma_{level}$ are weight parameters, with $\alpha_{level}$ given by equation (2):

$$\alpha_{level} = \frac{e^{\lambda_{\alpha,level}}}{e^{\lambda_{\alpha,level}} + e^{\lambda_{\beta,level}} + e^{\lambda_{\gamma,level}}} \tag{2}$$

where $\lambda_{\alpha,level}$, $\lambda_{\beta,level}$, and $\lambda_{\gamma,level}$ are the weights of the corresponding channels output by the [1 × 1,1] convolution network with 3 output channels; $\beta_{level}$ and $\gamma_{level}$ are defined in the same way, so the three weights sum to 1.
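A minimal sketch of the adaptive fusion network of equations (1) and (2): each input passes a dimensionality-reducing [1×1,1] convolution, the results are concatenated, a [1×1,1] convolution with 3 output channels produces the logits, and a channel softmax (the reconstruction of equation (2) assumed here) yields α, β, γ for the weighted sum. The hidden width and wiring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusion(nn.Module):
    """Adaptive fusion for one fusion layer (equations (1) and (2))."""
    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        # one dimensionality-reducing [1x1,1] conv per input feature layer
        self.compress = nn.ModuleList(
            nn.Conv2d(channels, hidden, kernel_size=1, stride=1) for _ in range(3)
        )
        # [1x1,1] conv with 3 output channels -> one weight logit per layer
        self.weight_logits = nn.Conv2d(3 * hidden, 3, kernel_size=1, stride=1)

    def forward(self, x3, x4, x5):
        """x3, x4, x5: the three feature layers already resampled to the
        fusion layer's resolution (the x_{n->level} of equation (1))."""
        w = torch.cat([m(x) for m, x in zip(self.compress, (x3, x4, x5))], dim=1)
        w = F.softmax(self.weight_logits(w), dim=1)    # equation (2)
        alpha, beta, gamma = w[:, 0:1], w[:, 1:2], w[:, 2:3]
        return alpha * x3 + beta * x4 + gamma * x5     # equation (1)
```

For F5 (step a) above), the resampling would be wired, for example, as max pooling plus a [3 × 3,2] convolution (also matching channels) for c3 and a single [3 × 3,2] convolution for c4' before calling `AdaptiveFusion`; for F4 and F3 the bilinear upsampling of steps b) and c) applies.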
6) After the adaptive multi-feature fusion network, the three fused feature layers F5, F4, and F3 are obtained for the subsequent detection network; before they are used, the prior-box parameters required by the detection network are computed from the dataset. Prior-box parameters computed with the k-means algorithm are more reasonable than empirically set ones, which accelerates network convergence and gives the model better performance. The k-means assignment step is:

$$c^{(i)} = \arg\min_{j} \left\lVert x^{(i)} - \mu_j \right\rVert^2 \tag{3}$$

where $x^{(i)}$ is the scale of a target box in the dataset, i = 1, 2, 3, ..., m; j indexes the k prior-box scales to be obtained (k = 9 by default), j = 1, 2, 3, ..., k; and $\mu_j$ is the cluster center, defined as:

$$\mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}} \tag{4}$$

Equations (3) and (4) are computed repeatedly until the algorithm converges.
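A sketch of the prior-box clustering of equations (3) and (4) as plain Euclidean k-means over (width, height) pairs; the (m, 2) input layout and the initialization scheme are assumptions.

```python
import numpy as np

def kmeans_anchors(boxes: np.ndarray, k: int = 9, iters: int = 100, seed: int = 0):
    """boxes: (m, 2) array of target-box (width, height) scales.
    Returns k cluster centers used as prior-box scales."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # equation (3): assign each box to its nearest cluster center
        d = ((boxes[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = d.argmin(axis=1)
        # equation (4): recompute each center as the mean of its members
        new_centers = np.array([
            boxes[assign == j].mean(axis=0) if (assign == j).any() else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return centers
```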
7) After the prior boxes are obtained, the feature layers F5, F4, and F3 can be input into the detection network. The detection network consists of three [3 × 3,1] convolution networks performing dimension matching and dimensionality reduction to meet the detection output requirements; finally, non-maximum suppression is applied to the recognition results of the detection network to obtain the final detection result.
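A sketch of the detection stage under common anchor-based assumptions (3 anchors per cell, box + objectness + class outputs — a layout the text does not specify), with non-maximum suppression from torchvision:

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class DetectionHead(nn.Module):
    """One [3x3,1] conv mapping a fused feature layer to detection outputs."""
    def __init__(self, in_ch: int, num_anchors: int = 3, num_classes: int = 20):
        super().__init__()
        out_ch = num_anchors * (5 + num_classes)  # 4 box coords + objectness + classes
        self.pred = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return self.pred(x)

def filter_detections(boxes, scores, iou_thresh=0.5, score_thresh=0.25):
    """boxes: (N, 4) xyxy tensor; scores: (N,) tensor.
    Returns the indices kept after score filtering and NMS."""
    idx = (scores > score_thresh).nonzero(as_tuple=True)[0]
    kept = nms(boxes[idx], scores[idx], iou_thresh)
    return idx[kept]
```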
In summary:
1. A lightweight multi-level feature extraction network extracts the high-level and low-level semantic features of the input image, and the intermediate results of the last three feature layers are stored for subsequent feature fusion;
2. the traditional feature pyramid path is enhanced once, a bottom-up path is added on top of the traditional feature pyramid to enrich the feature information of tiny targets, and finally the adaptive multi-feature fusion method performs multi-layer feature fusion, further improving the semantic information of tiny targets;
3. the prior-box parameters are obtained with a k-means algorithm, the recognition results of the image are obtained through the detection network and non-maximum suppression, and the whole network is trained end to end, continuously updating the weight parameters until the network converges.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that various changes and modifications may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A method for detecting tiny targets based on self-adaptive multi-feature fusion, characterized by comprising the following steps:
1) extracting the high-level and low-level semantic information of tiny targets with the proposed lightweight multi-level feature extraction network, wherein the whole feature extraction network consists of five feature extraction modules, each composed of a [3 × 3,2] convolution network and three convolution blocks, and residual connections are used to increase the depth and feature extraction capability of the network;
2) passing the feature layers with downsampling rates of 8, 16, and 32 through a feature pyramid structure, using [1 × 1,1] convolution networks to handle dimensionality and a bilinear interpolation algorithm to handle scale, with concatenation along the channel dimension as the fusion mode, which increases the feature dimensionality;
3) adding an extra path on top of the feature pyramid structure to enrich the semantic and texture information of tiny targets, using [3 × 3,2] convolution networks to further extract features and adjust dimensionality, the fusion mode still being channel concatenation;
4) passing the twice-fused features through an adaptive multi-feature fusion network, wherein upsampling uses a bilinear interpolation algorithm and downsampling is done with [3 × 3,2] convolution networks and max pooling; meanwhile, [1 × 1,1] convolution networks perform dimension matching, a [1 × 1,1] convolution network with 3 output channels generates the required weight parameters, and the weights are finally multiplied onto the corresponding feature layers for fusion;
5) obtaining prior boxes with a k-means algorithm, clustering the target-box scales of the objects in the dataset to obtain k prior-box scales, which accelerates convergence of the model;
6) finally passing the fused features through separate [3 × 3,1] convolution networks to meet the detection output requirements, and screening the results with a non-maximum suppression algorithm; the whole network is trained in an end-to-end manner until the model converges;
in the step 1), a lightweight multi-level feature extraction network extracts the high-level and low-level semantic information of the input image, the network being composed of several feature extraction modules with the following structure:
a) each feature extraction module consists of a [3 × 3,2] convolution network and three convolution blocks, where 3 × 3 is the convolution kernel size and 2 is the stride, completing a downsampling step with rate 2;
b) each convolution block in the feature extraction module consists of a [1 × 1,1] convolution network and a [3 × 3,1] convolution network, with a residual connection (element-wise addition) to increase the nonlinear capacity and depth of the model;
c) the feature extraction network has five feature extraction modules in total, and the feature maps output at downsampling rates 8, 16, and 32, i.e., by the third, fourth, and fifth feature extraction modules, are used for adaptive multi-feature fusion;
in the step 2), the feature layers with downsampling rates of 8, 16, and 32 are denoted p3, p4, and p5, respectively, and are passed through a feature pyramid structure to obtain multi-scale features, specifically:
a) the p5 layer passes through a [1 × 1,1] convolution network, mainly for dimensionality reduction; the output dimensionality is adjusted to that of the p4 layer, and the output feature layer is denoted c5;
b) an upsampling layer using a bilinear interpolation algorithm doubles the resolution, i.e., the downsampling rate of p5 after upsampling is 16; after the 1 × 1 convolution and upsampling, the output dimensionality and downsampling rate match the p4 layer, so the feature maps of the c5 and p4 layers can be concatenated along the channel dimension to obtain a fused feature map, which then passes through a feature extraction module with 1 × 1 convolution to obtain feature layer c4;
c) similarly, c4 passes through an upsampling layer and is concatenated with the feature map of the p3 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c3;
in 3), on the basis of the traditional feature pyramid, a bottom-up path is added to enrich the semantic information of tiny targets, specifically:
a) the c3 layer passes through a [3 × 3,2] convolution network, which further extracts features and adjusts the output dimensionality to match the c4 layer; it is then concatenated with the feature map of the c4 layer to obtain a fused feature map, which passes through a feature extraction module to obtain feature layer c4';
b) similarly, c4' passes through a [3 × 3,2] convolution network and is fused with the c5 layer, then passes through a feature extraction module to obtain feature layer c5';
in the step 4), the feature layers c3, c4', and c5' are obtained for subsequent detection, specifically:
a) with the c5' layer as the fusion layer, the c4' layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network, and the c3 layer requires 4× downsampling, implemented by first downsampling 2× with max pooling and then applying a [3 × 3,2] convolution network; the c5' layer and the processed c4' and c3 layers then pass through the adaptive fusion network to obtain the fusion result F5 of the c5' layer;
b) with the c4' layer as the fusion layer, the c5' layer requires 2× upsampling and the c3 layer requires 2× downsampling, implemented with a [3 × 3,2] convolution network; similarly, the c4' layer and the processed c5' and c3 layers pass through the adaptive fusion network to obtain the fusion result F4 of the c4' layer;
c) with the c3 layer as the fusion layer, the c5' layer requires 4× upsampling and the c4' layer requires 2× upsampling; similarly, the fusion result F3 of the c3 layer is obtained after the adaptive fusion network;
in the step 5), the adaptive fusion network is built from several [1 × 1,1] convolution networks; with the c5' layer as the fusion layer, the c5' layer and the processed c4' and c3 layers each pass through a [1 × 1,1] convolution network for dimensionality reduction, the three convolved feature maps are concatenated along the channel dimension and passed through a [1 × 1,1] convolution network with 3 output channels, and finally the c5' layer and the processed c4' and c3 layers are multiplied by the weight parameters obtained from the adaptive fusion network and summed to obtain the fusion result F5; the same applies when the c4' or c3 layer is the fusion layer, as expressed by equation (1):

$$F_{level} = \alpha_{level} \cdot x_{3 \to level} + \beta_{level} \cdot x_{4 \to level} + \gamma_{level} \cdot x_{5 \to level} \tag{1}$$

where level denotes the current fusion layer, $x_{n \to level}$ denotes the feature layer of downsampling level n after adjustment to the resolution of the fusion layer, and $\alpha_{level}$, $\beta_{level}$, and $\gamma_{level}$ are weight parameters, with $\alpha_{level}$ given by equation (2):

$$\alpha_{level} = \frac{e^{\lambda_{\alpha,level}}}{e^{\lambda_{\alpha,level}} + e^{\lambda_{\beta,level}} + e^{\lambda_{\gamma,level}}} \tag{2}$$

where $\lambda_{\alpha,level}$, $\lambda_{\beta,level}$, and $\lambda_{\gamma,level}$ are the weights of the corresponding channels output by the [1 × 1,1] convolution network with 3 output channels, and $\beta_{level}$ and $\gamma_{level}$ are defined in the same way.
2. The method for detecting tiny targets based on adaptive multi-feature fusion according to claim 1, characterized in that: in the step 6), after the adaptive multi-feature fusion network, three fused feature layers F5, F4, and F3 are obtained for the subsequent detection network; before that, the prior-box parameters required by the detection network are computed from the dataset; prior-box parameters computed with the k-means algorithm are more reasonable than empirically set ones, which accelerates network convergence and gives the model better performance, the k-means assignment being:

$$c^{(i)} = \arg\min_{j} \left\lVert x^{(i)} - \mu_j \right\rVert^2 \tag{3}$$

where $x^{(i)}$ is the scale of a target box in the dataset, i = 1, 2, 3, ..., m; j indexes the k prior-box scales to be obtained, with k = 9 by default and j = 1, 2, 3, ..., k; and $\mu_j$ is the cluster center, defined as:

$$\mu_j = \frac{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} \mathbf{1}\{c^{(i)} = j\}} \tag{4}$$

equations (3) and (4) being computed repeatedly until the algorithm converges.
3. The method for detecting tiny targets based on adaptive multi-feature fusion according to claim 2, characterized in that: after the step 6), the method further comprises a step 7): after the prior boxes are obtained, the feature layers F5, F4, and F3 are input into the detection network, which consists of three [3 × 3,1] convolution networks performing dimension matching and dimensionality reduction to meet the detection output requirements; finally, non-maximum suppression is applied to the recognition results of the detection network to obtain the final detection result.
CN202011204130.9A (filed 2020-11-02) — Micro target detection method based on self-adaptive multi-feature fusion — Active — granted as CN112270366B

Priority Applications (1)

Application Number: CN202011204130.9A
Priority Date / Filing Date: 2020-11-02 / 2020-11-02
Title: Micro target detection method based on self-adaptive multi-feature fusion

Publications (2)

CN112270366A — published 2021-01-26
CN112270366B — published 2022-08-26 (grant)

Family

ID=74345879

Family Applications (1)

CN202011204130.9A — filed 2020-11-02 — Micro target detection method based on self-adaptive multi-feature fusion — granted as CN112270366B

Country Status (1)

CN — CN112270366B

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112950703B (en) * 2021-03-11 2024-01-19 无锡禹空间智能科技有限公司 Small target detection method, device, storage medium and equipment
CN113011442A (en) * 2021-03-26 2021-06-22 山东大学 Target detection method and system based on bidirectional adaptive feature pyramid
CN114022682A (en) * 2021-11-05 2022-02-08 天津大学 Weak and small target detection method based on attention secondary feature fusion mechanism


Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992854A (en) * 2017-12-22 2018-05-04 重庆邮电大学 Forest Ecology man-machine interaction method based on machine vision
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN109658412B (en) * 2018-11-30 2021-03-30 湖南视比特机器人有限公司 Rapid packaging box identification and segmentation method for unstacking and sorting
CN110555475A (en) * 2019-08-29 2019-12-10 华南理工大学 few-sample target detection method based on semantic information fusion
CN111199255A (en) * 2019-12-31 2020-05-26 上海悠络客电子科技股份有限公司 Small target detection network model and detection method based on dark net53 network
CN111860637B (en) * 2020-07-17 2023-11-21 河南科技大学 Single-shot multi-frame infrared target detection method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110097129A (en) * 2019-05-05 2019-08-06 西安电子科技大学 Remote sensing target detection method based on profile wave grouping feature pyramid convolution

Also Published As

Publication number Publication date
CN112270366A (en) 2021-01-26


Legal Events

Code | Title
PB01 — Publication
SE01 — Entry into force of request for substantive examination
GR01 — Patent grant