CN115471721A - Image target detection method, system, electronic device and storage medium - Google Patents

Image target detection method, system, electronic device and storage medium

Info

Publication number
CN115471721A
CN115471721A (application CN202211087762.0A)
Authority
CN
China
Prior art keywords
image
feature
input
dimensionality
fusion
Prior art date
Legal status
Pending
Application number
CN202211087762.0A
Other languages
Chinese (zh)
Inventor
王嘉荣
李岩山
张坤华
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Priority date
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN202211087762.0A
Publication of CN115471721A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image target detection method, system, electronic device and storage medium. The method comprises the following steps: performing feature extraction on an input image to obtain three feature maps; performing feature fusion on all feature maps to obtain a fused image; predicting the fused image to obtain a predicted image; and post-processing the predicted image to obtain the final detection result of the input image. Performing feature extraction on the input image to obtain three feature maps comprises: performing five downsampling operations on the input image using a pre-constructed feature extraction network to obtain five feature maps of different sizes, and retaining the last three, namely a first feature map, a second feature map and a third feature map. According to the invention, during network training the large-size feature maps retain rich position information and the small-size feature maps retain condensed semantic information, so feature fusion is more effective and the detection accuracy of the network is higher.

Description

Image target detection method, system, electronic device and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image target detection method, an image target detection system, an electronic device, and a storage medium.
Background
Image detection is an important branch of image processing. In particular, target detection in unmanned aerial vehicle (UAV) aerial images is a key link of UAV technology, and detection performance directly affects the application of UAVs in military and civil fields. Existing high-precision detection algorithms have many parameters and heavy weights, making them difficult to deploy on small devices.
Disclosure of Invention
The invention mainly aims to provide an image target detection method, an image target detection system, an electronic device and a storage medium, and aims to solve the technical problems that detection algorithms in the prior art are large in parameter quantity and weight and are difficult to deploy in small equipment.
In order to achieve the above object, a first aspect of the present invention provides an image target detection method, including: performing feature extraction on an input image to obtain three feature maps; performing feature fusion on all feature maps to obtain a fused image; predicting the fused image to obtain a predicted image; and post-processing the predicted image to obtain the final detection result of the input image. Performing feature extraction on the input image to obtain three feature maps comprises: performing five downsampling operations on the input image using a pre-constructed feature extraction network to obtain five feature maps of different sizes, and retaining the last three, namely a first feature map, a second feature map and a third feature map.
Further, performing feature fusion on all feature maps to obtain a fused image includes: adjusting the dimensionality of the third feature map to obtain a first image; applying a transposed convolution to the first image, splicing the result with the second feature map, passing it through a first residual structure and adjusting the dimensionality to obtain a second image; applying a transposed convolution to the second image, splicing the result with the first feature map and passing it through a second residual structure to obtain a first fused image; downsampling the first fused image, splicing it with the second image and passing it through a third residual structure to obtain a second fused image; and downsampling the second fused image, splicing it with the first image and passing it through a fourth residual structure to obtain a third fused image.
Further, predicting the fused image to obtain a predicted image includes: performing convolution and matrix rearrangement operations on the first fused image, the second fused image and the third fused image in the three prediction heads respectively, and adjusting the feature dimensionality of each fused image to a uniform value to obtain three predicted images. Post-processing the predicted image to obtain the final detection result of the input image includes: setting a confidence threshold and removing prior boxes in the three predicted images whose confidence is below the threshold; using the NMS algorithm, setting an IoU threshold and comparing the intersection-over-union of the prior boxes and the ground-truth boxes in the three predicted images; among the prior boxes whose IoU is above the threshold, retaining the one with the highest value; and aggregating all prior boxes retained by the three prediction heads to obtain the final detection result of the input image.
Further, the first residual structure, the second residual structure, the third residual structure and the fourth residual structure all use the same residual structure, which includes a first channel structure and a second channel structure. The first channel structure is a 1×1 Conv structure. The second channel structure includes a GhostCBS structure, a GhostNeck structure, a Concat structure and a 1×1 CBS structure. The GhostNeck structure includes a GhostCBS structure, an SENet structure, a Concat structure and a GhostCBS structure, where the input of the first GhostCBS structure is also one input of the Concat structure. The SENet structure includes a Flatten structure, a 1×1 Conv structure and a Multiply structure, where the input of the Flatten structure is also one input of the Multiply structure. The GhostCBS structure includes a 1×1 CBS structure, a 5×5 DWCBS structure and a Concat structure, where the output of the 1×1 CBS structure is one input of the Concat structure. The first channel structure and the GhostCBS structure of the second channel structure have the same input, and the output of the first channel structure is one input of the Concat structure of the second channel structure. The 1×1 Conv structure performs a 1×1 convolution on the image, the Concat structure splices images, the GhostCBS structure reduces the dimensionality of the image, and the 1×1 CBS structure restores the dimensionality of the image.
Further, in the generation of the first fused image, a YOLO-M network architecture is used and the dimensionality is adjusted with CBS structures; or a YOLO-L network architecture is used and the dimensionality is adjusted with GhostCBS structures; or a YOLO-S network architecture is used and the transposed convolution operation is replaced by a nearest-neighbor interpolation upsampling operation.
Further, the method of processing an image using the residual structure includes: in a first channel, performing a 1×1 convolution on the input image and reducing its dimensionality to half of the original to obtain a first image; and in a second channel, reducing the dimensionality of the input image and performing dimension weighting, then splicing the result with the first image and fusing the features to obtain a second image whose width, height and dimensionality are unchanged compared with the input image.
Further, reducing the dimension and weighting the dimension of the input image in the second channel comprises: carrying out dimensionality reduction processing on an input image to obtain a low-dimensional image; dimension weighting is performed on the low-dimensional image.
Further, performing dimensionality reduction processing on the input image includes: and reducing the dimension of the input image to be half of the input dimension to obtain a dimension-reduced image.
Further, dimension weighting the low-dimensional image includes: flattening the low-dimensional image; convolving the flattened low-dimensional image with 1*1 to reduce dimensionality; activating a low-dimensional image by using a SiLU function; recovering the low dimensional image using a convolution of 1*1; using Sigmoid activation to obtain a dimension weight; and multiplying the input image by the dimension weight to obtain a dimension weighted output image.
A second aspect of the present invention provides an image target detection system, including: a feature extraction module for performing feature extraction on an input image to obtain three feature maps; a feature fusion module for performing feature fusion on all feature maps to obtain a fused image; a prediction module for predicting the fused image to obtain a predicted image; and a post-processing module for post-processing the predicted image to obtain the final detection result of the input image. The feature extraction module includes: a downsampling unit for performing five downsampling operations on the input image using a pre-constructed feature extraction network to obtain five feature maps of different sizes; and a feature retaining unit for retaining the last three feature maps, namely a first feature map, a second feature map and a third feature map.
A third aspect of the present invention provides an electronic apparatus comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor implements any one of the above image target detection methods when executing the computer program.
A fourth aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the image object detection method of any one of claims 1 to 9.
The image target detection method, system, electronic device and storage medium provided by the invention have the following advantage: during network training, the large-size feature maps retain rich position information and the small-size feature maps retain condensed semantic information, so feature fusion is more effective and the detection accuracy of the network is higher.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an image target detection method according to an embodiment of the present invention;
FIG. 2 is a diagram of the architecture of a YOLO-M target detection network of the image target detection method according to the embodiment of the present invention;
fig. 3 is a structural diagram of CSPGhostNeck in the image target detection method according to the embodiment of the present invention;
FIG. 4 is a structural diagram of GhostNeck and BottleNeck in the image target detection method according to the embodiment of the invention;
FIG. 5 is a block diagram of CSPNet and CommonNet in an image object detection method according to an embodiment of the present invention;
FIG. 6 is a diagram of the architecture of a YOLO-L target detection network of the image target detection method according to the embodiment of the present invention;
FIG. 7 is a block diagram of a YOLO-S target detection network of the image target detection method according to the embodiment of the present invention;
FIG. 8 is a partial image of the data set VisDrone2021-DET for the image target detection method in accordance with an embodiment of the present invention;
fig. 9 is a partial image of a data set CARPK of an image target detection method according to an embodiment of the invention;
fig. 10 is a comparative scatter diagram based on a data set CARPK of the image target detection method according to the embodiment of the present invention;
FIG. 11 is a comparative scatter plot based on data sets VisDrone2021-DET of the image target detection method according to the embodiment of the present invention;
FIG. 12 is a graph of the result of visualization detection of YOLO-L, YOLO-M, YOLO-S based on the data set CARPK according to the image target detection method of the embodiment of the present invention;
FIG. 13 is a block diagram of an image target detection system in accordance with an embodiment of the present invention;
FIG. 14 is a block diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order to make the objects, features and advantages of the present invention more obvious and understandable, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In recent years, with the rapid development of UAV technology, the collection of aerial images has become more convenient, and the processing and analysis of UAV aerial images has become a current research focus. UAV aerial images are highly unstructured and cover large areas, making foreground and background difficult to separate; moreover, UAVs impose requirements such as miniaturization and low energy consumption on their payloads. These factors make deep-learning-based target detection in UAV aerial images challenging, and studying target detection algorithms for UAV aerial images is therefore of great significance. However, UAV aerial image target detection networks with high detection accuracy often have large parameter counts and heavy weights, which hinders the application of deep-learning-based target detection networks on UAVs.
Researchers at home and abroad have developed a series of lightweight deep networks, including MobileNet, ShuffleNet, GhostNet, YOLOv4-Tiny, etc. The YOLOv5n network is a lightweight target detection network, and test results on a natural-scene image data set show that it greatly reduces the parameters of the network while maintaining a certain level of detection accuracy.
In order to reduce the parameter count and weight of UAV aerial image target detection networks so that they can be deployed on devices with limited storage and computing resources, the invention combines the characteristics of UAV aerial images under the YOLO framework and proposes three lightweight UAV aerial image target detection networks, namely YOLO-L, YOLO-M and YOLO-S. The main points of the invention are summarized as follows:
(1) A lightweight CSPGhostNeck residual structure is designed. Using the CSPGhostNeck residual structure to fuse the sparse features of aerial images effectively reduces the parameters and weight of the detection network.
(2) Combining the CSPGhostNeck residual structure and transposed convolution, three lightweight networks with different parameter counts, YOLO-L, YOLO-M and YOLO-S, are proposed. Transposed convolution is introduced for upsampling so that the network learns the upsampling rule, better preserves the feature information of the low-level feature maps and propagates target position information in UAV aerial images, effectively improving detection accuracy. YOLO-L introduces transposed convolution and has the highest detection accuracy; YOLO-S uses the CSPGhostNeck residual structure and is the most lightweight; and YOLO-M uses the CSPGhostNeck residual structure and transposed convolution simultaneously, improving detection accuracy while reducing the parameter count and weight of the network.
Referring to fig. 1, an image target detection method includes:
s101, performing feature extraction on an input image to obtain three feature maps;
s102, performing feature fusion on all feature maps to obtain a fusion image;
s103, predicting the fused image to obtain a predicted image;
and S104, post-processing the predicted image to obtain a final detection result of the input image.
In this embodiment, a lightweight CSPGhostNeck residual structure is designed, YOLOv5n is used as the reference network, and transposed convolution is introduced, yielding a lightweight target detection network YOLO-M suitable for UAV aerial images.
The YOLO-M target detection network mainly comprises the stages of feature extraction, feature fusion, prediction and post-processing, and the overall network architecture is shown in FIG. 2. In the figure, CBS is an abbreviation of convolution-batch normalization-SiLU activation, indicating that convolution, batch normalization and SiLU activation operations are performed. Setting the convolution parameters of the CBS can change the dimensionality of the input image or downsample it. SPPF is a spatial pyramid pooling structure consisting of a CBS structure and Maxpool max-pooling operations, and the input and output shapes of SPPF are the same. Head denotes a prediction head, CG denotes the CSPGhostNeck residual structure, and CB denotes the CSPBottleNeck residual structure.
In step S101, performing feature extraction on the input image to obtain three feature maps includes: performing five downsampling operations on the input image using a pre-constructed feature extraction network to obtain five feature maps of different sizes, and retaining the last three, namely a first feature map, a second feature map and a third feature map.
In the feature extraction stage, given the low resolution of aerial images, a CSPBottleNeck residual structure, a CBS structure and an SPPF structure are used to extract features effectively. In order to detect targets of different scales, the feature extraction network downsamples the input image 5 times to obtain feature maps of different sizes, and the last three feature maps are retained. The downsampling is implemented by a CBS structure with a 3×3 convolution kernel, a stride of 2 and a padding of 1. After 5 downsampling operations, the width and height of the smallest feature map are 1/32 of those of the input image. Since downsampling reduces the size of the image and loses valid features, the features are supplemented by increasing the dimensionality of the feature maps, which grows gradually as downsampling proceeds. The CBS structure used for downsampling adopts the SiLU activation function, whose expression is shown in formula (1):
SiLU(x)=x*σ(x) (1)
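By way of illustration, a minimal PyTorch-style sketch of the CBS block and of the downsampling configuration described above is given below; the class name and default values are exemplary and are not part of the original disclosure.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution -> BatchNorm -> SiLU; with k=3, s=2, p=1 it halves width and height."""
    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=s, padding=p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x), formula (1)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# One of the five downsampling steps: 3x3 kernel, stride 2, padding 1
x = torch.randn(1, 64, 160, 160)
y = CBS(64, 128, k=3, s=2, p=1)(x)
print(y.shape)  # torch.Size([1, 128, 80, 80])
```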
and a residual structure is connected after the down-sampling operation, so that the network can be deepened, and the gradient disappearance can be avoided. The residual structure may have different manifestations. And the YOLO-M uses a CSPBottleReck residual error structure to effectively extract deep features. The minimum feature map is fed into the SPPF structure, and the structure comprises 3 times of maximum pooling operations of 5*5, so that the receptive field can be increased, and the network can sense the image more comprehensively. And taking the result of the three times of down-sampling and SPPF as the input of the feature fusion stage.
In one embodiment, in step S102, performing feature fusion on all feature maps to obtain a fused image includes: adjusting the dimensionality of the third feature map to obtain a first image; applying a transposed convolution to the first image, splicing the result with the second feature map, passing it through a first residual structure and adjusting the dimensionality to obtain a second image; applying a transposed convolution to the second image, splicing the result with the first feature map and passing it through a second residual structure to obtain a first fused image; downsampling the first fused image, splicing it with the second image and passing it through a third residual structure to obtain a second fused image; and downsampling the second fused image, splicing it with the first image and passing it through a fourth residual structure to obtain a third fused image.
In one embodiment, a method of processing an image using a residual structure includes: performing 1*1 convolution on an input image in a first channel, and reducing the dimensionality of the image to half of the original dimensionality to obtain a first image; and reducing dimensionality and dimensionality weighting of the input image in a second channel, splicing the input image with the first image, and fusing features to obtain a second image which is not changed in width and height and dimensionality compared with the input image.
In one embodiment, reducing the dimensions, weighting the dimensions, of the input image in the second channel comprises: carrying out dimensionality reduction processing on an input image to obtain a low-dimensional image; dimension weighting is performed on the low-dimensional image.
In one embodiment, the dimensionality reduction processing of the input image includes: and reducing the dimensionality of the input image to half of the input dimensionality to obtain a dimension-reduced image.
In one embodiment, dimensionally weighting the low-dimensional image comprises: flattening the low-dimensional image; convolving the flattened low-dimensional image with 1*1 to reduce dimensionality; activating a low-dimensional image by using a SiLU function; recovering the low dimensional image using convolution with 1*1; using Sigmoid activation to obtain a dimension weight; and multiplying the input image by the dimension weight to obtain a dimension weighted output image.
In the feature fusion stage, a PANet network is used to fuse the extracted features; its building blocks include the CBS structure, the CSPGhostNeck residual structure, transposed convolution and splicing operations. Some of the CBS structures adjust dimensionality while others perform downsampling. The PANet contains two paths, a bottom-up path and a top-down path: the bottom-up path passes low-level information to higher levels, and the top-down path passes high-level information to lower levels. Since feature maps of different levels have different sizes, they must be upsampled or downsampled before features are passed. In YOLO-M, transposed convolution is introduced and the upsampling operation is completed by setting its parameters. Because transposed convolution has learnable parameters, the detection network can learn an effective upsampling rule during training, reducing information loss and completing feature fusion better, thereby improving detection accuracy. The downsampling operation of YOLO-M is completed by a CBS structure.
In this embodiment, the YOLO-M network architecture is used in the generation of the first fused image, and the dimensionality is adjusted with CBS structures. In other embodiments, the YOLO-L network architecture may be used in the generation of the first fused image, with the dimensionality adjusted using GhostCBS structures; or the YOLO-S network architecture may be used, with the transposed convolution replaced by a nearest-neighbor interpolation upsampling operation.
After the upper-layer and lower-layer feature maps are spliced, they are fed into a residual structure to fuse the features and reduce aliasing effects. In order to reduce the parameter count and weight of the network, a lightweight CSPGhostNeck residual structure is designed. YOLO-M uses CSPGhostNeck to fuse the spliced features, effectively reducing the size of the network.
In one embodiment, the step S103 of predicting the fused image to obtain a predicted image includes: and performing convolution and matrix rearrangement operations on the first fusion image, the second fusion image and the third fusion image in the three prediction heads respectively, and adjusting the characteristic dimensionality of each fusion image to a uniform numerical value to obtain three prediction images.
In the prediction stage, given the large span of target scales in UAV aerial images, 3 prediction heads are used to predict targets of different scales. A prediction head consists of convolution and matrix rearrangement operations and adjusts the dimensionality of the feature map to a uniform value. If the number of categories in the data set is c and the number of prior boxes is n, the adjusted dimensionality C is calculated as shown in formula (2).
C=n×(c+4+1) (2)
In the formula, 4 represents the four coordinate offsets, 1 represents whether the prior box contains a target, and c is the number of classes, giving the probability that the prior box belongs to each class.
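As an exemplary calculation of formula (2), assuming n = 3 prior boxes per prediction head (a common YOLO setting; the exact value is not fixed by the text above):

```python
def head_channels(num_classes: int, num_priors: int = 3) -> int:
    # 4 box-coordinate offsets + 1 objectness flag + one probability per class
    return num_priors * (num_classes + 4 + 1)

print(head_channels(10))  # VisDrone2021-DET, 10 classes -> C = 45
print(head_channels(1))   # CARPK, 1 class               -> C = 18
```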
In one embodiment, step S104 of post-processing the predicted image to obtain the final detection result of the input image includes: setting a confidence threshold and removing prior boxes in the three predicted images whose confidence is below the threshold; using the NMS algorithm, setting an IoU threshold and comparing the intersection-over-union of the prior boxes and the ground-truth boxes in the three predicted images; among the prior boxes whose IoU is above the threshold, retaining the one with the highest value; and aggregating all prior boxes retained by the 3 prediction heads to obtain the final detection result of the input image.
In the post-processing stage, the outputs of the 3 prediction heads are post-processed to screen out valid prior boxes. A confidence threshold is set, and prior boxes with confidence below the threshold are removed; according to the NMS algorithm, an IoU threshold is set and the intersection-over-union of the prior boxes and the ground-truth boxes is compared. Among the prior boxes whose IoU is above the threshold, the one with the highest value is retained. The post-processing outputs of the 3 prediction heads are aggregated to obtain the final detection result.
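An illustrative sketch of this post-processing step is given below, using confidence filtering followed by standard NMS; torchvision.ops.nms is used for brevity, and the thresholds are exemplary placeholder values rather than values from the original disclosure.

```python
import torch
from torchvision.ops import nms

def postprocess(boxes: torch.Tensor, scores: torch.Tensor,
                conf_thres: float = 0.25, iou_thres: float = 0.45):
    """boxes: (N, 4) in (x1, y1, x2, y2); scores: (N,) confidence values."""
    keep = scores >= conf_thres                # drop low-confidence prior boxes
    boxes, scores = boxes[keep], scores[keep]
    keep_idx = nms(boxes, scores, iou_thres)   # keep the highest-scoring box
    return boxes[keep_idx], scores[keep_idx]   # among overlapping candidates

# The filtered boxes from the three prediction heads would then be aggregated
# (e.g. concatenated) to form the final detection result.
```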
Further, the first residual structure, the second residual structure, the third residual structure and the fourth residual structure all use the same residual structure, which includes a first channel structure and a second channel structure. The first channel structure is a 1×1 Conv structure. The second channel structure includes a GhostCBS structure, a GhostNeck structure, a Concat structure and a 1×1 CBS structure. The GhostNeck structure includes a GhostCBS structure, an SENet structure, a Concat structure and a GhostCBS structure, where the input of the first GhostCBS structure is also one input of the Concat structure. The SENet structure includes a Flatten structure, a 1×1 Conv structure and a Multiply structure, where the input of the Flatten structure is also one input of the Multiply structure. The GhostCBS structure includes a 1×1 CBS structure, a 5×5 DWCBS structure and a Concat structure, where the output of the 1×1 CBS structure is one input of the Concat structure. The first channel structure and the GhostCBS structure of the second channel structure have the same input, and the output of the first channel structure is one input of the Concat structure of the second channel structure. The 1×1 Conv structure performs a 1×1 convolution on the image, the Concat structure splices images, the GhostCBS structure reduces the dimensionality of the image, and the 1×1 CBS structure restores the dimensionality of the image.
In this embodiment, in order to reduce the size of the network and reduce its parameters and weight, a lightweight residual structure called CSPGhostNeck is designed.
The structure of CSPGhostNeck is shown in FIG. 3. In the figure, CBS denotes the convolution-batch normalization-SiLU activation operation, DWCBS denotes the depthwise convolution-batch normalization-SiLU activation operation, Conv denotes a convolution operation, Flatten denotes a flattening operation, Concat denotes a splicing operation, Add denotes an addition operation, and Multiply denotes a multiplication operation.
FIG. 3 (a) shows the overall structure of CSPGhostNeck; its building blocks include the GhostNeck structure, the GhostCBS structure, a Conv operation and a Concat operation. CSPGhostNeck comprises two channels: one channel performs a 1×1 convolution, reducing the dimensionality to half of the original; the other channel is first fed into a GhostCBS structure to reduce dimensionality and then into N GhostNeck structures. The results of the two channels are spliced and fed into a 1×1 CBS structure to restore the dimensionality, yielding the output image. After passing through CSPGhostNeck, the width, height and dimensionality of the image are unchanged; its function is to fuse the spliced features and reduce aliasing effects. Using the CSPGhostNeck residual structure effectively reduces the parameter count and weight of the detection network. The calculation process of CSPGhostNeck is shown in formula (3), where x is the input image and y is the output image.
y = CBS_1×1(Concat(Conv_1×1(x), GhostNeck_N(GhostCBS(x))))    (3)
FIG. 3 (b) shows the GhostNeck structure, an important component of CSPGhostNeck. GhostNeck feeds the input image into a GhostCBS structure to reduce dimensionality, then into an SENet structure for dimension weighting; the result is spliced with the input image and fed into a GhostCBS structure to restore the dimensionality, yielding the output image. After passing through the GhostNeck structure, the width, height and dimensionality of the image are unchanged. The calculation process of GhostNeck is shown in formula (4).
y = GhostCBS(Concat(x, SENet(GhostCBS(x))))    (4)
FIG. 3 (c) shows the SENet structure, an important component of GhostNeck. SENet is an attention mechanism: the input image is flattened and fed into a 1×1 convolution to reduce dimensionality, then activated by the SiLU function; it is then fed into a 1×1 convolution to restore the dimensionality and activated by the Sigmoid function to obtain the dimension weights; finally, the input image is multiplied by the dimension weights to obtain the dimension-weighted output image. After the SENet structure, the width, height and dimensionality of the output image are the same as those of the input image. The calculation process of SENet is shown in formula (5).
y = Multiply(x, Conv_1×1,Sigmoid(Conv_1×1,SiLU(Flatten(x))))    (5)
FIG. 3 (d) shows the GhostCBS structure, which is used in both CSPGhostNeck and GhostNeck. GhostCBS first feeds the input image into a 1×1 CBS structure, changing the dimensionality to half of the output dimensionality. The result is then split into two channels: one channel is left unchanged, and the other is fed into a 5×5 DWCBS structure for depthwise convolution. The results of the two channels are spliced to obtain the output image. After passing through the GhostCBS structure, the width and height of the output image are the same as those of the input image, and the output dimensionality can be set as required. The calculation process of GhostCBS is shown in formula (6).
y = Concat(CBS_1×1(x), DWCBS_5×5(CBS_1×1(x)))    (6)
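The following PyTorch sketch is an illustrative reading of formulas (3) to (6) and is not the reference implementation of the disclosure; in particular, the "Flatten" step of SENet is approximated here by global average pooling (the usual squeeze-and-excitation squeeze), and the compression ratio of 0.5 is taken from the comparison below; both are assumptions.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1, p=0, g=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, groups=g, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class GhostCBS(nn.Module):
    """Formula (6): y = Concat(CBS_1x1(x), DWCBS_5x5(CBS_1x1(x)))."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.primary = CBS(c_in, c_out // 2, k=1)
        self.cheap = CBS(c_out // 2, c_out // 2, k=5, p=2, g=c_out // 2)  # depthwise
    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class SENet(nn.Module):
    """Formula (5): dimension weighting; 'Flatten' approximated by global pooling (assumption)."""
    def __init__(self, c, ratio=0.5):
        super().__init__()
        c_med = max(1, int(c * ratio))
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        self.fc1, self.fc2 = nn.Conv2d(c, c_med, 1), nn.Conv2d(c_med, c, 1)
        self.silu, self.sigmoid = nn.SiLU(), nn.Sigmoid()
    def forward(self, x):
        w = self.sigmoid(self.fc2(self.silu(self.fc1(self.squeeze(x)))))
        return x * w                               # Multiply(x, dimension weights)

class GhostNeck(nn.Module):
    """Formula (4): y = GhostCBS(Concat(x, SENet(GhostCBS(x))))."""
    def __init__(self, c):
        super().__init__()
        self.reduce = GhostCBS(c, c // 2)
        self.se = SENet(c // 2)
        self.restore = GhostCBS(c + c // 2, c)
    def forward(self, x):
        return self.restore(torch.cat([x, self.se(self.reduce(x))], dim=1))

class CSPGhostNeck(nn.Module):
    """Formula (3): two channels, spliced and restored by a 1x1 CBS."""
    def __init__(self, c, n=1):
        super().__init__()
        self.shortcut = nn.Conv2d(c, c // 2, 1)        # 1x1 Conv, halve dims
        self.main = nn.Sequential(GhostCBS(c, c // 2),
                                  *[GhostNeck(c // 2) for _ in range(n)])
        self.restore = CBS(c, c, k=1)
    def forward(self, x):
        return self.restore(torch.cat([self.shortcut(x), self.main(x)], dim=1))

x = torch.randn(1, 64, 40, 40)
print(CSPGhostNeck(64, n=1)(x).shape)  # width, height and dims unchanged: (1, 64, 40, 40)
```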
In this example, CSPGhostNeck can be viewed as a combination of GhostNeck and CSPNet: GhostNeck is the internal component of CSPGhostNeck, while CSPNet is its external component. Functionally, BottleNeck plays the same role as GhostNeck, and CommonNet has the same effect as CSPNet; they differ in structure, parameter count and weight.
Parameters are stored on the storage device as bytes and materialize as a weight file, so the parameter count and the weight are positively correlated: as the parameter count decreases, the weight decreases. In the following, the structure and parameter count of GhostNeck are compared with those of BottleNeck, and those of CSPNet with those of CommonNet, showing at a theoretical level that GhostNeck and CSPNet can effectively reduce the parameter count and weight of a network; CSPGhostNeck combines the two and is therefore a lightweight residual structure.
1) Comparison of GhostNeck with BottleNeck
GhostNeck and BottleNeck are both residual structures, which allow the network to be deepened while avoiding vanishing gradients. The structures of GhostNeck and BottleNeck are shown in FIG. 4.
First, the two structures are compared. GhostNeck reduces dimensionality with GhostCBS, while BottleNeck reduces dimensionality with a 1×1 CBS; GhostNeck uses an attention mechanism, BottleNeck does not; GhostNeck splices the input image with the intermediate result and then feeds it into a GhostCBS to restore the dimensionality, while BottleNeck restores the dimensionality with a 3×3 CBS and finally adds the input image to the intermediate result. Neither GhostNeck nor BottleNeck changes the shape of the image.
Then, the parameter counts of the two are compared. The splicing and addition operations do not produce parameters; the main source of network parameters is the convolution operation. Let the input dimension be C_in, the output dimension be C_out, and the convolution kernel size be n×n. Ignoring bias terms, the parameter count of a CBS is
Params = C_in × n × n × C_out    (7)
The parameter count of a DWCBS is
Params = C_in × n × n    (8)
For SENet, the convolution kernel size is fixed at 1×1 and the output dimension equals the input dimension. The parameter count of SENet is
Params = 2 × C_in × C_med    (9)
where C_med is the intermediate dimension, equal to the input dimension C_in multiplied by the compression ratio, i.e., C_med = C_in × ratio.
For both GhostNeck and BottleNeck, the input and output dimensions are C. The 1×1 CBS structure halves the dimension; the 5×5 DWCBS structure leaves the dimension unchanged; the 3×3 CBS structure doubles the dimension. In the SENet structure, the compression ratio is set to 0.5, i.e., the dimension of the intermediate feature map equals half the dimension of the input image. The parameter count of GhostNeck is then
P_GN = (C²/4 + 25C/4) + C²/4 + (3C²/4 + 25C/2) = (5/4)C² + (75/4)C    (10)
The parameters of BottleNeck are
P_BN = C × 1 × 1 × (C/2) + (C/2) × 3 × 3 × C = 5C²    (11)
The ratio of the parameters of GhostNeck to BottleNeck is
R = P_GN / P_BN = ((5/4)C² + (75/4)C) / (5C²) = 1/4 + 15/(4C)    (12)
In an actual network, C takes the values 64, 128 and 256, for which R takes the values 0.26, 0.28 and 0.31, with a minimum of 0.26 and a maximum of 0.31. Replacing BottleNeck with GhostNeck therefore reduces the parameter count to between 0.26 and 0.31 of the original, so GhostNeck greatly reduces the parameter count and hence the weight, making the network lightweight.
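The ratio values quoted above can be checked numerically from the per-layer counts of formulas (7) to (9) and the dimension settings stated in the text (the 1×1 CBS halves the dimension, the 5×5 DWCBS keeps it, and the SE compression ratio is 0.5); the layer-by-layer bookkeeping below is an exemplary reconstruction rather than code from the original disclosure.

```python
def cbs(c_in, c_out, k):          # formula (7)
    return c_in * k * k * c_out

def dwcbs(c_in, k):               # formula (8)
    return c_in * k * k

def senet(c_in, ratio=0.5):       # formula (9)
    return 2 * c_in * int(c_in * ratio)

def ghost_cbs(c_in, c_out):       # 1x1 CBS to half of c_out, then 5x5 depthwise
    return cbs(c_in, c_out // 2, 1) + dwcbs(c_out // 2, 5)

def ghostneck(c):                 # GhostCBS -> SENet -> Concat -> GhostCBS
    return ghost_cbs(c, c // 2) + senet(c // 2) + ghost_cbs(c + c // 2, c)

def bottleneck(c):                # 1x1 CBS to c/2, then 3x3 CBS back to c
    return cbs(c, c // 2, 1) + cbs(c // 2, c, 3)

for c in (64, 128, 256):
    print(c, round(ghostneck(c) / bottleneck(c), 2))  # 0.31, 0.28, 0.26
```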
2) Comparison of CSPNet with CommonNet
CSPNet is a two-channel structure whose main channel can feed into any network unit and whose auxiliary channel can be adjusted as needed; CommonNet is a single-channel structure that feeds the input image directly into the network unit. CSPNet is a design concept and can have different implementations. The CSPNet implemented in this section comprises a GhostCBS structure, a CBS structure, a Conv operation, a Concat operation and a main-channel unit. Taking a 3×3 CBS as the main-channel unit as an example, the differences between CSPNet and CommonNet are compared below. The structures of CSPNet and CommonNet are shown in FIG. 5.
First, the two structures are compared. CSPNet contains two channels: the main channel is fed into a GhostCBS structure, which halves the dimension, before being fed into the 3×3 CBS structure; the auxiliary channel is processed with a 1×1 convolution. The results of the two channels are spliced and fed into a 1×1 CBS structure to restore the dimensionality, yielding the output image. CommonNet simply feeds the input image directly into a 3×3 CBS structure.
Then, the parameter counts of the two are compared. With the input and output dimensions both equal to C, the parameter count of CSPNet is
P_CSP = (C²/4 + 25C/4) + (C/2) × 3 × 3 × (C/2) + C × 1 × 1 × (C/2) + C × 1 × 1 × C = 4C² + (25/4)C    (13)
The parameter count of CommonNet is
P_CN = C × 3 × 3 × C = 9C²    (14)
The ratio of the parameter counts of CSPNet and CommonNet is
R = P_CSP / P_CN = 4/9 + 25/(36C)    (15)
For convolution on visible-light images the dimension C is at least 3, so the parameter ratio R of CSPNet to CommonNet is at most 0.68. Compared with CommonNet, CSPNet can effectively reduce the parameter count and weight, and is an effective means of reducing the size of the network.
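As with the previous comparison, the bound quoted above can be checked numerically; the layer-by-layer bookkeeping below is again an exemplary reading of the text rather than the original derivation.

```python
def cbs(c_in, c_out, k):          # formula (7), with float dims so small C works
    return c_in * k * k * c_out

def ghost_cbs(c_in, c_out):       # 1x1 CBS to half of c_out, then 5x5 depthwise
    return cbs(c_in, c_out / 2, 1) + (c_out / 2) * 25

def cspnet(c):                    # main: GhostCBS + 3x3 CBS; side: 1x1 Conv; restore: 1x1 CBS
    return ghost_cbs(c, c / 2) + cbs(c / 2, c / 2, 3) + cbs(c, c / 2, 1) + cbs(c, c, 1)

def commonnet(c):                 # a single 3x3 CBS, formula (14): 9*C^2
    return cbs(c, c, 3)

for c in (3, 64, 256):
    print(c, round(cspnet(c) / commonnet(c), 2))  # 0.68, 0.46, 0.45 -- always <= 0.68
```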
CSPGhostNeck combines the advantages of GhostNeck and CSPNet, so its effect in reducing the parameter count and weight is even more pronounced, making the network more lightweight.
In the feature fusion stage, in order to fuse features with a higher-level feature map, an upsampling operation must be performed on the lower-level feature map. YOLO-M introduces transposed convolution to complete the upsampling operation, so that the lower-level feature map retains more complete information and its fusion with the higher-level feature map is more effective.
Transposed convolution, also known as deconvolution, is the inverse operation of convolution: the input and output of a convolution are in a many-to-one relationship, while the input and output of a transposed convolution are in a one-to-many relationship.
In general, when an input image is passed through a transposed convolution, the size of the output image is
O = (I − 1) × s + k − 2 × p + op    (16)
Where O represents the output size, I represents the input size, s represents the step size, k represents the size of the convolution kernel, p is the input pad, and op is the output pad.
By setting the parameters of the transposed convolution, the upsampling operation can be completed. In the lightweight YOLO networks, the low-level feature map must be upsampled by a factor of two, so the transposed convolution kernel size is set to 3×3, the stride to 2, the input padding to 1 and the output padding to 1. From formula (16), the size of the output image is then twice that of the input image. Finally, batch normalization and SiLU activation are applied to the enlarged image to obtain normalized features.
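An exemplary check of this configuration with PyTorch's ConvTranspose2d is shown below; the channel numbers are placeholders chosen for the example, not values from the original disclosure.

```python
import torch
import torch.nn as nn

# k=3, s=2, p=1, op=1: by formula (16), O = (I - 1)*2 + 3 - 2 + 1 = 2*I
up = nn.Sequential(
    nn.ConvTranspose2d(256, 128, kernel_size=3, stride=2, padding=1,
                       output_padding=1, bias=False),
    nn.BatchNorm2d(128),   # batch normalization and SiLU after upsampling
    nn.SiLU(),
)

x = torch.randn(1, 256, 20, 20)
print(up(x).shape)   # torch.Size([1, 128, 40, 40]) -- spatial size doubled
```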
The traditional method uses nearest-neighbor interpolation for upsampling, taking the pixel value of the nearest neighboring point as the value of the sampled point; it is simple and fast to compute, but the resulting image is prone to mosaic and jagged artifacts. Nearest-neighbor interpolation is a manually set upsampling rule with fixed, rigid calculation parameters, whereas transposed convolution lets the detection network learn the upsampling rule, with flexible rather than fixed parameters. Compared with nearest-neighbor interpolation, transposed convolution makes the upsampling rule better suited to target detection, thereby improving detection accuracy.
In one embodiment, in the generation of the first fused image, a YOLO-L network architecture is used, and GhostCBS structures are used to adjust the dimensionality.
In the embodiment, the image target detection method uses a YOLO-L network design, and the structure of the YOLO-L is shown in fig. 6, and similarly includes four stages of feature extraction, feature fusion, prediction, and post-processing.
First, in the feature extraction stage, YOLO-L extracts features using the CBS structure, the CSPBottleNeck residual structure and the SPPF structure. The CBS performs downsampling and extracts features of different levels; CSPBottleNeck deepens the network, prevents vanishing gradients and enhances the feature extraction capability of the network; SPPF increases the receptive field and aggregates high-level semantic features. Five feature maps of different sizes are generated through repeated downsampling. The smallest feature map is fed into the SPPF structure to obtain more condensed semantic information. The results of the last three downsampling operations and of the SPPF are kept as the input of the feature fusion network.
Then, in the feature fusion stage, PANet is selected as the feature fusion network; its building blocks include the GhostCBS structure, the CSPBottleNeck residual structure, transposed convolution, Maxpool max-pooling and splicing operations. In the bottom-up path of the PANet, the dimensionality of the small-size feature map is reduced through a GhostCBS structure, and transposed convolution is then applied to complete effective upsampling; the result is spliced with the upper-level feature map and fed into a CSPBottleNeck residual structure so that the spliced features interact. In the top-down path of the PANet, the large-size feature map is downsampled using Maxpool max-pooling, spliced with the lower-level feature map, and finally fed into a CSPBottleNeck residual structure for fusion. The feature fusion network PANet outputs three fused feature maps of different sizes as the input of the prediction stage.
And finally, in the prediction and post-processing stage, performing prediction and post-processing on the feature maps of the three sizes, summarizing the output of the post-processing, and obtaining a detection result.
YOLO-L uses the CSPBottleNeck residual structure to fuse features more effectively. Compared with CSPGhostNeck, CSPBottleNeck has more parameters, a larger weight and more complex computation, but processes features better. In the feature fusion stage, YOLO-L introduces transposed convolution to complete the upsampling operation, retaining the information of the small-size feature maps to the greatest extent and enhancing the performance of the feature fusion network. Although CSPBottleNeck and transposed convolution improve the detection accuracy of the network, they also increase its size. To offset their influence, in the feature fusion stage YOLO-L replaces the CBS structure with a GhostCBS structure for adjusting the dimensionality of the low-level feature maps, reducing the parameter count and weight, and replaces the CBS structure with Maxpool max-pooling for downsampling, which introduces no additional parameters or weight. As a result, YOLO-L achieves higher detection accuracy with a relatively low weight and parameter count.
In one embodiment, in the generation of the first fused image, a YOLO-S network architecture is used, and the transposed convolution operation is replaced by a nearest-neighbor interpolation upsampling operation.
In this embodiment, the image target detection method uses the YOLO-S network design; the structure of YOLO-S is shown in FIG. 7 and likewise includes the four stages of feature extraction, feature fusion, prediction and post-processing. In the figure, Nearest denotes the nearest-neighbor interpolation upsampling operation.
In the feature extraction stage, YOLO-S extracts deep features using the CBS structure, the CSPBottleNeck residual structure and the SPPF structure, and outputs three feature maps of different sizes. In the feature fusion stage, YOLO-S uses PANet as the feature fusion network to fuse the features extracted in the previous stage. In the prediction and post-processing stages, YOLO-S predicts on the three feature maps, detects targets of multiple scales through post-processing such as the NMS algorithm, and aggregates target boxes of different scales to obtain the detection result.
To achieve a smaller parameter count and weight, YOLO-S uses the CSPGhostNeck residual structure to fuse features and performs the upsampling operation with nearest-neighbor interpolation. Compared with CSPBottleNeck, CSPGhostNeck has fewer parameters, a smaller weight and simpler computation; compared with transposed convolution, nearest-neighbor interpolation has no learnable parameters and adds no parameters or weight. YOLO-S therefore has the fewest parameters, the smallest weight and the highest degree of lightweighting.
In one embodiment, in order to verify the effectiveness of the CSPGhostNeck residual structure and the transposed convolution, they are added to a reference network for ablation experiments; to verify the performance of YOLO-L, YOLO-M and YOLO-S, algorithm comparison experiments are carried out.
(1) Data set VisDrone2021-DET
VisDrone2021-DET is a UAV aerial image data set released by the AISKYEYE team of Tianjin University, captured by UAVs carrying cameras of different models in different scenes and under different weather and illumination conditions. The data set is annotated with 10 classes: pedestrian, person, bicycle, car, van, truck, tricycle, awning tricycle, bus and motorcycle. The sample distribution of VisDrone2021-DET is non-uniform and long-tailed: cars are the most numerous class with 144,625 samples, while awning tricycles are the least numerous with only 3,244. VisDrone2021-DET is divided into a training set of 6,471 images, a validation set of 548 images and a test set of 1,610 images. FIG. 8 shows some images from the data set VisDrone2021-DET.
As can be seen from the figure, some images were taken in the daytime and others in the evening, so brightness varies widely; some objects do not fully enter the frame and are truncated; targets are small and dense and are sometimes occluded; the distance between targets and the camera varies, so the number of pixels a target occupies spans a wide range; and differences in illumination intensity, shooting distance, camera model and the like make some images blurred and low in resolution. Considering the influence of these many factors, target detection on the data set VisDrone2021-DET is a very challenging task.
(2) Data set CARPK
CARPK is a UAV aerial parking-lot data set with a single class, car, containing nearly 90,000 cars; it was collected by a PHANTOM 3 drone over 4 different parking lots at an altitude of 40 m. Cars truncated by the image boundary are labeled as long as they can be identified as cars. CARPK is divided into a training set of 1,120 images, a validation set of 125 images and a test set of 203 images, so the amount of data is small. FIG. 9 shows some images from the data set CARPK.
As can be seen from the figure, some images are over-exposed and image brightness varies widely; targets are small and dense, but there are clear boundaries between them; and some objects do not fully enter the frame and are truncated by the image boundary. Target detection on the data set CARPK requires accurate localization of the targets.
Experiment setting and evaluation indexes:
For the three proposed lightweight UAV aerial image target detection networks, the batch size is set to 32, stochastic gradient descent (SGD) is selected as the optimizer, and a loss function that sums the confidence loss, classification loss and regression loss is used for training on a single GPU. On the data set CARPK the number of training iterations is set to 200; on the data set VisDrone2021-DET it is set to 300. The other networks are trained according to their official training strategies.
The evaluation indexes are the parameter count, the weight and the mAP. The parameter count is the number of trainable parameters of the network: the larger it is, the heavier the network, and vice versa. The weight is the size of the network's weight file; it is the most intuitive indication of whether the network is lightweight and determines whether the network can be deployed on mobile devices. mAP is the mean average precision: each class has an AP value, and mAP is the average AP over all classes. mAP is an important evaluation index for target detection and reflects the detection performance of the network; the higher the mAP, the better the detection.
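The parameter count is simply the number of trainable weights; an exemplary PyTorch helper is sketched below.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters; the weight-file size scales with this count."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Sanity check on a single 3x3 convolution (64 -> 64 channels, no bias):
layer = nn.Conv2d(64, 64, kernel_size=3, bias=False)
print(count_parameters(layer))  # 64 * 3 * 3 * 64 = 36864, cf. formula (7)
```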
Ablation experiment:
the reference network selected by the invention is YOLOv5n; ablation experiments are carried out on the data sets CARPK and VisDrone2021-DET with an input image size of 640 × 640. The CSPGhostNeck residual structure and the transposed convolution are first used separately in the reference network to verify their individual effects; they are then used simultaneously in the reference network to check whether they are mutually exclusive.
(1) Ablation experiments with the CSPGhostNeck residual structure and the transposed convolution
Experiments are performed on the data set CARPK using the CSPGhostNeck residual structure and the transposed convolution separately; the experimental results are shown in Table 1, where Baseline denotes the reference network, Baseline+CSPGhostNeck denotes the reference network with the CSPGhostNeck residual structure used in the feature fusion stage, and Baseline+Deconv denotes the reference network with the transposed convolution introduced.
Table 1 ablation experimental results on data set CARPK
As can be seen from the table, compared with Baseline, Baseline+CSPGhostNeck reduces the number of parameters by 0.28M and the weight by 0.59MB, while the mAP drops by 2.0%. The CSPGhostNeck residual structure reduces the number of parameters and the weight because CSPGhostNeck sends the input into two channels and takes the concatenation of the two channels as the output, which lowers the dimensionality of the intermediate feature maps; in addition, CSPGhostNeck contains multiple GhostNeck structures, which use depthwise convolution to generate most of the feature maps, further reducing the parameters and weight. However, the lightweight CSPGhostNeck residual structure degrades the detection performance of the network.
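To make the parameter saving concrete, the following sketch (PyTorch assumed, illustrative channel sizes) compares an ordinary 3×3 convolution with a Ghost-style module in which half of the output channels come from a 1×1 convolution and the other half from a cheap 5×5 depthwise convolution; it mirrors the idea described above, not the exact layer configuration of the networks.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Half the outputs from a 1x1 conv, the other half from a cheap depthwise conv."""
    def __init__(self, c_in, c_out, dw_kernel=5):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.cheap = nn.Conv2d(c_half, c_half, dw_kernel,
                               padding=dw_kernel // 2, groups=c_half, bias=False)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(nn.Conv2d(128, 128, 3, padding=1, bias=False)))  # 147456
print(n_params(GhostModule(128, 128)))                          # 9792
```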
Compared with Baseline, Baseline+Deconv improves the mAP by 1.2%, while the number of parameters increases by 0.19M and the weight by 0.37MB. The transposed convolution improves the detection accuracy of the network because it allows the network to learn a suitable upsampling rule, so that the upsampled feature map retains more complete semantic information and feature fusion is performed better. However, the transposed convolution requires a convolution kernel, so introducing it increases the parameters and weight of the network.
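The trade-off can be seen directly from the two upsampling options (PyTorch assumed; the 2× kernel/stride configuration and the 256 channels are illustrative): nearest-neighbor interpolation is parameter-free, while a transposed convolution adds a learnable kernel.

```python
import torch
import torch.nn as nn

c = 256
x = torch.randn(1, c, 20, 20)

nearest = nn.Upsample(scale_factor=2, mode="nearest")        # fixed rule, 0 parameters
deconv = nn.ConvTranspose2d(c, c, kernel_size=2, stride=2)   # learned upsampling rule

print(nearest(x).shape, deconv(x).shape)                      # both (1, 256, 40, 40)
print(sum(p.numel() for p in deconv.parameters()))            # 262400 extra parameters
```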
Experiments are performed on the data set VisDrone2021-DET using the CSPGhostNeck residual structure and the transposed convolution separately; the experimental results are shown in Table 2.
TABLE 2 results of ablation experiments on data set VisDrone2021-DET
As can be seen from the table, when the experiments are performed on the data set VisDrone2021-DET, the parameters and weight of the network increase because CARPK has only one class while VisDrone2021-DET has 10 classes, which leads to different prediction heads. The mAP on VisDrone2021-DET is clearly lower than that on CARPK because VisDrone2021-DET has numerous classes, complex backgrounds and other difficulties, making detection harder. Overall, the same trends are observed on VisDrone2021-DET as on CARPK when the CSPGhostNeck residual structure and the transposed convolution are used separately: with the CSPGhostNeck residual structure, the number of parameters, the weight and the mAP of the network all decrease; with the transposed convolution, the number of parameters, the weight and the mAP all increase.
The experimental results show that the CSPGhostNeck residual structure effectively reduces the number of parameters and the weight of the network at the cost of a small loss in detection accuracy; the transposed convolution effectively improves the detection accuracy at the cost of a small increase in parameters and weight; and CSPGhostNeck and the transposed convolution behave the same way on different data sets, showing good applicability.
(2) Mutual exclusivity verification of CSPGhostNeck residual structure and transposed convolution
Experiments are performed on the data set CARPK using the CSPGhostNeck residual structure and the transposed convolution simultaneously; the results are shown in Table 3.
Table 3 mutual exclusion verification test results on data set CARPK
As can be seen from the table, compared with Baseline, Baseline+CSPGhostNeck+Deconv reduces the number of parameters by 0.1M and the weight by 0.22MB, while the mAP drops by 1.5%. From Table 1, using the CSPGhostNeck residual structure alone reduces the parameters by 0.28M and the weight by 0.59MB and lowers the mAP by 2.0%, while introducing the transposed convolution alone increases the parameters by 0.19M and the weight by 0.37MB and improves the mAP by 1.2%. The net difference of the two is a reduction of 0.09M in parameters, 0.22MB in weight and 0.8% in mAP, which is approximately the same as the change observed when the CSPGhostNeck residual structure and the transposed convolution are used simultaneously. Structurally, CSPGhostNeck replaces CSPBottleneck for feature fusion, which reduces the parameters, weight and mAP, while the transposed convolution replaces nearest-neighbor interpolation for upsampling, which increases the parameters, weight and mAP. The CSPGhostNeck residual structure and the transposed convolution are two different network units, so their effects are additive, and the effect of using them together is approximately equal to the sum of the effects of using them separately.
Experiments are performed on the data set VisDrone2021-DET using the CSPGhostNeck residual structure and the transposed convolution simultaneously; the experimental results are shown in Table 4.
TABLE 4 results of mutual exclusion verification experiments on data set VisDrone2021-DET
As can be seen from the table, the results on the data set VisDrone2021-DET show the same pattern as those on the data set CARPK: compared with Baseline, using the CSPGhostNeck residual structure alone reduces the parameters by 0.28M and the weight by 0.59MB and lowers the mAP by 1.3%; introducing the transposed convolution alone increases the parameters by 0.19M and the weight by 0.38MB and improves the mAP by 0.2%; and using the CSPGhostNeck residual structure and the transposed convolution simultaneously reduces the parameters by 0.1M, the weight by 0.21MB and the mAP by 0.8%. The effect of using the CSPGhostNeck residual structure and the transposed convolution together is approximately equal to the sum of the effects of using them separately.
The experimental results show that the CSPGhostNeck residual structure and the transposed convolution are not mutually exclusive: their effects are additive, and the effect of using them together is approximately equal to the sum of the effects of using them separately. They behave the same way on different data sets and show good applicability.
Comparative experiment:
comparative experiments are performed on the data sets CARPK and VisDrone2021-DET with the input image size set to 640 × 640. The three lightweight unmanned aerial vehicle aerial image target detection networks proposed herein are compared with existing algorithms; the algorithms involved in the comparison include YOLOv3-Tiny, YOLOv4-Tiny, YOLOX-Nano, YOLOv5n, YOLO-L, YOLO-M and YOLO-S.
(1) Comparative experiments based on dataset CARPK
Comparative experiments were performed on the data set CARPK and the results are shown in table 5.
Table 5 comparative experimental results on data set CARPK
In the table, YOLOv5n is the reference network, and YOLOv3-Tiny, YOLOv4-Tiny and YOLOX-Nano are the other networks compared. As can be seen from the table, the number of parameters and the weight are not in strict one-to-one correspondence, because the weight file size also depends on how the parameters are encoded and stored. In practical applications, the weight is the more important factor.
First, YOLO-L, YOLO-M and YOLO-S are compared with the reference network YOLOv5n. Compared with YOLOv5n, YOLO-L reduces the parameters and weight while improving the mAP, whereas the parameters, weight and mAP of YOLO-M and YOLO-S are all reduced. Specifically, in percentage terms, YOLO-L reduces the parameters by 1.1% and the weight by 0.8% and increases the mAP by 1.0%. YOLO-L performs upsampling with the transposed convolution, which improves the mAP; however, the transposed convolution introduces a convolution kernel, so to offset its influence on the network size, YOLO-L performs downsampling with a Ghost CBS structure for dimensionality adjustment followed by a MaxPool max-pooling operation, successfully reducing the parameters and weight. For YOLO-M and YOLO-S, YOLO-M reduces the parameters by 5.7% and the weight by 5.7% while the mAP drops by only 1.8%; YOLO-S reduces the parameters by 15.9% and the weight by 15.4% while the mAP drops by only 2.4%. YOLO-S uses the CSPGhostNeck residual structure to reduce the parameters and weight, but CSPGhostNeck has a weaker feature-processing capability than CSPBottleneck, so the mAP drops; YOLO-M uses the CSPGhostNeck residual structure and the transposed convolution simultaneously, balancing lightness and accuracy, so its size is smaller than that of YOLO-L and its detection accuracy is higher than that of YOLO-S.
Then, YOLO-L, YOLO-M and YOLO-S are compared with the other networks. YOLOX-Nano has the smallest number of parameters, only 0.90M; YOLO-S has the smallest weight, only 3.24MB; and YOLO-L has the largest mAP, reaching 82.8%. To compare the performance of the networks visually, a scatter plot is drawn with the weight as the horizontal coordinate and the mAP as the vertical coordinate, as shown in FIG. 10.
In FIG. 10, the closer a network is to the upper left corner, the smaller its weight, the higher its mAP, and the stronger its overall performance. As can be seen from the figure, YOLOv4-Tiny has the largest weight and the lowest mAP and is the network with the weakest overall performance; YOLOX-Nano has a small weight but its mAP ranks second to last, while YOLOv3-Tiny has a higher mAP but its weight ranks second to last, so both have relatively weak overall performance. YOLO-L, YOLO-M and YOLO-S have small weights and high mAP values, lie close to the upper left corner of the scatter plot, and are the networks with the strongest overall performance.
(2) Comparative experiments on data set VisDrone2021-DET
Comparative experiments were performed on the data set VisDrone2021-DET, the results of which are shown in Table 6.
TABLE 6 comparative experimental results on data set VisDrone2021-DET
Compared with the reference network YOLOv5n, the experimental results on the data set VisDrone2021-DET show the same trends as those on the data set CARPK: YOLO-L reduces the parameters and weight while improving the mAP, whereas YOLO-M and YOLO-S sacrifice part of the mAP and greatly reduce the parameters and weight.
Compared with the other networks, the results on VisDrone2021-DET differ from those on CARPK. YOLOX-Nano has the smallest number of parameters, only 0.90M; YOLO-S has the smallest weight, only 3.26MB; and YOLO-L has the largest mAP, reaching 13.1%. A scatter plot is again drawn with the weight and the mAP as the horizontal and vertical coordinates, as shown in FIG. 11.
As can be seen from the figure, YOLOv3-Tiny has the lowest mAP and its weight ranks second to last, while YOLOv4-Tiny has the largest weight and its mAP ranks second to last; both have relatively weak overall performance. YOLOX-Nano has a high mAP and a small weight and performs well overall. YOLO-L, YOLO-M and YOLO-S have small weights and high mAP values, lie close to the upper left corner of the scatter plot, and show outstanding overall performance. Across the two unmanned aerial vehicle aerial photography data sets, the overall performance of YOLOv3-Tiny, YOLOv4-Tiny and YOLOX-Nano fluctuates considerably, whereas YOLO-L, YOLO-M and YOLO-S stay in the upper left corner of the scatter plots, showing strong applicability.
The experimental results show that YOLO-L both reduces the size of the network and improves the detection accuracy, making it an excellent lightweight unmanned aerial vehicle aerial image target detection network, while YOLO-M and YOLO-S greatly reduce the parameters and weight of the network at the cost of part of the detection accuracy and are highly lightweight unmanned aerial vehicle aerial image detection networks. YOLO-L, YOLO-M and YOLO-S have small weights and high mAP values, perform stably on different unmanned aerial vehicle aerial photography data sets, and outperform the other networks in comparison. FIG. 12 shows visual detection results of YOLO-L, YOLO-M and YOLO-S on the data set CARPK.
Therefore, aiming at the heavy weight and large parameter count of high-accuracy unmanned aerial vehicle aerial image detection networks, the invention studies lightweight target detection networks for unmanned aerial vehicle aerial images according to the design concepts and network structures of lightweight convolutional neural networks, and proposes three lightweight target detection networks: YOLO-L, YOLO-M and YOLO-S. In the composition of these networks, a CSPGhostNeck residual structure is designed and a transposed convolution is introduced. CSPGhostNeck is composed of CSPNet and GhostNeck and effectively reduces the number of parameters and the weight of the detection network. The transposed convolution allows the network to learn a suitable upsampling rule during training, so that the small-size feature map retains more complete feature information, feature fusion is more effective, and the detection accuracy of the network is higher. Among the three proposed lightweight networks, YOLO-L introduces the transposed convolution and has the highest detection accuracy; YOLO-S uses the CSPGhostNeck residual structure and has the smallest number of parameters and the smallest weight; and YOLO-M uses the CSPGhostNeck residual structure and the transposed convolution simultaneously, improving detection accuracy while keeping the network lightweight. The experimental results show that the three designed lightweight networks perform well in the target detection task for unmanned aerial vehicle aerial images.
Referring to FIG. 13, an embodiment of the present invention further provides an image target detection system, comprising a feature extraction module 1, a feature fusion module 2, a prediction module 3 and a post-processing module 4. The feature extraction module 1 is used for extracting features from an input image to obtain three feature maps; the feature fusion module 2 is used for performing feature fusion on all the feature maps to obtain a fused image; the prediction module 3 is used for predicting the fused image to obtain a predicted image; and the post-processing module 4 is used for post-processing the predicted image to obtain the final detection result of the input image.
in one embodiment, the feature extraction module 1 comprises: the down-sampling unit is used for carrying out five times of down-sampling on the input image by using a pre-constructed feature extraction network to obtain five feature maps with different sizes; and the feature retaining unit is used for retaining the last three feature maps which are respectively a first feature map, a second feature map and a third feature map.
In one embodiment, the feature fusion module 2 includes a first fused image generating unit, a second fused image generating unit and a third fused image generating unit. The first fused image generating unit is used for adjusting the dimensionality of the third feature map to obtain a first image, performing transposed convolution on the first image and splicing it with the second feature map, passing the result through a first residual structure and adjusting the dimensionality to obtain a second image, performing transposed convolution on the second image and splicing it with the first feature map, and passing the result through a second residual structure to obtain a first fused image. The second fused image generating unit is used for downsampling the first fused image, splicing it with the second image, and obtaining a second fused image through a third residual structure. The third fused image generating unit is used for downsampling the second fused image, splicing it with the first image, and obtaining a third fused image through a fourth residual structure.
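The data flow of the three generating units can be sketched as follows (PyTorch assumed). Plain CBS blocks stand in for the residual structures and the channel widths are illustrative; only the topology (dimension adjustment, transposed-convolution upsampling, concatenation, downsampling) follows the description.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1, s=1):
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class Neck(nn.Module):
    def __init__(self, c3=64, c4=128, c5=256):
        super().__init__()
        self.adjust5 = cbs(c5, c4)                      # adjust dimensionality -> "first image"
        self.up5 = nn.ConvTranspose2d(c4, c4, 2, 2)     # transposed-convolution upsampling
        self.res1 = cbs(2 * c4, c4, k=3)                # stand-in for the first residual structure
        self.adjust4 = cbs(c4, c3)                      # adjust dimensionality -> "second image"
        self.up4 = nn.ConvTranspose2d(c3, c3, 2, 2)
        self.res2 = cbs(2 * c3, c3, k=3)                # -> first fused image
        self.down1 = cbs(c3, c3, k=3, s=2)
        self.res3 = cbs(2 * c3, c4, k=3)                # -> second fused image
        self.down2 = cbs(c4, c4, k=3, s=2)
        self.res4 = cbs(2 * c4, c5, k=3)                # -> third fused image

    def forward(self, p3, p4, p5):
        first = self.adjust5(p5)
        x = self.res1(torch.cat([self.up5(first), p4], 1))
        second = self.adjust4(x)
        fused1 = self.res2(torch.cat([self.up4(second), p3], 1))
        fused2 = self.res3(torch.cat([self.down1(fused1), second], 1))
        fused3 = self.res4(torch.cat([self.down2(fused2), first], 1))
        return fused1, fused2, fused3
```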
In one embodiment, the prediction module 3 performs convolution and matrix rearrangement operations on the first fused image, the second fused image and the third fused image in three prediction heads respectively, adjusting the feature dimensionality of each fused image to a uniform value to obtain three predicted images.
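A sketch of one such prediction head, assuming PyTorch and illustrative sizes: a 1×1 convolution adjusts the channel dimension to a uniform value (anchors × (5 + classes)), and the tensor is then rearranged into per-anchor predictions.

```python
import torch
import torch.nn as nn

num_classes, num_anchors = 10, 3          # e.g. VisDrone2021-DET has 10 classes
out_ch = num_anchors * (5 + num_classes)  # 4 box values + 1 confidence + class scores

head = nn.Conv2d(128, out_ch, kernel_size=1)          # convolution step

fused = torch.randn(1, 128, 40, 40)                   # one fused image
p = head(fused)                                        # (1, 45, 40, 40)
p = p.view(1, num_anchors, 5 + num_classes, 40, 40)    # matrix rearrangement
p = p.permute(0, 1, 3, 4, 2).contiguous()              # (1, 3, 40, 40, 15)
print(p.shape)
```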
the prediction module 3 includes: the device comprises a threshold setting unit, a cross-comparison comparing unit, a priori frame screening unit and a summarizing unit; the threshold setting unit is used for setting a confidence threshold and removing prior frames with confidence degrees smaller than the threshold in the three predicted images; the cross comparison unit is used for setting IoU threshold values by using an NMS algorithm and respectively comparing cross comparisons of prior frames and real frames in the three predicted images; the prior frame screening unit is used for screening out a prior frame with the highest numerical value from the prior frames with the intersection-sum ratio higher than the IoU threshold; the summarizing unit is used for summarizing all the prior frames screened out by the 3 prediction heads to obtain a final detection result of the input image.
In one embodiment, the first, second, third and fourth residual structures each use the same residual structure, the residual structure comprising a first channel structure and a second channel structure:
the first channel structure is a Conv structure of 1*1;
the second channel structure includes: a Ghost CBS structure, a Ghost neck structure, a Concat structure and a 1 × 1cbs structure;
the structure of GhostNeck includes: a Ghost CBS structure, a SEnet structure, a Concat structure and a Ghost CBS structure, wherein the input of the Ghost CBS structure is also one input of the Concat structure;
the SEnet structure includes: a Flatten structure, a 1 × 1conv structure, a Multiply structure, an input of the Flatten structure, also an input of the Multiply structure;
the Ghost CBS structure comprises: 1 × 1cbs structure, 5 × 5dwcbs structure, concat structure, the output of 1 × 1cbs structure being one input of Concat structure;
the Ghost CBS structures of the first channel structure and the second channel structure have the same input, and the output of the first channel structure is the input of the Concat structure of the second channel structure;
the image processing method comprises the steps of obtaining a 1 × 1Conv structure, splicing images, reducing dimensionality of the images, and restoring dimensionality of the images, wherein the 1 × 1Conv structure is used for conducting 1*1 convolution on the images, the Concat structure is used for splicing the images, the Ghost CBS structure is used for reducing dimensionality of the images, and the 1 × 1CBS structure is used for restoring dimensionality of the images.
When the network architecture of YOLO-M is used, the CBS structure is used to adjust the dimensionality in the generation of the first fused image; when the network architecture of YOLO-L is used, the Ghost CBS structure is used to adjust the dimensionality; and when the network architecture of YOLO-S is used, the transposed convolution is replaced by an upsampling operation of nearest-neighbor interpolation in the generation of the first fused image.
In one embodiment, a method of processing an image using the residual structure includes: performing a 1×1 convolution on the input image in the first channel and reducing the dimensionality of the image to half of the original to obtain a first image; and reducing the dimensionality of the input image and weighting its dimensions in the second channel, splicing the result with the first image and fusing the features to obtain a second image whose width, height and dimensionality are unchanged compared with the input image.
In one embodiment, reducing the dimensionality of the input image and weighting its dimensions in the second channel includes: performing dimensionality reduction on the input image to obtain a low-dimensional image; and performing dimension weighting on the low-dimensional image.
In one embodiment, the dimensionality reduction of the input image includes: reducing the dimensionality of the input image to half of the input dimensionality to obtain a dimension-reduced image.
In one embodiment, dimensionally weighting the low-dimensional image includes: flattening the low-dimensional image; performing a 1×1 convolution on the flattened low-dimensional image to reduce the dimensionality; activating the low-dimensional image with a SiLU function; restoring the low-dimensional image with a 1×1 convolution; applying a Sigmoid activation to obtain the dimension weights; and multiplying the input image by the dimension weights to obtain a dimension-weighted output image.
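A minimal sketch of this dimension-weighting branch (PyTorch assumed). The "flatten" step is interpreted here as global average pooling of each channel to a single value, and the reduction ratio of 4 is an assumption.

```python
import torch
import torch.nn as nn

class DimensionWeighting(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.flatten = nn.AdaptiveAvgPool2d(1)                        # each channel -> one value
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)   # 1x1 conv, reduce dims
        self.act = nn.SiLU()                                          # SiLU activation
        self.restore = nn.Conv2d(channels // reduction, channels, 1)  # 1x1 conv, restore dims
        self.gate = nn.Sigmoid()                                      # dimension weights in (0, 1)

    def forward(self, x):
        w = self.gate(self.restore(self.act(self.reduce(self.flatten(x)))))
        return x * w                                                  # weight the input per dimension
```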
An embodiment of the present application provides an electronic device; referring to FIG. 14, it includes: a memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602, which, when executed by the processor 602, implements the image target detection method described in the foregoing.
Further, the electronic device further includes: at least one input device 603 and at least one output device 604.
The memory 601, the processor 602, the input device 603, and the output device 604 are connected by a bus 605.
The input device 603 may be a camera, a touch panel, a physical button, a mouse, or the like. The output device 604 may be embodied as a display screen.
The memory 601 may be a high-speed random access memory (RAM) or a non-volatile memory, such as a disk memory. The memory 601 is used for storing a set of executable program code, and the processor 602 is coupled to the memory 601.
Further, an embodiment of the present application further provides a computer-readable storage medium, which may be disposed in the electronic device in the foregoing embodiments, and the computer-readable storage medium may be the memory 601 in the foregoing. The computer-readable storage medium has stored thereon a computer program which, when executed by the processor 602, implements the image object detection method described in the foregoing embodiments.
Further, the computer-readable storage medium may be various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present invention is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no acts or modules are necessarily required of the invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In view of the above description of the image object detection method, system, electronic device and storage medium provided by the present invention, those skilled in the art will recognize that there may be variations in the embodiments and applications of the concepts according to the embodiments of the present invention.

Claims (10)

1. An image object detection method, comprising:
carrying out feature extraction on an input image to obtain three feature maps;
performing feature fusion on all feature maps to obtain a fusion image;
predicting the fused image to obtain a predicted image;
post-processing the predicted image to obtain a final detection result of the input image;
the method comprises the following steps of performing feature extraction on an input image to obtain three feature maps:
and performing five times of downsampling on the input image by using a pre-constructed feature extraction network to obtain five feature maps with different sizes, and reserving the last three feature maps which are respectively a first feature map, a second feature map and a third feature map.
2. The image object detection method according to claim 1,
performing feature fusion on all feature maps to obtain a fused image, wherein the feature fusion comprises the following steps:
adjusting dimensionality of the third feature map to obtain a first image, performing transposition convolution on the first image, splicing the first image with the second feature map, performing a first residual error structure, adjusting dimensionality to obtain a second image, performing transposition convolution on the second image, splicing the second image with the first feature map, and performing a second residual error structure to obtain a first fusion image;
the first fused image is downsampled, is spliced with a second image, and passes through a third residual error structure to obtain a second fused image;
and downsampling the second fused image, splicing the second fused image with the first image, and obtaining a third fused image through a fourth residual error structure.
3. The image object detecting method according to claim 2,
predicting the fused image to obtain a predicted image comprises the following steps:
performing convolution and matrix rearrangement operations on the first fusion image, the second fusion image and the third fusion image in the three prediction heads respectively, and adjusting the characteristic dimensionality of each fusion image to a uniform numerical value to obtain three prediction images;
post-processing the predicted image to obtain a final detection result of the input image comprises:
setting a confidence threshold, and removing prior frames with confidence degrees smaller than the threshold in the three predicted images;
setting IoU threshold values by using NMS algorithm, and respectively comparing the intersection ratio of the prior frame and the real frame in the three predicted images;
screening out a prior frame with the highest numerical value from the prior frames with the intersection ratio higher than the IoU threshold;
and summarizing all prior frames screened out by the 3 prediction heads to obtain a final detection result of the input image.
4. The image object detecting method according to claim 2,
the first residual structure, the second residual structure, the third residual structure and the fourth residual structure all use the same residual structure, and the residual structures include a first channel structure and a second channel structure:
the first channel structure is a 1×1 Conv structure;
the second channel structure includes: a Ghost CBS structure, a GhostNeck structure, a Concat structure and a 1×1 CBS structure;
the GhostNeck structure includes: a Ghost CBS structure, a SENet structure, a Concat structure and a Ghost CBS structure, wherein the input of the Ghost CBS structure is also one input of the Concat structure;
the SENet structure comprises: a Flatten structure, a 1×1 Conv structure and a Multiply structure, wherein the input of the Flatten structure is also an input of the Multiply structure;
the Ghost CBS structure comprises: a 1×1 CBS structure, a 5×5 DWCBS structure and a Concat structure, wherein the output of the 1×1 CBS structure is one input of the Concat structure;
the Ghost CBS structures of the first channel structure and the second channel structure have the same input, and the output of the first channel structure is the input of the Concat structure of the second channel structure;
the 1×1 Conv structure is used for performing a 1×1 convolution on the image, the Concat structure is used for splicing images, the Ghost CBS structure is used for reducing the dimensionality of the image, and the 1×1 CBS structure is used for restoring the dimensionality of the image.
5. The image object detection method according to claim 4,
in the generation process of the first fused image, a network architecture of YOLO-M is used, and the CBS structure is used for adjusting the dimensionality; or,
in the generation process of the first fused image, a network architecture of YOLO-L is used, and the Ghost CBS structure is used for adjusting the dimensionality; or,
in the generation process of the first fusion image, the network architecture of YOLO-S is used, and the operation of the transposition convolution is replaced by the up-sampling operation of nearest neighbor interpolation.
6. The image object detecting method according to claim 2,
the method of processing an image using a residual structure includes:
performing a 1×1 convolution on an input image in a first channel, and reducing the dimensionality of the image to half of the original dimensionality to obtain a first image;
reducing dimensionality and dimensionality weighting of the input image in a second channel, splicing the input image with the first image, and fusing features to obtain a second image which is not changed in width and height and dimensionality compared with the input image;
reducing dimensions and weighting dimensions of the input image in a second channel comprises:
carrying out dimensionality reduction processing on an input image to obtain a low-dimensional image;
the low-dimensional image is dimension weighted.
7. The image object detecting method according to claim 6,
the dimensionality reduction processing of the input image includes:
reducing the dimension of the input image to be half of the input dimension to obtain a dimension-reduced image;
dimensionally weighting the low-dimensional image includes:
flattening the low-dimensional image;
performing a 1×1 convolution on the flattened low-dimensional image to reduce the dimensionality;
activating a low-dimensional image by using a SiLU function;
restoring the low-dimensional image using a 1×1 convolution;
using Sigmoid activation to obtain a dimension weight;
and multiplying the input image by the dimension weight to obtain a dimension-weighted output image.
8. An image object detection system, comprising:
the characteristic extraction module is used for extracting the characteristics of the input image to obtain three characteristic graphs;
the characteristic fusion module is used for carrying out characteristic fusion on all the characteristic graphs to obtain a fusion image;
the prediction module is used for predicting the fused image to obtain a predicted image;
the post-processing module is used for post-processing the predicted image to obtain a final detection result of the input image;
the feature extraction module includes: the down-sampling unit is used for carrying out five times of down-sampling on the input image by using a pre-constructed feature extraction network to obtain five feature maps with different sizes; and the feature retaining unit is used for retaining the last three feature maps, wherein the last three feature maps are respectively a first feature map, a second feature map and a third feature map.
9. An electronic device, comprising: memory, a processor, on which a computer program is stored that is executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202211087762.0A 2022-09-07 2022-09-07 Image target detection method, system, electronic device and storage medium Pending CN115471721A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211087762.0A CN115471721A (en) 2022-09-07 2022-09-07 Image target detection method, system, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211087762.0A CN115471721A (en) 2022-09-07 2022-09-07 Image target detection method, system, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN115471721A true CN115471721A (en) 2022-12-13

Family

ID=84368837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211087762.0A Pending CN115471721A (en) 2022-09-07 2022-09-07 Image target detection method, system, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN115471721A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237993A (en) * 2023-11-10 2023-12-15 四川泓宝润业工程技术有限公司 Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment
CN117237993B (en) * 2023-11-10 2024-01-26 四川泓宝润业工程技术有限公司 Method and device for detecting operation site illegal behaviors, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN111462126B (en) Semantic image segmentation method and system based on edge enhancement
CN111310773B (en) Efficient license plate positioning method of convolutional neural network
CN112396607B (en) Deformable convolution fusion enhanced street view image semantic segmentation method
CN112101221B (en) Method for real-time detection and identification of traffic signal lamp
CN111612008B (en) Image segmentation method based on convolution network
CN112766188B (en) Small target pedestrian detection method based on improved YOLO algorithm
CN111932553A (en) Remote sensing image semantic segmentation method based on area description self-attention mechanism
Arani et al. Rgpnet: A real-time general purpose semantic segmentation
CN113344806A (en) Image defogging method and system based on global feature fusion attention network
CN111652081B (en) Video semantic segmentation method based on optical flow feature fusion
CN113420607A (en) Multi-scale target detection and identification method for unmanned aerial vehicle
CN112101153B (en) Remote sensing target detection method based on receptive field module and multiple characteristic pyramids
CN113066089A (en) Real-time image semantic segmentation network based on attention guide mechanism
Wu et al. Vehicle detection based on adaptive multi-modal feature fusion and cross-modal vehicle index using RGB-T images
CN115471721A (en) Image target detection method, system, electronic device and storage medium
Van Quyen et al. Feature pyramid network with multi-scale prediction fusion for real-time semantic segmentation
Shen et al. An improved UAV target detection algorithm based on ASFF-YOLOv5s
Ding et al. Object detection method based on lightweight YOLOv4 and attention mechanism in security scenes
Yang et al. A modified YOLOv5 for object detection in UAV-captured scenarios
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
Zhang et al. Boosting transferability of physical attack against detectors by redistributing separable attention
CN116912485A (en) Scene semantic segmentation method based on feature fusion of thermal image and visible light image
CN116778169A (en) Remote sensing image semantic segmentation method, device and equipment based on mixed feature extraction
Qin et al. Dense sampling and detail enhancement network: Improved small object detection based on dense sampling and detail enhancement
CN116310323A (en) Aircraft target instance segmentation method, system and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination