CN115346115A - Image target detection method, device, equipment and storage medium - Google Patents

Image target detection method, device, equipment and storage medium

Info

Publication number
CN115346115A
Authority
CN
China
Prior art keywords
feature map
decoding
feature
image
target
Prior art date
Legal status
Pending
Application number
CN202210897001.5A
Other languages
Chinese (zh)
Inventor
李岩山
刘文军
罗文寒
王磊
Current Assignee
Shenzhen University
Original Assignee
Shenzhen University
Application filed by Shenzhen University
Priority to CN202210897001.5A
Publication of CN115346115A
Status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The embodiment of the invention provides an image target detection method, an image target detection device, image target detection equipment and a storage medium, relating to the technical field of artificial intelligence. The image target detection method comprises: acquiring an image to be detected and performing feature extraction on it with an encoding module to obtain at least one encoding feature map; densely connecting the at least one encoding feature map with a decoding module to obtain a decoding feature map; performing feature pyramid transformation on the decoding feature map with a feature enhancement model to obtain an enhanced feature map; and predicting on the enhanced feature map with a detection head prediction model to obtain a target detection result. In this embodiment, the encoding and decoding modules densely connect the encoding feature maps of the image to be detected into a decoding feature map, strengthening feature extraction; the feature pyramid transformation then enhances the decoding feature map, which effectively improves the detection capability of the subsequent detection head prediction model for multi-scale targets, improves the detection performance of the SAR image target detection algorithm, and alleviates strained computing resources.

Description

Image target detection method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an image target detection method, device, equipment and storage medium.
Background
Synthetic Aperture Radar (SAR) is an active microwave remote sensing imaging radar that can observe the earth's surface in all weather conditions, so SAR plays an important role in ocean monitoring and marine traffic supervision.
In the related art, deep-learning-based target detection methods have been introduced into SAR image target detection on the basis of general deep-learning target detection technology. Although deep-learning-based SAR image target detection achieves good results, detection targets differ in size and scale in practical application scenarios, so multi-scale target detection still suffers from strained computing resources and low performance.
Disclosure of Invention
The embodiment of the invention mainly aims to provide an image target detection method, device, equipment and storage medium, which can improve the detection performance of SAR image target detection algorithms and alleviate the problem of insufficient computing capability.
In order to achieve the above object, a first aspect of an embodiment of the present invention provides an image target detection method, including:
acquiring an image to be detected and inputting the image to be detected into a trained image detection model, the image detection model comprising: a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises: an encoding module and a decoding module;
performing feature extraction on the image to be detected by using the encoding module to obtain at least one encoding feature map;
densely connecting the at least one encoding feature map by using the decoding module to obtain a decoding feature map;
performing feature pyramid transformation on the decoding feature map by using the feature enhancement model to obtain an enhanced feature map;
and predicting on the enhanced feature map by using the detection head prediction model to obtain a target detection result.
In some embodiments, densely connecting the at least one encoding feature map with the decoding module to obtain the decoding feature map comprises:
skip-connecting the at least one encoding feature map according to the dimension of the first decoding feature map to obtain a first decoding feature map;
skip-connecting the first decoding feature map and the at least one encoding feature map according to the dimension of the second decoding feature map to obtain a second decoding feature map;
and skip-connecting the second decoding feature map and the at least one encoding feature map according to the dimension of the decoding feature map to obtain the decoding feature map.
In some embodiments, the at least one encoding feature map comprises: a first encoding feature map, a second encoding feature map, a third encoding feature map and a fourth encoding feature map, and skip-connecting the at least one encoding feature map according to the dimension of the first decoding feature map to obtain the first decoding feature map comprises:
downsampling the first encoding feature map to obtain a first downsampling result;
downsampling the second encoding feature map to obtain a second downsampling result;
deconvolving the fourth encoding feature map to obtain a fourth deconvolution result;
and multiplying each weight parameter in a preset first weight parameter set by the corresponding one of the first downsampling result, the third encoding feature map, the second downsampling result and the fourth deconvolution result, and summing the products to obtain the first decoding feature map.
In some embodiments, skip-connecting the first decoding feature map and the at least one encoding feature map according to the dimension of the second decoding feature map to obtain the second decoding feature map comprises:
deconvolving the first decoding feature map to obtain a first deconvolution result;
upsampling the third encoding feature map to obtain a third upsampling result;
and multiplying each weight parameter in a preset second weight parameter set by the corresponding one of the first deconvolution result, the second encoding feature map, the first downsampling result and the third upsampling result, and summing the products to obtain the second decoding feature map.
In some embodiments, skip-connecting the second decoding feature map and the at least one encoding feature map according to the dimension of the decoding feature map to obtain the decoding feature map comprises:
deconvolving the second decoding feature map to obtain a second deconvolution result;
upsampling the second encoding feature map to obtain a second upsampling result;
and multiplying each weight parameter in a preset third weight parameter set by the corresponding one of the second deconvolution result, the first encoding feature map, the second upsampling result and the third upsampling result, and summing the products to obtain the decoding feature map.
In some embodiments, performing feature pyramid transformation on the decoding feature map using the feature enhancement model to obtain the enhanced feature map comprises:
compressing the decoding feature map to obtain at least one compressed feature map;
convolving the at least one compressed feature map to obtain at least one compressed convolution feature map;
and aligning and upsampling the at least one compressed convolution feature map according to the dimension of the enhanced feature map to obtain the enhanced feature map.
In some embodiments, the detection head prediction model comprises: a target center position regression prediction module, a target center point offset regression prediction module and a target size regression prediction module, and predicting on the enhanced feature map using the detection head prediction model to obtain the target detection result comprises:
predicting on the enhanced feature map using the target center position regression prediction module to obtain a target center position output value;
predicting on the enhanced feature map using the target center point offset regression prediction module to obtain a target center point offset output value;
predicting on the enhanced feature map using the target size regression prediction module to obtain a target size output value;
and obtaining a detection target according to the comparison result of the target center position output value and a preset confidence threshold, the target detection result being the target center position output value, the target center point offset output value and the target size output value corresponding to the detection target.
In order to achieve the above object, a second aspect of an embodiment of the present invention provides an image detection apparatus, comprising:
an image acquisition unit, configured to acquire an image to be detected and input the image to be detected into a trained image detection model, the image detection model comprising: a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises: an encoding module and a decoding module;
an encoding unit, configured to perform feature extraction on the image to be detected by using the encoding module to obtain at least one encoding feature map;
a decoding unit, configured to densely connect the at least one encoding feature map by using the decoding module to obtain a decoding feature map;
a feature enhancement unit, configured to perform feature pyramid transformation on the decoding feature map by using the feature enhancement model to obtain an enhanced feature map;
and a detection unit, configured to predict on the enhanced feature map by using the detection head prediction model to obtain a target detection result.
To achieve the above object, a third aspect of the present invention provides an electronic apparatus comprising:
at least one memory;
at least one processor;
at least one program;
the programs are stored in the memory, and the processor executes at least one of the programs to implement the method of the first aspect described above.
To achieve the above object, a fourth aspect of an embodiment of the present invention provides a storage medium, which is a computer-readable storage medium storing computer-executable instructions for causing a computer to execute the method of the first aspect described above.
According to the image target detection method, apparatus, equipment and storage medium provided above, an image to be detected is acquired and input into a trained image detection model; the encoding module performs feature extraction on the image to be detected to obtain at least one encoding feature map; the decoding module densely connects the at least one encoding feature map to obtain a decoding feature map; the feature enhancement model performs feature pyramid transformation on the decoding feature map to obtain an enhanced feature map; and the detection head prediction model predicts on the enhanced feature map to obtain a target detection result. In this embodiment, the encoding and decoding modules densely connect the encoding feature maps of the image to be detected into a decoding feature map, strengthening feature extraction; the feature pyramid transformation then enhances the decoding feature map, which effectively improves the detection capability of the subsequent detection head prediction model for multi-scale targets, improves the detection performance of the SAR image target detection algorithm, and alleviates the problem of insufficient computing capability.
Drawings
Fig. 1 is a flowchart of an image target detection method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an image detection model of an image target detection method according to yet another embodiment of the present invention.
Fig. 3 is a flowchart of an image target detection method according to another embodiment of the present invention.
Fig. 4 is a flowchart of an image target detection method according to another embodiment of the present invention.
Fig. 5 is a flowchart of an image target detection method according to another embodiment of the present invention.
Fig. 6 is a flowchart of an image target detection method according to another embodiment of the present invention.
Fig. 7 is a flowchart of an image target detection method according to another embodiment of the present invention.
Fig. 8 is a flowchart of an image target detection method according to another embodiment of the present invention.
FIG. 9 is a schematic diagram of an image target detection method according to another embodiment of the present invention.
Fig. 10 is a block diagram of an image detection apparatus according to an embodiment of the present invention.
Fig. 11 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.
It is noted that although functional block divisions are provided in the device diagrams and logical orders are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the block divisions in the devices or the orders in the flowcharts.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several terms involved in the present invention are explained:
Synthetic Aperture Radar (SAR) is an active microwave remote sensing imaging radar. Owing to its all-day, all-weather characteristics, SAR imaging technology is widely applied in fields such as geological disaster monitoring, ocean military surveillance, and maritime search and rescue. With the rapid development of spaceborne SAR, many countries have developed their own SAR systems, such as Germany's TerraSAR-X, Canada's RADARSAT-2, and China's Gaofen-3.
SAR image target detection algorithm: with the development of deep learning methods, a large number of deep-learning-based SAR image target detection networks have been proposed; by virtue of their strong feature extraction capability and nonlinear mapping learning capability, they achieve results that traditional methods (such as CFAR and its variants) cannot match. At present, SAR image target detection algorithms are directly or indirectly derived from visible-light target detection methods. However, targets in SAR images have multi-scale characteristics, so a deep-learning-based target detection method must attend to targets of different sizes simultaneously. According to the principle of deep-learning-based target detection, the input image must be downsampled many times to extract higher-level semantic features with larger receptive fields; as the number of downsampling operations increases, the information of multi-scale targets becomes distributed over multiple layers of the detection network. How to ensure that the network extracts effective multi-scale target information, and how to effectively use the information at different layers for multi-scale target detection, is key to improving multi-scale detection performance, and is also the challenge that the multi-scale target characteristic of SAR images poses to deep-learning-based detection networks.
Target detection algorithm: from the perspective of the detection process, deep-learning target detection algorithms can be divided into two main categories: anchor-box-based and anchor-free target detection networks. An anchor-box-based target detection network decomposes the detection task into two stages: predicting a large number of candidate boxes, then using a classifier to judge whether each candidate box belongs to the background or a target category. An anchor-free target detection network (such as the YOLO series) treats target detection as a regression task and directly regresses the target box position and target category. Compared with anchor-box-based networks, anchor-free networks have the advantages of a simple model and fast computation.
ResNeSt model: for object detection, image segmentation and similar tasks, ResNeSt introduces a Split-Attention module on the basis of the ResNet model; the essence of Split-Attention can be understood as an attention supervision mechanism over slices. ResNeSt performs better in image classification on the ImageNet dataset, especially the ResNeSt-50 variant. For example, a model using ResNeSt-50 as the backbone (e.g., Faster-RCNN) achieves 3.08% higher mAP than the same model using ResNet-50, and a model using ResNeSt-50 as the backbone (e.g., DeepLabV3) achieves 3.02% higher mIoU than the same model using ResNet-50.
FPN (Feature Pyramid Network), a deep semantic segmentation model: a feature pyramid network builds a feature pyramid, i.e., feature maps at different scales, and uses a top-down, decoder-like network to fuse feature information from different stages. The FPN structure comprises an encoder and a decoder; the decoder uses deconvolution to restore the resolution of the feature maps, and corresponding layers of the encoder and decoder are connected through 1x1 convolution kernels.
In the related art, deep-learning-based target detection methods have been introduced into SAR image target detection on the basis of general deep-learning target detection technology. Although deep-learning-based SAR image target detection achieves good results, it still suffers from strained computing resources and low performance in practical application scenarios.
The main reasons are the following: 1) Most SAR image target detection networks adopt anchor-box-based detection; although they achieve satisfactory detection accuracy, the large number of predicted candidate boxes in the intermediate process introduces a large amount of computation irrelevant to the predicted targets. 2) Most SAR image target detection networks adopt anchor-box-based detection methods, which introduce a large amount of anchor-box-related computation and strain the model's computing resources. 3) The backbone feature extraction capability of some SAR image target detection networks is low. 4) Detection targets in SAR images differ in size with a large span; the detection performance of related-art SAR image target detection networks for multi-scale targets is poor, so they cannot complete the SAR image target detection task well.
Based on this, embodiments of the present invention provide an image target detection method, apparatus, device and storage medium, in which the encoding and decoding modules densely connect the encoding feature maps of the image to be detected into a decoding feature map, strengthening feature extraction; feature pyramid transformation then enhances the decoding feature map, which effectively improves the detection capability of the subsequent detection head prediction model for multi-scale targets, improves the detection performance of the SAR image target detection algorithm, and alleviates strained computing resources.
Embodiments of the present invention provide an image target detection method, apparatus, device and storage medium; specifically, the image target detection method in the embodiments of the present invention is described through the following embodiments.
The embodiments of the present invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use the knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the invention provides an image target detection method, and relates to the technical field of artificial intelligence, in particular to the technical field of data mining. The image target detection method provided by the embodiment of the invention can be applied to a terminal, a server side and software running in the terminal or the server side. Wherein the terminal communicates with the server via a network. The image object detection method may be executed by a terminal or a server, or by the terminal and the server in cooperation.
In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, smart watch, or the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms; or it may be a service node in a blockchain system, where a Peer-to-Peer (P2P) network is formed among the service nodes, the P2P protocol being an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server may be installed with server software through which it interacts with the terminal; for example, corresponding software is installed on the server, and the software may be an application implementing the image target detection method, but is not limited to the above form. The terminal and the server may be connected through communication means such as Bluetooth, USB (Universal Serial Bus), or a network, which is not limited here.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of an image object detection method provided in an embodiment of the present invention, and the method in fig. 1 may include, but is not limited to, steps S110 to S150.
Step S110: acquiring an image to be detected and inputting the image to be detected into the trained image detection model.
In an embodiment, an image detection model is first trained with a large amount of sample image data to obtain the trained image detection model. The sample image data may be a number of images containing or not containing targets, with the corresponding detection labels being: the target, other targets, and the position information corresponding to the target.
In this embodiment, referring to fig. 2, the image detection model comprises: a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises an encoding module and a decoding module, and the detection head prediction model comprises a target center position regression prediction module, a target center point offset regression prediction module and a target size regression prediction module. In this embodiment, the image detection model is an improvement built on the ResNeSt-50 model as the basic backbone.
Step S120: performing feature extraction on the image to be detected by using the encoding module to obtain at least one encoding feature map.
In one embodiment, the ResNeSt-50 model forms the main body of the encoding module. The encoding process comprises four stages: feature extraction is performed on the image to be detected I to obtain four encoding feature maps F_encode of different scales, namely the first encoding feature map F_encode1, the second encoding feature map F_encode2, the third encoding feature map F_encode3 and the fourth encoding feature map F_encode4, with sizes 128 × 128 × 256, 64 × 64 × 512, 32 × 32 × 1024 and 16 × 16 × 2048, respectively. The encoding feature maps are expressed as:
F_encode = {F_encode1, F_encode2, F_encode3, F_encode4} = ResNeSt50(I)
For example, in one embodiment, the encoding module implements an encoding process comprising five stages, with one additional initial encoding feature map F_encode0 of size 128 × 128 × 64 on top of the four encoding feature maps above; the encoding feature maps are then expressed as:
F_encode = {F_encode0, F_encode1, F_encode2, F_encode3, F_encode4} = ResNeSt50(I)
It should be understood that the number and sizes of the encoding feature maps in the above embodiments can be set according to actual requirements; the above embodiments are only illustrative and do not limit the number or sizes of the encoding feature maps.
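For illustration, a minimal PyTorch sketch of this multi-scale extraction follows. It is an assumed stand-in, not the patent's implementation: torchvision's plain ResNet-50 substitutes for ResNeSt-50 (torchvision does not ship ResNeSt), and the stage outputs follow standard ResNet strides, so the spatial sizes differ from the sizes listed above while the channel counts (256/512/1024/2048) match.

```python
import torch
from torchvision.models import resnet50
from torchvision.models._utils import IntermediateLayerGetter

backbone = resnet50(weights=None)
# collect the four stage outputs as the four encoding feature maps
encoder = IntermediateLayerGetter(
    backbone,
    return_layers={"layer1": "F_encode1", "layer2": "F_encode2",
                   "layer3": "F_encode3", "layer4": "F_encode4"},
)

image = torch.randn(1, 3, 256, 256)   # image to be detected I (batch of one)
features = encoder(image)             # OrderedDict of the four stage outputs
for name, fmap in features.items():
    # channel counts 256/512/1024/2048 match the text; spatial sizes here
    # follow standard ResNet strides (64/32/16/8 for a 256x256 input)
    print(name, tuple(fmap.shape))
```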
The encoding module extracts image features more effectively; after obtaining the encoding feature maps output by the encoding module, the embodiment of the application decodes these features.
Step S130: densely connecting the at least one encoding feature map by using the decoding module to obtain a decoding feature map.
In one embodiment, the encoding feature maps obtained by the encoding module are high-dimensional encoding features, and the decoding module maps these high-dimensional encoding features to decoding features rich in high-level semantic information. In this embodiment, the decoding process comprises three convolution stages; a dense connection scheme derives the decoding feature maps from the at least one encoding feature map, finally decoding the high-dimensional encoding feature maps into decoding feature maps at three scales: the first decoding feature map, the second decoding feature map, and the decoding feature map.
In this embodiment, dense connection means connecting features to one another along the channel dimension to realize feature reuse; that is, all layers are connected to one another: each layer accepts all layers before it as additional input, and each layer, concatenated with all preceding layers along the channel dimension, serves as the input of the next layer. The feature maps being connected must have the same size. This lets the image detection model of the present application achieve better performance with fewer parameters and lower computation cost.
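As a generic illustration of this channel-wise dense connectivity, a DenseNet-style sketch under assumed layer sizes follows; it is not the patent's exact decoder, whose weighted-sum form is detailed in steps S131 to S133 below.

```python
# Minimal sketch of DenseNet-style dense connectivity: each layer takes the
# channel-wise concatenation of all preceding feature maps as input.
# Illustrative only; channel counts and layer count are arbitrary assumptions.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch: int, growth: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
            for i in range(n_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            # every previous feature map feeds the next layer
            feats.append(layer(torch.cat(feats, dim=1)).relu())
        return torch.cat(feats, dim=1)

block = DenseBlock(in_ch=64, growth=32, n_layers=4)
out = block(torch.randn(1, 64, 128, 128))   # -> (1, 64 + 4*32, 128, 128)
```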
Referring to fig. 3, step S130 includes, but is not limited to, steps S131 to S133.
Step S131, skip-connecting the at least one encoding feature map according to the dimension of the first decoding feature map to obtain a first decoding feature map.
In an embodiment, since the dense connection scheme requires the connected feature maps to have the same size (i.e., dimension), connection along the channel dimension is possible only when the feature maps have the same size; therefore, when computing the first decoding feature map in this embodiment, the feature map size of each layer is adjusted to the dimension of the first decoding feature map.
In one embodiment, the encoding feature maps F_encode comprise: the first encoding feature map F_encode1, the second encoding feature map F_encode2, the third encoding feature map F_encode3 and the fourth encoding feature map F_encode4, with sizes 128 × 128 × 256, 64 × 64 × 512, 32 × 32 × 1024 and 16 × 16 × 2048, respectively. The size of the first decoding feature map F1 may be 32 × 32 × 256, the size of the second decoding feature map F2 may be 64 × 64 × 128, and the size of the decoding feature map F3 may be 128 × 128 × 64.
In an embodiment, referring to fig. 4, step S131 includes, but is not limited to, steps S1311 to S1314.
In step S1311, the first encoding feature map is downsampled to obtain a first downsampling result.
In one embodiment, downsampling is performed by compressing the feature channels with one or more convolutional layers, depending on the desired dimension, and then performing a pooling operation, e.g., to half the current input size or less. In this embodiment, the dimensions of the first encoding feature map F_encode1 are 128 × 128 × 256; the first downsampling result is denoted f_down(F_encode1), and after channel compression by one or more convolutional layers its size may be 32 × 32 × 256, where f_down(·) denotes compressing the feature map channels and downsampling.
In step S1312, the second encoding feature map is downsampled to obtain a second downsampling result.
In one embodiment, the dimensions of the second encoding feature map F_encode2 are 64 × 64 × 512; the second downsampling result is denoted f_down(F_encode2), and after channel compression by one or more convolutional layers its size may be 32 × 32 × 256, where f_down(·) denotes compressing the feature map channels and downsampling.
In step S1313, the fourth encoding feature map is deconvolved to obtain a fourth deconvolution result.
In one embodiment, deconvolution maps low-dimensional features to a higher-dimensional output, the inverse of the convolution operation; the deconvolution process uses the filter from the corresponding convolution (same parameters, with the parameter matrix flipped horizontally and vertically) and multiplies in reverse. In this embodiment, the dimensions of the fourth encoding feature map F_encode4 are 16 × 16 × 2048; the fourth deconvolution result is denoted Deconv(F_encode4), and after deconvolution its size may be 32 × 32 × 256, where Deconv(·) denotes the deconvolution operation on a feature map.
In step S1314, a first decoding feature map is obtained through calculation.
In an embodiment, each weight parameter in the preset first weight parameter set S1 is multiplied by the corresponding one of the first downsampling result f_down(F_encode1), the third encoding feature map F_encode3, the second downsampling result f_down(F_encode2) and the fourth deconvolution result Deconv(F_encode4), and the products are summed to obtain the first decoding feature map. The preset first weight parameter set S1 represents the weights of the skip paths in the dense-connection computation of the first decoding feature map F1, and better parameter values can be obtained through the training process. The computation of the first decoding feature map F1 is expressed as:
F1 = Deconv(F_encode4) + χ1·f_down(F_encode1) + β1·f_down(F_encode2) + α1·F_encode3
where the preset first weight parameter set S1 is expressed as {1, α1, β1, χ1}. It should be understood that the weights in S1 correspond respectively to the first downsampling result f_down(F_encode1), the third encoding feature map F_encode3, the second downsampling result f_down(F_encode2) and the fourth deconvolution result Deconv(F_encode4); this embodiment does not specifically limit the order of the weights.
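A minimal PyTorch sketch of this F1 fusion might look as follows; f_down is read here as a 1×1 channel-compression convolution followed by average pooling, Deconv as a transposed convolution, and the weights χ1, β1, α1 as learnable scalars standing in for S1. The 1×1 projection of F_encode3 (proj3) is an added assumption to reconcile its 1024 channels with F1's 256, and all kernel and pooling sizes are illustrative.

```python
import torch
import torch.nn as nn

f_down1 = nn.Sequential(nn.Conv2d(256, 256, 1), nn.AvgPool2d(4))  # 128x128 -> 32x32
f_down2 = nn.Sequential(nn.Conv2d(512, 256, 1), nn.AvgPool2d(2))  # 64x64 -> 32x32
proj3   = nn.Conv2d(1024, 256, 1)        # assumed 1x1 projection for F_encode3
deconv4 = nn.ConvTranspose2d(2048, 256, kernel_size=2, stride=2)  # 16x16 -> 32x32

chi1, beta1, alpha1 = (nn.Parameter(torch.ones(1)) for _ in range(3))

F_e1, F_e2 = torch.randn(1, 256, 128, 128), torch.randn(1, 512, 64, 64)
F_e3, F_e4 = torch.randn(1, 1024, 32, 32), torch.randn(1, 2048, 16, 16)

# F1 = Deconv(F_encode4) + chi1*f_down(F_encode1) + beta1*f_down(F_encode2) + alpha1*F_encode3
F1 = deconv4(F_e4) + chi1 * f_down1(F_e1) + beta1 * f_down2(F_e2) + alpha1 * proj3(F_e3)
print(F1.shape)   # torch.Size([1, 256, 32, 32])
```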
Since dense connection links all layers to one another, with each layer accepting all preceding layers as additional input and the channel-wise combination serving as the input of the next layer, in one embodiment the first decoding feature map, once obtained, is used as input for computing the second decoding feature map.
Step S132, skip-connecting the first decoding feature map and the at least one encoding feature map according to the dimension of the second decoding feature map to obtain a second decoding feature map.
In one embodiment, referring to fig. 5, step S132 includes, but is not limited to, steps S1321 to S1323.
Step S1321, deconvolving the first decoding feature map to obtain a first deconvolution result.
In one embodiment, the size of the first decoding feature map F1 may be 32 × 32 × 256; the first deconvolution result is denoted Deconv(F1), and after deconvolving the first decoding feature map F1 its size may be 64 × 64 × 128, where Deconv(·) denotes the deconvolution operation on a feature map.
Step S1322, upsampling the third encoding feature map to obtain a third upsampling result.
In one embodiment, the upsampling operation uses interpolation to upsample a feature map to a specified size. The dimensions of the third encoding feature map F_encode3 are 32 × 32 × 1024; the third upsampling result is denoted f_up(F_encode3), and after upsampling its size may be 64 × 64 × 128, where f_up(·) denotes the upsampling operation on a feature map.
Step S1323, the second decoding feature map is obtained through calculation.
In one embodiment, each weight parameter in the preset second weight parameter set S2 is multiplied by the corresponding one of the first deconvolution result Deconv(F1), the second encoding feature map F_encode2, the first downsampling result f_down(F_encode1) and the third upsampling result f_up(F_encode3), and the products are summed to obtain the second decoding feature map F2, where the first downsampling result f_down(F_encode1) is computed as in step S1311.
The preset second weight parameter set S2 represents the weights of the skip paths in the dense-connection computation of the second decoding feature map F2, and better parameter values can be obtained through the training process. The computation of the second decoding feature map F2 is expressed as:
F2 = Deconv(F1) + β2·f_down(F_encode1) + α2·F_encode2 + χ2·f_up(F_encode3)
where the preset second weight parameter set S2 is expressed as {1, α2, β2, χ2}. It should be understood that the weights in S2 correspond respectively to the first deconvolution result Deconv(F1), the second encoding feature map F_encode2, the first downsampling result f_down(F_encode1) and the third upsampling result f_up(F_encode3); this embodiment does not specifically limit the order of the weights.
In one embodiment, after the second decoding feature map F2 is obtained, it is used as input to compute the decoding feature map.
Step S133, skip-connecting the second decoding feature map and the at least one encoding feature map according to the dimension of the decoding feature map to obtain the decoding feature map.
In one embodiment, referring to fig. 6, step S133 includes, but is not limited to, steps S1331 to S1333.
Step S1331, deconvolving the second decoding feature map to obtain a second deconvolution result.
In an embodiment, the size of the second decoding feature map F2 may be 64 × 64 × 128; the second deconvolution result is denoted Deconv(F2), and after deconvolving the second decoding feature map F2 its size may be 128 × 128 × 64, where Deconv(·) denotes the deconvolution operation on a feature map.
Step S1332, upsampling the second encoding feature map to obtain a second upsampling result.
In one embodiment, the dimensions of the second encoding feature map F_encode2 are 64 × 64 × 512; the second upsampling result is denoted f_up(F_encode2), and after upsampling its size may be 128 × 128 × 64, where f_up(·) denotes the upsampling operation on a feature map.
Step S1333, the decoding feature map is obtained through calculation.
In an embodiment, each weight parameter in the preset third weight parameter set S3 is multiplied by the corresponding one of the second deconvolution result Deconv(F2), the first encoding feature map F_encode1, the second upsampling result f_up(F_encode2) and the third upsampling result f_up(F_encode3), and the products are summed to obtain the decoding feature map F3, where the third upsampling result f_up(F_encode3) is computed as in step S1322.
The preset third weight parameter set S3 represents the weights of the skip paths in the dense-connection computation of the decoding feature map F3, and better parameter values can be obtained through the training process. The computation of the decoding feature map F3 is expressed as:
F3 = Deconv(F2) + α3·F_encode1 + β3·f_up(F_encode2) + χ3·f_up(F_encode3)
where the preset third weight parameter set S3 is expressed as {1, α3, β3, χ3}. It should be understood that the weights in S3 correspond respectively to the second deconvolution result Deconv(F2), the first encoding feature map F_encode1, the second upsampling result f_up(F_encode2) and the third upsampling result f_up(F_encode3); this embodiment does not specifically limit the order of the weights.
In this embodiment, the decoding process adopts the idea of dense connection: each layer is connected with all preceding layers along the channel dimension and used as the input of the next layer, computed layer by layer, so that features lost during encoding and decoding are reused. The encoding feature maps obtained in the encoding stage and the decoding feature maps obtained in the decoding stage are skip-connected: feature maps of the same dimension are added directly, feature maps of different dimensions are first adjusted to the same dimension by upsampling, downsampling or deconvolution and then added, and the decoding feature map F3 is finally obtained through this dense connection scheme.
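Chaining the three stages, a self-contained sketch of this dense-connection decoder might be structured as follows. It is an assumed reading of the equations above: bilinear interpolation stands in for f_up, convolution plus pooling for f_down, transposed convolutions for Deconv, and 1×1 projections are added wherever a summand's channel count would otherwise not match; every hyperparameter is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def f_up(x, size, proj):
    # upsample by interpolation, then a 1x1 projection (assumed) to match channels
    return proj(F.interpolate(x, size=size, mode="bilinear", align_corners=False))

class DenseDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        # one learnable scalar per skip path, standing in for S1, S2, S3
        self.w = nn.ParameterDict({k: nn.Parameter(torch.ones(1)) for k in
                                   ("a1", "b1", "x1", "a2", "b2", "x2", "a3", "b3", "x3")})
        self.down1_32 = nn.Sequential(nn.Conv2d(256, 256, 1), nn.AvgPool2d(4))
        self.down2_32 = nn.Sequential(nn.Conv2d(512, 256, 1), nn.AvgPool2d(2))
        self.proj3_256 = nn.Conv2d(1024, 256, 1)
        self.deconv4 = nn.ConvTranspose2d(2048, 256, 2, stride=2)
        self.deconv_f1 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.down1_64 = nn.Sequential(nn.Conv2d(256, 128, 1), nn.AvgPool2d(2))
        self.proj2_128 = nn.Conv2d(512, 128, 1)
        self.up3_128 = nn.Conv2d(1024, 128, 1)
        self.deconv_f2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.proj1_64 = nn.Conv2d(256, 64, 1)
        self.up2_64 = nn.Conv2d(512, 64, 1)
        self.up3_64 = nn.Conv2d(1024, 64, 1)

    def forward(self, e1, e2, e3, e4):
        w = self.w
        # F1 = Deconv(F_e4) + x1*f_down(F_e1) + b1*f_down(F_e2) + a1*F_e3
        f1 = (self.deconv4(e4) + w["x1"] * self.down1_32(e1)
              + w["b1"] * self.down2_32(e2) + w["a1"] * self.proj3_256(e3))
        # F2 = Deconv(F1) + b2*f_down(F_e1) + a2*F_e2 + x2*f_up(F_e3)
        f2 = (self.deconv_f1(f1) + w["b2"] * self.down1_64(e1)
              + w["a2"] * self.proj2_128(e2) + w["x2"] * f_up(e3, (64, 64), self.up3_128))
        # F3 = Deconv(F2) + a3*F_e1 + b3*f_up(F_e2) + x3*f_up(F_e3)
        f3 = (self.deconv_f2(f2) + w["a3"] * self.proj1_64(e1)
              + w["b3"] * f_up(e2, (128, 128), self.up2_64)
              + w["x3"] * f_up(e3, (128, 128), self.up3_64))
        return f1, f2, f3   # 32x32x256, 64x64x128, 128x128x64

dec = DenseDecoder()
f1, f2, f3 = dec(torch.randn(1, 256, 128, 128), torch.randn(1, 512, 64, 64),
                 torch.randn(1, 1024, 32, 32), torch.randn(1, 2048, 16, 16))
```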
In one embodiment, the decoding feature map F3 is feature-enhanced through the following step S140.
Step S140: performing feature pyramid transformation on the decoding feature map by using the feature enhancement model to obtain an enhanced feature map.
In an embodiment, the pyramid transformation in the feature enhancement model performs feature enhancement on the decoding feature map with an FPN-style network; the process can be described as compression-recovery: the decoding feature map is compressed into feature maps of different scales, whose size is then recovered by convolution or upsampling to obtain the enhanced feature map. By exchanging information between different layers and allowing communication between them, information propagation is further facilitated, and the resulting enhanced feature map enables better detection of salient targets.
Referring to fig. 7, step S140 includes, but is not limited to, steps S141 to S143.
Step S141, compressing the decoding feature map to obtain at least one compressed feature map.
In one embodiment, the decoding feature map F3 is convolved and further compressed to obtain three compressed feature maps of different sizes (large, medium and small), denoted the first compressed feature map F_bottom, the second compressed feature map F_middle and the third compressed feature map F_top, where the first compressed feature map F_bottom is the largest at 128 × 128 × 256, the second compressed feature map F_middle is next at 64 × 64 × 64, and the third compressed feature map F_top is the smallest at 32 × 32 × 64. The compressed feature maps are expressed as:
{F_bottom, F_middle, F_top} = Conv(F3)
where Conv(·) denotes a convolution operation.
Step S142, convolving the at least one compressed feature map to obtain at least one compressed convolution feature map.
In one embodiment, the first compressed feature map F_bottom, the second compressed feature map F_middle and the third compressed feature map F_top are each adjusted into compressed convolution feature maps of the same size by convolution or upsampling.
The specific process comprises the following steps:
For the first compressed feature map F_bottom, a convolution operation is performed to obtain the first compressed convolution feature map F_fpn1, expressed as:
F_fpn1 = Conv(F_bottom)
For the second compressed feature map F_middle, upsampling and resizing are performed first, followed by a convolution operation, to obtain the second compressed convolution feature map F_fpn2. This embodiment does not limit the number of upsampling steps; the purpose of upsampling is to make the resulting second compressed convolution feature map F_fpn2 the same size as the first compressed convolution feature map F_fpn1. For example, with an upsampling factor of 2, the second compressed convolution feature map F_fpn2 is expressed as:
F_fpn2 = Conv(Up×2(F_middle))
where Up×2(·) denotes 2× upsampling; the factor of 2 here is only an example and does not limit the upsampling factor.
For the third compressed feature map F_top, upsampling and resizing are performed first, followed by a convolution operation; the convolution result is upsampled again, and that result is convolved to obtain the third compressed convolution feature map F_fpn3. This embodiment does not limit the number of upsampling steps; the purpose of upsampling is to make the resulting third compressed convolution feature map F_fpn3 the same size as the second compressed convolution feature map F_fpn2 and the first compressed convolution feature map F_fpn1. For example, the first upsampling factor may be 2, giving a feature map the same size as the second compressed feature map F_middle; a second 2× upsampling is then performed, and convolution yields the final third compressed convolution feature map F_fpn3, expressed as:
F_fpn3 = Conv(Up×2(Conv(Up×2(F_top))))
where Up×2(·) denotes 2× upsampling; the factor of 2 here is only an example and does not limit the upsampling factor.
Step S143, aligning and upsampling the at least one compressed convolution feature map according to the dimension of the enhanced feature map to obtain the enhanced feature map.
In one embodiment, after the three compressed convolution feature maps are obtained, they are added element by element, the features are aligned by convolution, and 2× upsampling brings the compressed convolution feature maps of different scales to a uniform size for subsequent category prediction, yielding the enhanced feature map F_enhance, expressed as:
F_enhance = Up×2(Conv(F_fpn1 + F_fpn2 + F_fpn3))
The enhanced feature map F_enhance obtained through the above steps allows target detection to produce more accurate results, further addressing the poor detection performance on multi-scale targets. The targets in the enhanced feature map are detected in step S150; it should be understood that there may be one or more targets of different scales, and this embodiment does not limit the number or size of the targets.
Step S150: predicting on the enhanced feature map by using the detection head prediction model to obtain a target detection result.
In one embodiment, for target detection, the detection head prediction model adopts regression prediction and comprises: a target center position regression prediction module, a target center point offset regression prediction module and a target size regression prediction module, used respectively to predict the center positions of different targets, the center point offsets of different targets, and the sizes of different targets.
Referring to fig. 8, step S150 includes, but is not limited to, step S151 to step S154.
Step S151, predicting on the enhanced feature map using the target center position regression prediction module to obtain a target center position output value.
In an embodiment, the target center position regression prediction module mainly determines the confidence that each position in the enhanced feature map is the center point of a target, the confidence being the probability that the position is a target's center point. The target center position output value produced by the module is a heat map with a scale of 256 × 256 × 1; each pixel in the heat map takes a value between 0 and 1 representing that probability.
Step S152, predicting on the enhanced feature map using the target center point offset regression prediction module to obtain a target center point offset output value.
In one embodiment, the target center point offset regression prediction module predicts the offset distance of each position relative to the center point of the real target; since the offset distance can be represented by a coordinate pair, the target center point offset output value is a 256 × 256 × 2 matrix representing the offset coordinates at each position.
Step S153, predicting on the enhanced feature map using the target size regression prediction module to obtain a target size output value.
In one embodiment, the target size regression prediction module predicts the size of the target at each position in the horizontal and vertical directions; the target size output value is a 256 × 256 × 2 matrix representing the size values at each position in the horizontal and vertical directions.
Step S154, obtaining the target detection result.
In an embodiment, after the target center position output value, the target center point offset output value and the target size output value are obtained from the three prediction modules, the targets whose center position output values satisfy the preset confidence threshold are first selected as detection targets according to the comparison between the target center position output value and the threshold; then, for each detection target, the values at its corresponding position in the target center point offset output value and the target size output value are taken. The target detection result is thus the target center position output value, the target center point offset output value and the target size output value corresponding to the detection target.
For example, in one embodiment, the output of the detection head prediction model is:
target center position output value: [confidence z1 of target 1, confidence z2 of target 2, confidence z3 of target 3, …]; target center point offset output value: [offset distance of target 1 (x1, y1), offset distance of target 2 (x2, y2), offset distance of target 3 (x3, y3), …]; target size output value: [size of target 1 (c1, c1), size of target 2 (c2, c2), size of target 3 (c3, c4), …].
When the confidence z2 of target 2 is within the range of the preset confidence threshold, target 2 is taken as the detection target, and the target detection result is: [z2, (x2, y2), (c2, c2)].
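A minimal sketch of the three heads and this thresholding step follows, under the assumption that each head is a small convolutional stack over F_enhance and that the heat map is produced through a sigmoid; the head structure and the threshold value are illustrative:

```python
import torch
import torch.nn as nn

def head(out_ch):
    # assumed two-layer head: 3x3 conv + ReLU, then 1x1 conv to out_ch channels
    return nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(64, out_ch, 1))

heatmap_head = nn.Sequential(head(1), nn.Sigmoid())  # center confidence in [0, 1]
offset_head  = head(2)                               # center-point offset (x, y)
size_head    = head(2)                               # horizontal/vertical size

F_enhance = torch.randn(1, 64, 256, 256)
heat = heatmap_head(F_enhance)       # 1x1x256x256 heat map
offset = offset_head(F_enhance)      # 1x2x256x256 offset matrix
size_map = size_head(F_enhance)      # 1x2x256x256 size matrix

conf_thresh = 0.3                    # preset confidence threshold (assumed value)
ys, xs = torch.nonzero(heat[0, 0] > conf_thresh, as_tuple=True)
for y, x in zip(ys.tolist(), xs.tolist()):
    z = heat[0, 0, y, x].item()              # confidence of this detection
    dx, dy = offset[0, :, y, x].tolist()
    w, h = size_map[0, :, y, x].tolist()
    # target detection result for this target: [z, (dx, dy), (w, h)]
```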
Fig. 9 is a schematic diagram illustrating a principle of an image target detection method according to an embodiment of the present application.
In this embodiment, as shown in fig. 9, an image I to be detected, of size 256 × 256 × 3, is fed to the input layer, and the encoding stage follows: the encoding module encodes the input image to obtain five encoding feature maps, with the feature maps output in turn by the different layers being the initial encoding feature map F_encode0, the first encoding feature map F_encode1, the second encoding feature map F_encode2, the third encoding feature map F_encode3 and the fourth encoding feature map F_encode4.
In this embodiment, in the decoding stage, the decoding module obtains the first decoding feature map F1, the second decoding feature map F2 and the decoding feature map F3 from the encoding feature maps through dense connection. In fig. 9, dotted lines between layers indicate that the two feature maps have the same size, and connecting lines indicate the dense connection process. For example, the lines from the first encoding feature map F_encode1 to the first decoding feature map F1 and the second decoding feature map F2 represent downsampling the first encoding feature map F_encode1; the line from the second encoding feature map F_encode2 to the first decoding feature map F1 represents a downsampling operation, and the line from the second encoding feature map F_encode2 to the decoding feature map F3 represents upsampling the second encoding feature map F_encode2; the lines from the third encoding feature map F_encode3 to the second decoding feature map F2 and the decoding feature map F3 represent upsampling operations.
In this embodiment, after the decoding feature map F3 is obtained, the feature enhancement stage begins: the decoding feature map F3 is first compressed into the first compressed feature map F_bottom, the second compressed feature map F_middle and the third compressed feature map F_top, from which the first compressed convolution feature map F_fpn1, the second compressed convolution feature map F_fpn2 and the third compressed convolution feature map F_fpn3 are obtained respectively; the compressed convolution feature maps are added element by element, the features are aligned by convolution, and 2× upsampling brings the compressed convolution feature maps of different scales to a uniform size, yielding the enhanced feature map F_enhance.
In this embodiment, in the detection stage, the enhanced feature map F_enhance is detected to obtain the target center position output value (heatmap in the figure), the target center point offset output value (offset in the figure), and the target size output value (size in the figure).
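The three outputs can be produced by three parallel heads over F_enhance, sketched below; the head depth, the single-class heatmap, and the sigmoid on the confidence channel are assumptions of this sketch:

```python
# Sketch of the three detection heads applied to F_enhance: a heatmap head
# for center-position confidences, an offset head for center-point
# corrections (dx, dy), and a size head for target width/height.
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    def __init__(self, ch=128, num_classes=1):
        super().__init__()
        def head(out_ch):
            return nn.Sequential(
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(ch, out_ch, 1))
        self.heatmap = head(num_classes)  # target center position output
        self.offset = head(2)             # target center point offset output
        self.size = head(2)               # target size output

    def forward(self, f_enhance):
        return (torch.sigmoid(self.heatmap(f_enhance)),  # confidences in [0, 1]
                self.offset(f_enhance),
                self.size(f_enhance))
```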
As described above, the image to be detected is obtained and input into the trained image detection model; the encoding module performs feature extraction on the image to be detected to obtain at least one encoding feature map; the decoding module densely connects the at least one encoding feature map to obtain a decoding feature map; the feature enhancement model performs feature pyramid transformation on the decoding feature map to obtain an enhanced feature map; and the detection head prediction model predicts the enhanced feature map to obtain a target detection result. In this embodiment, densely connecting the encoding feature maps of the image to be detected through the encoding and decoding modules strengthens the feature extraction capability, and enhancing the decoding feature map through the feature pyramid transformation effectively improves the detection capability of the subsequent detection head prediction model for multi-scale targets, improves the detection performance of the SAR image target detection algorithm, and alleviates the strain on computing capability.
Depending on whether anchor boxes need to be preset during network processing, target detection networks can be divided into two categories: anchor-based and anchor-free target detection networks. The image detection model in the embodiment of the present application is an anchor-free target detection network, which, compared with anchor-based target detection networks, generalizes better, is easier to compute, and has a simpler network structure.
An embodiment of the present invention further provides an image detection apparatus, which can implement the image target detection method described above, and with reference to fig. 10, the apparatus includes:
an image obtaining unit 1010 for obtaining an image to be detected and inputting the image to be detected into a trained image detection model, the image detection model comprising: the device comprises a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises: an encoding module and a decoding module;
the encoding unit 1020 is configured to perform feature extraction on an image to be detected by using an encoding module to obtain at least one encoding feature map;
a decoding unit 1030, configured to perform dense connection on at least one encoding feature map by using a decoding module to obtain a decoding feature map;
the feature enhancing unit 1040 is configured to perform feature pyramid transformation on the decoded feature map by using a feature enhancement model to obtain an enhanced feature map;
and the detection unit 1050 is configured to predict the enhanced feature map by using the detection head prediction model to obtain a target detection result.
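For illustration, the four units can be composed end to end as below, reusing the illustrative modules sketched earlier (Encoder, DenseDecodeF1 standing in for the full decoder, FeatureEnhance, DetectionHeads); the wiring is an assumption of this sketch, not the apparatus's mandated structure:

```python
# Sketch composing the apparatus units: encoding -> dense-connection
# decoding -> feature pyramid enhancement -> detection heads.
import torch
import torch.nn as nn

class ImageDetectionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()         # encoding unit
        self.decoder = DenseDecodeF1()   # decoding unit (F1 branch only)
        self.enhance = FeatureEnhance()  # feature enhancement unit
        self.heads = DetectionHeads()    # detection unit

    def forward(self, image):
        f0, f1, f2, f3, f4 = self.encoder(image)
        decoded = self.decoder(f1, f2, f3, f4)  # dense connection
        enhanced = self.enhance(decoded)        # feature pyramid transform
        return self.heads(enhanced)             # heatmap, offset, size

heatmap, offset, size = ImageDetectionModel()(torch.randn(1, 3, 256, 256))
print(heatmap.shape, offset.shape, size.shape)
```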
The specific implementation of the image detection apparatus of this embodiment is substantially the same as the specific implementation of the image target detection method, and is not described herein again.
An embodiment of the present invention further provides an electronic device, including:
at least one memory;
at least one processor;
at least one program;
the programs are stored in a memory, and a processor executes the at least one program to implement the image object detection method of the present invention as described above. The electronic device can be any intelligent terminal including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA for short), a vehicle-mounted computer and the like.
Referring to fig. 11, fig. 11 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 1101 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present invention;
the memory 1102 may be implemented in a ROM (read only memory), a static memory device, a dynamic memory device, or a RAM (random access memory). The memory 1102 may store an operating system and other application programs, and when the technical solution provided in the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 1102, and the processor 1101 is used to call and execute the image target detection method according to the embodiments of the present disclosure;
an input/output interface 1103 for realizing information input and output;
the communication interface 1104 is configured to implement communication interaction between the device and another device, and may implement communication in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 1105 that transfers information between the various components of the device (e.g., the processor 1101, the memory 1102, the input/output interface 1103, and the communication interface 1104);
wherein the processor 1101, memory 1102, input/output interface 1103, and communication interface 1104 enable communication connections within the device with each other via bus 1105.
An embodiment of the present invention further provides a storage medium, which is a computer-readable storage medium, where computer-executable instructions are stored in the storage medium, and the computer-executable instructions are configured to enable a computer to execute the image object detection method.
According to the image target detection method, the image detection apparatus, the electronic device, and the storage medium provided by the embodiments, the image to be detected is obtained and input into the trained image detection model; the encoding module performs feature extraction on the image to be detected to obtain at least one encoding feature map; the decoding module densely connects the at least one encoding feature map to obtain a decoding feature map; the feature enhancement model performs feature pyramid transformation on the decoding feature map to obtain an enhanced feature map; and the detection head prediction model predicts the enhanced feature map to obtain a target detection result. Densely connecting the encoding feature maps through the encoding and decoding modules strengthens the feature extraction capability, and enhancing the decoding feature map through the feature pyramid transformation effectively improves the detection capability of the subsequent detection head prediction model for multi-scale targets, improves the detection performance of the SAR image target detection algorithm, and alleviates the problem of insufficient computing capability.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present invention more clearly and do not limit them. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present invention remain applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-11 are not intended to limit the embodiments of the present invention, and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the invention and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is to be understood that, in the present invention, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b and c may be single or plural.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes multiple instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing programs, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present invention have been described above with reference to the accompanying drawings, and are not intended to limit the scope of the embodiments of the invention. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present invention are intended to be within the scope of the claims of the embodiments of the present invention.

Claims (10)

1. An image object detection method, comprising:
acquiring an image to be detected and inputting the image to be detected into a trained image detection model, wherein the image detection model comprises: the device comprises a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises the following components: an encoding module and a decoding module;
utilizing the coding module to perform feature extraction on the image to be detected to obtain at least one coding feature map;
carrying out dense connection on at least one coding feature map by utilizing the decoding module to obtain a decoding feature map;
performing feature pyramid transformation on the decoding feature map by using the feature enhancement model to obtain an enhanced feature map;
and predicting the enhanced feature map by using the detection head prediction model to obtain a target detection result.
2. The image object detection method according to claim 1, wherein the performing dense connection on at least one coding feature map by using the decoding module to obtain a decoding feature map comprises:
according to the dimension of the first coding feature map, performing feature map jumping on at least one coding feature map to obtain a first decoding feature map;
according to the dimension of a second decoding feature map, performing feature map jumping on the first decoding feature map and at least one coding feature map to obtain the second decoding feature map;
and according to the dimension of the coding feature map, performing feature map jumping on the second decoding feature map and at least one coding feature map to obtain the decoding feature map.
3. The image object detection method of claim 2, wherein the at least one coding feature map comprises: a first coding feature map, a second coding feature map, a third coding feature map, and a fourth coding feature map, and the performing feature map jumping on at least one coding feature map according to the dimension of the first coding feature map to obtain the first decoding feature map comprises:
down-sampling the first coding feature map to obtain a first down-sampling result;
down-sampling the second coding feature map to obtain a second down-sampling result;
performing deconvolution on the fourth coding feature map to obtain a fourth deconvolution result;
and multiplying the first down-sampling result, the third coding feature map, the second down-sampling result, and the fourth deconvolution result by the respective weight parameters in a preset first weight parameter set, and adding the products to obtain the first decoding feature map.
4. The image object detection method according to claim 3, wherein the performing feature map jumping on the first decoding feature map and at least one coding feature map according to the dimension of the second decoding feature map to obtain the second decoding feature map comprises:
deconvolving the first decoding feature map to obtain a first deconvolution result;
performing upsampling on the third coding feature map to obtain a third upsampling result;
and multiplying each weight parameter in a preset second weight parameter set by the corresponding one of the first deconvolution result, the second coding feature map, the first down-sampling result, and the third up-sampling result, and adding the products to obtain the second decoding feature map.
5. The image object detection method according to claim 4, wherein the performing feature map jumping on the second decoding feature map and at least one coding feature map according to the dimension of the coding feature map to obtain the decoding feature map comprises:
performing deconvolution on the second decoding feature map to obtain a second deconvolution result;
performing upsampling on the second coding feature map to obtain a second upsampling result;
and multiplying each weight parameter in a preset third weight parameter set by the corresponding one of the second deconvolution result, the first coding feature map, the second up-sampling result, and the third up-sampling result, and adding the products to obtain the decoding feature map.
6. The image object detection method according to claim 1, wherein the performing feature pyramid transformation on the decoded feature map by using the feature enhancement model to obtain an enhanced feature map comprises:
compressing the decoding feature map to obtain at least one compressed feature map;
convolving the at least one compressed feature map to obtain at least one compressed convolution feature map;
and aligning and up-sampling the at least one compressed convolution feature map according to the dimension of the enhanced feature map to obtain the enhanced feature map.
7. The image object detection method according to any one of claims 1 to 6, wherein the detection head prediction model comprises: a target central position regression prediction module, a target central point offset regression prediction module, and a target size regression prediction model, and the predicting the enhanced feature map by using the detection head prediction model to obtain a target detection result comprises:
predicting the enhanced feature map by using the target central position regression prediction module to obtain a target central position output value;
predicting the enhanced feature map by using the target central point offset regression prediction module to obtain a target central point offset output value;
predicting the enhanced feature map by using the target size regression prediction model to obtain a target size output value;
and obtaining a detection target according to a comparison result of the target central position output value and a preset confidence threshold, wherein the target detection result is the target central position output value, the target central point offset output value and the target size output value corresponding to the detection target.
8. An image detection apparatus, characterized by comprising:
the image acquisition unit is used for acquiring an image to be detected and inputting the image to be detected into a trained image detection model, and the image detection model comprises: the device comprises a feature extraction model, a feature enhancement model and a detection head prediction model, wherein the feature extraction model comprises: an encoding module and a decoding module;
the encoding unit is used for performing feature extraction on the image to be detected by using the encoding module to obtain at least one encoding feature map;
the decoding unit is used for carrying out dense connection on at least one coding feature map by utilizing the decoding module to obtain a decoding feature map;
the feature enhancement unit is used for performing feature pyramid transformation on the decoding feature map by using the feature enhancement model to obtain an enhanced feature map;
and the detection unit is used for predicting the enhanced feature map by using the detection head prediction model to obtain a target detection result.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor is configured to execute the image object detection method according to any one of claims 1 to 7 in accordance with the program.
10. A computer-readable storage medium storing computer-executable instructions for performing the image object detection method of any one of claims 1 to 7.
CN202210897001.5A 2022-07-28 2022-07-28 Image target detection method, device, equipment and storage medium Pending CN115346115A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210897001.5A CN115346115A (en) 2022-07-28 2022-07-28 Image target detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115346115A true CN115346115A (en) 2022-11-15

Family

ID=83949090

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210897001.5A Pending CN115346115A (en) 2022-07-28 2022-07-28 Image target detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115346115A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115908311A (en) * 2022-11-16 2023-04-04 湖北华鑫光电有限公司 Lens forming detection equipment based on machine vision and method thereof
CN115908311B (en) * 2022-11-16 2023-10-20 湖北华鑫光电有限公司 Lens forming detection equipment and method based on machine vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination