CN110796640A - Small target defect detection method and device, electronic equipment and storage medium

Small target defect detection method and device, electronic equipment and storage medium

Info

Publication number
CN110796640A
CN110796640A (application CN201910947203.4A)
Authority
CN
China
Prior art keywords
level feature
layer
feature map
fusion
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910947203.4A
Other languages
Chinese (zh)
Inventor
徐明亮
吕培
崔丽莎
姜晓恒
张晨民
闫杰
李丙涛
王明纲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Jinhui Computer System Engineering Co Ltd
Original Assignee
Zhengzhou Jinhui Computer System Engineering Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Jinhui Computer System Engineering Co Ltd filed Critical Zhengzhou Jinhui Computer System Engineering Co Ltd
Priority to CN201910947203.4A
Publication of CN110796640A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/84Systems specially adapted for particular applications
    • G01N21/88Investigating the presence of flaws or contamination
    • G01N21/8851Scan or image signal processing specially adapted therefor, e.g. for scan signal adjustment, for detecting different kinds of defects, for compensating for structures, markings, edges
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/84Systems specially adapted for particular applications
    • G01N21/88Investigating the presence of flaws or contamination
    • G01N21/8851Scan or image signal processing specially adapted therefor, e.g. for scan signal adjustment, for detecting different kinds of defects, for compensating for structures, markings, edges
    • G01N2021/8854Grading and classifying of flaws
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01NINVESTIGATING OR ANALYSING MATERIALS BY DETERMINING THEIR CHEMICAL OR PHYSICAL PROPERTIES
    • G01N21/00Investigating or analysing materials by the use of optical means, i.e. using sub-millimetre waves, infrared, visible or ultraviolet light
    • G01N21/84Systems specially adapted for particular applications
    • G01N21/88Investigating the presence of flaws or contamination
    • G01N21/8851Scan or image signal processing specially adapted therefor, e.g. for scan signal adjustment, for detecting different kinds of defects, for compensating for structures, markings, edges
    • G01N2021/8887Scan or image signal processing specially adapted therefor, e.g. for scan signal adjustment, for detecting different kinds of defects, for compensating for structures, markings, edges based on image processing techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30108Industrial image inspection

Abstract

The invention relates to the technical field of small target defect detection, and in particular to a small target defect detection method and device, an electronic device, and a storage medium. The detection method comprises the following steps: forward-propagating an image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small-target information; fusing the high-level feature maps with the corresponding bottom-level feature maps by skip fusion to obtain fusion layers; predicting the class confidence and coordinate offset of a predicted bounding box on each fusion layer; and screening the predicted bounding boxes by non-maximum suppression to obtain the target prediction boxes. By fusing high-level and bottom-level feature maps and performing multi-scale prediction, embodiments of the invention make full use of the rich complementary information extracted by different feature maps in the convolutional neural network, thereby improving the accuracy of small target detection.

Description

Small target defect detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of small target defect detection, and in particular to a small target defect detection method and device, an electronic device, and a storage medium.
Background
Workpiece defect detection is an important component of industrial manufacturing and an important guarantee of workpiece value. Traditional workpiece inspection relies mainly on manual observation, which is heavily influenced by subjective judgment and is relatively inefficient. In particular, some tiny defects cannot be detected online accurately by the human eye.
In recent years, with the rapid development of powerful convolutional neural networks, the performance of target detection has also improved greatly. Current target detection methods fall mainly into two classes: one-stage detectors and two-stage detectors. Two-stage detectors (such as Faster RCNN, FPN, and Mask RCNN) comprise two steps, candidate box extraction followed by classification and localization; they achieve high detection accuracy but cannot meet real-time requirements. To increase detection speed, one-stage detectors (such as SSD and the YOLO series) predict bounding boxes directly on the feature maps and perform classification and position regression.
In practice, the inventors found that the above prior art has the following disadvantage:
defects smaller than 32 × 32 pixels are generally referred to as small target defects. Because a small target occupies few pixels and carries little information in the image, the above methods achieve low accuracy when used to detect small target defects.
Disclosure of Invention
To solve the above technical problems, an object of the present invention is to provide a small target defect detection method and device, an electronic device, and a storage medium. The adopted technical scheme is as follows:
In a first aspect, a small target defect detection method includes the following steps:
forward-propagating an image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small-target information;
fusing the plurality of high-level feature maps with the corresponding bottom-level feature maps by skip fusion to obtain fusion layers;
predicting the class confidence and coordinate offset of a predicted bounding box on each fusion layer;
and screening the predicted bounding boxes by non-maximum suppression to obtain the target prediction boxes.
Further, fusing a high-level feature map with the corresponding bottom-level feature map to obtain a fusion layer comprises the following steps:
up-sampling the high-level feature map by three deconvolution operations to obtain a deconvolution image, each deconvolution operation comprising a deconvolution layer, a convolution layer, and an activation layer;
in parallel with the deconvolution operations, passing the bottom-level feature map through a convolution layer and an activation layer to obtain a convolution image, the deconvolution image and the convolution image having the same length, width, and number of channels;
and fusing the deconvolution image with the convolution image and passing the result through a convolution layer and an activation layer to obtain the fusion layer.
Further, the skip fusion fuses a high-level feature map with a bottom-level feature map whose resolution is N times that of the high-level feature map, where N is greater than 2.
Further, while predicting the class confidence and coordinate offset of the bounding box on each fusion layer, the prediction step also predicts the class confidence and coordinate offset of the bounding box on the plurality of high-level feature maps.
In a second aspect, a small target defect detection device comprises:
a feature extraction module, configured to forward-propagate an image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small-target information;
a fusion module, configured to fuse the high-level feature maps with the corresponding bottom-level feature maps by skip fusion to obtain fusion layers;
a bounding box prediction module, configured to predict the class confidence and coordinate offset of a predicted bounding box on each fusion layer; and
a screening module, configured to screen the predicted bounding boxes by non-maximum suppression to obtain the target prediction boxes.
Further, the fusion module includes:
a high-level feature map processing module, configured to up-sample the high-level feature map by three deconvolution operations to obtain a deconvolution image, each deconvolution operation comprising a deconvolution layer, a convolution layer, and an activation layer;
a bottom-level feature map processing module, configured to pass the bottom-level feature map through a convolution layer and an activation layer, in parallel with the deconvolution operations, to obtain a convolution image, the deconvolution image and the convolution image having the same length, width, and number of channels;
and a sub-fusion module, configured to fuse the deconvolution image with the convolution image and pass the result through a convolution layer and an activation layer to obtain the fusion layer.
Further, the skip fusion adopted by the fusion module fuses a high-level feature map with a bottom-level feature map whose resolution is N times that of the high-level feature map, where N is greater than 2.
Further, the bounding box prediction module further comprises a sub-module for predicting the class confidence and coordinate offset of the bounding box on the plurality of high-level feature maps.
In a third aspect, an electronic device includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform any of the small target defect detection methods described above.
In a fourth aspect, a storage medium having computer-readable program instructions stored therein is provided, wherein the program instructions, when executed by a processor, implement any of the above-mentioned small target defect detection methods.
The invention has the following beneficial effects:
the embodiment of the invention provides a method for detecting small target defects, which utilizes a convolutional neural network to forward propagate an image to be detected and extract characteristics to obtain a plurality of high-level characteristic diagrams and bottom-level characteristic diagrams which store small target information; fusing each high-level feature map and the corresponding bottom-level feature map in a jump fusion mode to obtain a fusion layer; simultaneously predicting the category confidence coefficient and the coordinate offset of the bounding box on the fusion layer characteristic graph and the high-layer characteristic graph; and screening the boundary box by a non-maximum value inhibition method to obtain a target prediction box. According to the embodiment of the invention, various abundant complementary information extracted by different feature maps in the convolutional neural network can be fully utilized by fusing the high-level feature map and the bottom-level feature map and performing multi-scale prediction, so that the precision of small target detection is improved.
Drawings
FIG. 1 is a flowchart of a method for detecting a small target defect according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating the effect of feature maps of different resolutions on the detection accuracy of small objects according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall network architecture of another embodiment of the present invention;
FIG. 4 is a flowchart of a method for fusion, according to an embodiment of the present invention;
FIG. 5 is a network structure diagram of a convergence module according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating the generation of a prediction bounding box according to an embodiment of the present invention;
FIG. 7 is a diagram of the detection results on a data set DAGM according to an embodiment of the present invention;
FIG. 8 is a graph of the detection result of the embodiment of the present invention on the Magnetic-Tile data set;
FIG. 9 is a block diagram of an apparatus for detecting small target defects according to another embodiment of the present invention;
fig. 10 is a block diagram of an electronic device according to another embodiment of the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve its intended objects and their effects, the small target defect detection method, device, electronic device, and storage medium of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The following describes specific schemes of a method, an apparatus, an electronic device and a storage medium for detecting small target defects provided by the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, which shows a flowchart of a small target defect detection method according to an embodiment of the present invention: to improve detection accuracy, the method fuses high-level and bottom-level feature maps with a large resolution difference by skip fusion, extracting rich complementary information about small targets. Specifically, the detection method comprises the following steps:
and S001, forward propagation is carried out on the image to be detected by utilizing a convolutional neural network to extract features, so that a plurality of high-level feature maps and bottom-level feature maps in which small target information is stored are obtained.
The bottom-level feature maps have high resolution and contain rich detail information, such as edge and texture information. In the high-level feature maps, small-target information diminishes and eventually disappears as the number of down-sampling operations increases. Specifically, in the embodiment of the present invention, features are extracted by forward propagation through an SSD-based convolutional neural network.
Referring to fig. 2, a schematic diagram of the effect of feature maps of different resolutions on small target detection accuracy according to an embodiment of the present invention, and taking the forward propagation of the SSD512 and SSD300 models as an example: as the feature maps become smaller, detection accuracy on a dataset of weak, small workpiece defects drops sharply, and the response to small targets is lost entirely at conv11 and conv12, where the mAP value is 0. The bottom-level feature maps, with their rich detail information, are therefore better suited to detecting small targets; the high-level feature maps, having been down-sampled many times so that small-target information steadily diminishes until it disappears, are better suited to detecting medium and large targets.
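The vanishing of small-target responses described above can be made concrete with a short sketch (ours, not the patent's): a 32 × 32-pixel defect covers fewer and fewer feature-map cells after each down-sampling, falling to a single cell around the fifth reduction.

```python
# Sketch (assumption: each down-sampling halves the spatial resolution, as
# in common SSD-style backbones) of how many feature-map cells a small
# defect still covers after repeated stride-2 reductions.

def defect_extent_in_cells(defect_px: int, num_downsamples: int) -> float:
    """Side length, in feature-map cells, covered by a square defect of
    `defect_px` pixels after `num_downsamples` stride-2 down-samplings."""
    return defect_px / (2 ** num_downsamples)

for k in range(7):
    print(f"after {k} down-samplings: {defect_extent_in_cells(32, k):g} cells")
```

A 32-pixel defect spans one cell after five reductions and less than one cell thereafter, which is consistent with the zero mAP observed at conv11 and conv12.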
Step S002: fuse the plurality of high-level feature maps with the corresponding bottom-level feature maps by skip fusion to obtain fusion layers.
Because adjacent feature maps differ little and extract similar features, the embodiment of the invention adopts skip fusion: the fused high-level and bottom-level feature maps differ greatly in resolution and therefore carry rich complementary information, so fusing them improves small target detection accuracy.
The fused high-level feature map and bottom-level feature map have the same spatial resolution and number of channels.
Step S003: predict the class confidence and coordinate offset of the bounding boxes on both the fusion-layer feature maps and the high-level feature maps.
A fusion layer carries both the detail information of the bottom-level feature map and the semantic information of the high-level feature map, and is therefore responsible for detecting small targets. Small targets are thus detected by predicting the class confidence and coordinate offset of the bounding boxes on each fusion layer.
Step S004: screen the bounding boxes by non-maximum suppression to obtain the target prediction boxes.
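Step S004 can be sketched as follows. This is a minimal pure-Python rendering of standard non-maximum suppression, not the patent's implementation, and the 0.5 overlap threshold is our assumption; the patent does not specify one.

```python
# Greedy NMS: keep the highest-confidence box, drop remaining boxes whose
# IOU with it exceeds the threshold, and repeat on what is left.

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, of two heavily overlapping boxes with scores 0.9 and 0.8, only the first survives, while a distant third box is kept.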
In summary, the embodiment of the present invention provides a small target defect detection method that forward-propagates the image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small-target information; fuses the plurality of high-level feature maps with the corresponding bottom-level feature maps by skip fusion to obtain fusion layers; predicts the class confidence and coordinate offset of the bounding boxes on both the fusion-layer and high-level feature maps; and screens the bounding boxes by non-maximum suppression to obtain the target prediction boxes. By fusing high-level and bottom-level feature maps and performing multi-scale prediction, the method makes full use of the rich complementary information extracted by different feature maps in the convolutional neural network, thereby improving the accuracy of small target detection.
As a preferred embodiment of the present invention, since the high-level feature maps are responsible for detecting medium and large targets, and in order to detect small, medium, and large targets simultaneously, the embodiment of the invention predicts the class confidence and coordinate offset of the bounding box on each high-level feature map at the same time as predicting them on each fusion layer in step S003.
Referring to fig. 3, which shows a schematic diagram of the overall network structure according to another embodiment of the present invention, take a 300 × 300 image input to the convolutional neural network as an example. Features are extracted from the image through Conv1 to Conv11, where the feature maps extracted by Conv1 to Conv7 are bottom-level feature maps and those extracted by Conv8 to Conv11 are high-level feature maps. Specifically, the width × height × channel dimensions of the feature maps are: Conv1, 300 × 300 × 64; Conv2, 150 × 150 × 128; Conv3, 75 × 75 × 256; Conv4, 38 × 38 × 512; Conv5, 19 × 19 × 512; Conv6, 19 × 19 × 1024; Conv7, 19 × 19 × 1024; Conv8, 10 × 10 × 512; Conv9, 5 × 5 × 256; Conv10, 3 × 3 × 256; and Conv11, 1 × 1 × 256.
Conv3 and Conv8 are fused into fusion layer Module1, Conv4 and Conv9 into fusion layer Module2, and Conv7 and Conv10 into fusion layer Module3. The two fused feature maps differ greatly in spatial resolution, by a factor of about 8, rather than being adjacent layers; this skip fusion extracts rich complementary information about weak, small targets, thereby improving detection accuracy.
The class confidence and coordinate offset of the bounding boxes are predicted simultaneously on the fusion-layer feature maps and the high-level feature maps. The 7 feature maps of different scales comprise 3 fusion layers (Module1, Module2, and Module3) and 4 high-level feature maps (Conv8, Conv9, Conv10, and Conv11). Target objects are predicted on these 7 feature maps. The 3 fusion layers carry both the detail information of the bottom-level feature maps and the semantic information of the high-level feature maps, and are therefore responsible for detecting small targets; the high-level feature maps are responsible for detecting medium and large targets. Bounding boxes of different scales and aspect ratios are predicted centered on each pixel of a feature map. Specifically, the detection scales of the 7 feature maps are {30, 60, 111, 162, 213, 264, 315}; the aspect ratios are {1:1, 1:2, 2:1} for the first map, {1:1, 1:2, 2:1, 1:3, 3:1} for each of the next four maps, and {1:1, 1:2, 2:1} for the last two. Each 1:1 bounding box has two different scales, so the number of predicted bounding boxes per position on the 7 feature maps is {4, 6, 6, 6, 6, 4, 4}, respectively. As shown in fig. 6, 4 bounding boxes of different scales and ratios are generated at each position of a 4 × 4 feature map, giving 4 × 4 × 4 = 64 bounding boxes of different sizes over the whole map.
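The default-box layout described above can be sketched as follows. This is our reconstruction of the usual SSD-style scheme (helper names are ours): each aspect ratio yields one box per position, and the 1:1 ratio gets an extra box at a second, interpolated scale, giving the 4 or 6 boxes per position cited in the text.

```python
from itertools import product
from math import sqrt

def default_boxes(fmap_size, scale, next_scale, ratios):
    """Generate (cx, cy, w, h) default boxes, normalised to [0, 1], for a
    square feature map of side `fmap_size`. `ratios` are aspect ratios;
    the 1:1 ratio gets an extra box at scale sqrt(scale * next_scale)."""
    boxes = []
    for i, j in product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size
        extra = sqrt(scale * next_scale)
        boxes.append((cx, cy, scale, scale))    # 1:1 at the layer scale
        boxes.append((cx, cy, extra, extra))    # second 1:1 scale
        for a in ratios:
            if a != 1.0:                        # 1:1 handled above
                boxes.append((cx, cy, scale * sqrt(a), scale / sqrt(a)))
    return boxes

# Ratios {1:1, 1:2, 2:1} give 4 boxes per position on a 4x4 map,
# hence 4 * 4 * 4 = 64 boxes in total, as in fig. 6.
print(len(default_boxes(4, 30 / 300, 60 / 300, [1.0, 0.5, 2.0])))
```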
As a preferred embodiment of the present invention, step S002 fuses each high-level feature map with the corresponding bottom-level feature map by skip fusion to obtain a fusion layer, as shown in figs. 4 and 5. The fusion method comprises the following steps:
step 401, performing up-sampling on the high-level feature map through three times of deconvolution operation to obtain a deconvolution image; the deconvolution operation includes an deconvolution layer, a convolution layer, and an activation layer.
For the two-layer feature map to be subjected to pixel-level fusion, the two-layer feature map needs to have the same spatial resolution and channel number. Taking the overall network structure shown in fig. 3 as an example, Conv8 and Conv3, Conv9 and Conv4, Conv10 and Conv7 have the same downsampling factor 8, and symmetric connection enables them to share the structure of the fusion module, reducing the computational complexity.
Referring to fig. 5 again, taking the fusion layer Module1 as an example, Conv8 performs up-sampling by 3 deconvolution operations, and the feature map becomes 2 times the original feature map after each deconvolution operation. The deconvolution operation consists of a deconvolution layer (Deconv), a convolution layer (Conv) and an activation layer (Relu), resulting in a 75 × 75 × 256 deconvolution image.
Step 402: in parallel with the deconvolution operations, pass the bottom-level feature map through a convolution layer and an activation layer to obtain a convolution image; the deconvolution image and the convolution image have the same length, width, and number of channels.
Referring again to fig. 5, the convolution image is obtained by passing Conv3 through a convolution layer of stride 1 and a normalization layer (Norm).
Step 403: fuse the deconvolution image with the convolution image and pass the result through a convolution layer and an activation layer to obtain the fusion layer.
The deconvolution image from step 401 and the convolution image from step 402 are fused at the pixel level (Eltw Sum) and then passed through a convolution layer and an activation layer, yielding fusion layer Module1 of size 75 × 75 × 256.
Module2 and Module3 are obtained in the same way, following steps 401 to 403.
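Steps 401 to 403 can be sketched numerically as follows. This is a toy numpy rendering in which nearest-neighbour repetition stands in for the learned deconvolution, convolution, and ReLU blocks, and the map sizes (10 to 80) are chosen to divide evenly; in the patent, padded deconvolutions take Conv8 from 10 × 10 to exactly 75 × 75.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling of an (H, W, C) array; a stand-in
    for the learned deconvolution layer of the fusion module."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def relu(x):
    return np.maximum(x, 0.0)

def fuse(high, low):
    """Skip fusion: up-sample the high-level map by three 2x steps (8x in
    total), then fuse with the bottom-level map by element-wise (pixel
    level) summation followed by an activation, as in steps 401-403."""
    up = high
    for _ in range(3):
        up = relu(upsample2x(up))
    assert up.shape == low.shape, "fused maps must match in H, W and C"
    return relu(up + low)

rng = np.random.default_rng(0)
high = rng.standard_normal((10, 10, 256))   # high-level map (Conv8-like)
low = rng.standard_normal((80, 80, 256))    # bottom-level map at 8x resolution
fused = fuse(high, low)
print(fused.shape)                          # (80, 80, 256)
```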
The fusion module differs slightly between network architectures. For example, for the SSD512 and SSD300 network structures using the detection method provided by the embodiment of the invention, table 1 lists the structural information of the two models, as follows:
TABLE 1 Difference in the fusion modules of the network architecture of SSD512 and that of SSD300
(table provided as an image in the original publication)
As another preferred embodiment, the invention further includes a data augmentation step. Before the image to be detected is forward-propagated through the convolutional neural network to extract features, inputs of different sizes and shapes are needed to make the model more robust, so the data are randomly sampled as follows:
first, use the entire picture;
second, sample sub-blocks whose IOU with the target object is 0.1, 0.3, 0.5, 0.7, or 0.9, with size in [0.1, 1] of the original image and aspect ratio in [1/2, 2];
third, randomly take a sub-block, and when the center of the ground-truth box lies within the sampled sub-block, keep the overlapping part.
After these sampling steps, each sampled sub-block is resized to a fixed size and horizontally flipped with a probability of 0.5.
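The sampling rules above can be sketched as follows (pure Python; the 50-try cap, the helper names, and the exact size parameterisation are our assumptions, not the patent's):

```python
import random
from math import sqrt

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def sample_patch(img_w, img_h, target_box, min_iou, rng=random, tries=50):
    """Draw sub-blocks with size in [0.1, 1] of the image and aspect ratio
    in [1/2, 2] until one reaches the required IOU with the target box."""
    for _ in range(tries):
        scale = rng.uniform(0.1, 1.0)
        ratio = rng.uniform(0.5, 2.0)
        w = min(img_w, img_w * scale * sqrt(ratio))
        h = min(img_h, img_h * scale / sqrt(ratio))
        x1 = rng.uniform(0, img_w - w)
        y1 = rng.uniform(0, img_h - h)
        patch = (x1, y1, x1 + w, y1 + h)
        if iou(patch, target_box) >= min_iou:
            return patch
    return None  # no satisfying patch found; caller falls back to full image

random.seed(0)
patch = sample_patch(300, 300, (100, 100, 140, 140), min_iou=0.1)
```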
As another preferred embodiment of the present invention, a positive/negative sample strategy is applied to the predicted bounding boxes. Because the number of negative samples in any image far exceeds the number of positive samples, and such imbalanced data can severely degrade model performance, the bounding boxes must be selected. Given an input image and the ground truth of each object, the predicted bounding box with the largest IOU for each ground-truth box is first taken as a positive sample. The IOU is calculated as:
IOU(b, g) = area(b ∩ g) / area(b ∪ g)
where b is a predicted bounding box and g is a ground-truth box. Then, among the remaining predicted bounding boxes, those whose IOU with any ground-truth box exceeds 0.5 are also taken as positive samples. The ratio of positive to negative samples is kept at 1:3 by hard negative mining, which greatly accelerates model training and improves test accuracy.
As another preferred embodiment of the present invention, to further optimize the training process, a loss function is defined after the matching ground truth (object or background) is found for each bounding box. A label is assigned to each bounding box, and the boxes are fed into the network structure for training. The overall loss function is
L = (1/N) × (L_conf + a × L_loc)
where N is the number of matched bounding boxes, L_conf is the confidence loss, L_loc is the localization loss, a is a weighting coefficient, k is the target category, b is the prediction box, g is the ground-truth box, p is the classification confidence, w and h are the width and height of the ground-truth box, and cx and cy are the center coordinates of the box.
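As a toy numeric rendering of L = (1/N)(L_conf + a·L_loc), with cross-entropy for the confidence term and a precomputed per-box localization error; a = 1 and the function names are our assumptions, since the text only names the terms:

```python
from math import log

def total_loss(conf_probs, loc_errors, a=1.0):
    """conf_probs: predicted probability of the true class per matched box.
    loc_errors: localization error (e.g. smooth-L1) per matched box.
    Returns the averaged sum of confidence and weighted localization loss."""
    n = len(conf_probs)
    if n == 0:
        return 0.0  # no matched boxes: loss is defined as zero
    l_conf = sum(-log(p) for p in conf_probs)   # cross-entropy term
    l_loc = sum(loc_errors)                     # localization term
    return (l_conf + a * l_loc) / n
```

A perfectly classified, perfectly localized box contributes zero loss, and the total shrinks as confidence in the true class grows.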
Referring to fig. 7, which shows detection results of the present invention on the DAGM defect dataset: the DAGM dataset contains 10 classes of defect samples, Class1 to Class10, with small target defects. Three pictures were selected from each defect class as detection objects and detected with the method provided by the embodiment of the invention; the results are shown in fig. 7, where the detected weak, small defects are marked by bounding boxes with the classification category and confidence displayed above each box. In addition, referring to fig. 8, which shows results on another defect dataset, Magnetic-Tile, containing 5 defect classes (blowhole, break, crack, fray, uneven): six pictures were selected from each defect class as detection objects and detected with the method provided by the embodiment of the invention; the results are shown in fig. 8, again with detected weak, small defects marked by bounding boxes and the classification category and confidence displayed above each box. The invention can thus detect defects of different sizes simultaneously and display the defect class and classification confidence.
Please refer to fig. 9, which shows a block diagram of a small target defect detection apparatus according to another embodiment of the present invention. The detection apparatus includes a feature extraction module 901, a fusion module 902, a prediction bounding box module 903 and a screening module 904. Specifically, the feature extraction module 901 is configured to forward-propagate an image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small target information; the fusion module 902 is configured to fuse each high-level feature map with the corresponding bottom-level feature map in a jump fusion manner to obtain a fusion layer; the prediction bounding box module 903 is configured to predict the class confidence and coordinate offset of the predicted bounding box on each fusion layer; and the screening module 904 is configured to screen the predicted bounding boxes by non-maximum suppression to obtain the target prediction box.
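The screening step performed by the screening module can be sketched as standard non-maximum suppression: keep the highest-scoring predicted box, discard boxes that overlap it too heavily, and repeat. This is an illustrative sketch; the (x1, y1, x2, y2) box format and the 0.5 IoU threshold are assumptions, not values fixed by the patent.

```python
import numpy as np

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept after non-maximum suppression."""
    order = scores.argsort()[::-1]        # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)                    # keep the current best box
        rest = order[1:]
        # drop every remaining box that overlaps the kept box too much
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping boxes collapse to the higher-scoring one, while a distant box survives.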
Preferably, referring to fig. 9 again, the fusion module 902 includes a high-level feature map processing module 91, a bottom-level feature map processing module 92 and a sub-fusion module 93. Specifically, the high-level feature map processing module 91 is configured to up-sample the high-level feature map through three successive deconvolution operations to obtain a deconvolution image, each deconvolution operation comprising a deconvolution layer, a convolution layer and an activation layer; the bottom-level feature map processing module 92 is configured to, in parallel with the deconvolution operations, pass the bottom-level feature map through a convolution layer and an activation layer to obtain a convolution image, the deconvolution image and the convolution image having the same length, width and number of channels; and the sub-fusion module 93 is configured to fuse the deconvolution image with the convolution image and pass the result through a convolution layer and an activation layer to obtain the fusion layer.
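The shape bookkeeping of this fusion can be sketched as follows. In the patent the up-sampling is performed by learned deconvolution layers (each followed by a convolution layer and an activation layer); in this sketch a plain nearest-neighbour repeat stands in for each learned deconvolution, and the 2x scale factor per step, the ReLU activation and the element-wise addition used for fusion are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def upsample_nearest(x, factor=2):
    """Stand-in for one learned deconvolution: repeat each pixel factor x factor times."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(high, low):
    """Fuse a high-level map (C, H, W) with a bottom-level map (C, 8H, 8W).

    Three successive 2x up-samplings bring the high-level map to the
    bottom-level resolution; both branches pass through an activation
    and are then fused element-wise."""
    up = high
    for _ in range(3):                # three deconvolution operations, as in the patent
        up = relu(upsample_nearest(up))
    assert up.shape == low.shape      # same length, width and channel count
    return relu(up + relu(low))

high = np.ones((4, 5, 5))             # C=4 channels, 5x5 high-level feature map
low = np.ones((4, 40, 40))            # matching bottom-level map at 8x resolution
fused = fuse(high, low)               # fusion layer, shape (4, 40, 40)
```

The assertion makes explicit the patent's requirement that the deconvolution image and the convolution image share the same length, width and channel count before fusion.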
Preferably, the jump fusion mode adopted by the fusion module fuses the high-level feature map with a bottom-level feature map whose resolution is N times that of the high-level feature map, where N is greater than 2.
Preferably, the predicted bounding box module further comprises a sub-bounding box prediction module for predicting class confidence and coordinate offset of the bounding box on each high-level feature map.
Referring to fig. 10, a block diagram of an electronic device according to another embodiment of the present invention is shown, including a memory 101 and a processor 102, where:
the memory 101 is used to store instructions required by the processor 102 to perform tasks.
The processor 102 is configured to execute the instructions stored in the memory 101 to: forward-propagate an image to be detected through a convolutional neural network to extract features, obtaining a plurality of high-level feature maps and bottom-level feature maps that preserve small target information; fuse each high-level feature map with the corresponding bottom-level feature map in a jump fusion manner to obtain a fusion layer; predict the class confidence and coordinate offset of the predicted bounding box on each fusion layer; and screen the predicted bounding boxes by non-maximum suppression to obtain the target prediction box.
In other embodiments, the electronic device further comprises a communication interface 103 for communicating with other devices or a communication network.
Preferably, the processor 102 is configured to execute the instructions stored in the memory 101 and, when performing detection, to carry out the small target defect detection method provided in any of the above embodiments.
The embodiment of the invention also provides a storage medium, wherein the storage medium stores a computer-readable program which, when executed, performs the small target defect detection method provided by any of the above embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for detecting defects in small objects, the method comprising the steps of:
forward propagation is carried out on an image to be detected by utilizing a convolutional neural network to extract features, and a plurality of high-level feature maps and bottom-level feature maps which store small target information are obtained;
fusing a plurality of high-level feature maps and corresponding bottom-level feature maps in a jumping fusion mode to obtain a fusion layer;
predicting a class confidence and a coordinate offset of a predicted bounding box on each of the fused layers;
and screening the prediction boundary box by a non-maximum value inhibition method to obtain a target prediction box.
2. The method for detecting the small target defect according to claim 1, wherein the method for fusing the high-level feature map and the corresponding bottom-level feature map to obtain a fused layer comprises the following steps:
performing up-sampling on the high-level feature map by three times of deconvolution operation to obtain a deconvolution image; the deconvolution operation comprises a deconvolution layer, a convolution layer and an activation layer;
carrying out deconvolution operation, and simultaneously, passing the bottom layer feature map through a convolution layer and an activation layer to obtain a convolution image; the deconvolution image and the convolution image have the same length, width and channel number;
and fusing the deconvolution image and the convolution image, and obtaining the fused layer through a convolution layer and an activation layer.
3. The method as claimed in claim 1 or 2, wherein the jump fusion is performed by fusing the high-level feature map with a bottom-level feature map whose resolution is N times that of the high-level feature map, where N is greater than 2.
4. The method for detecting the small target defect according to claim 1 or 2, wherein, while the class confidence and the coordinate offset of the predicted bounding box are predicted on each fusion layer, the class confidence and the coordinate offset of the bounding box are also predicted on the plurality of high-level feature maps.
5. A device for detecting defects in small objects, the device comprising:
the characteristic extraction module is used for carrying out forward propagation on the image to be detected by utilizing the convolutional neural network to extract characteristics so as to obtain a plurality of high-level characteristic graphs and bottom-level characteristic graphs which store small target information;
the fusion module is used for fusing the high-level feature maps and the corresponding bottom-level feature maps in a jump fusion mode to obtain a fusion layer;
a prediction bounding box module for predicting class confidence and coordinate offset of the prediction bounding box on each of the fusion layers; and
and the screening module is used for screening the prediction boundary box by a non-maximum value inhibition method to obtain a target prediction box.
6. The apparatus for detecting small target defects according to claim 5, wherein the fusion module comprises:
the high-level feature map processing module is used for performing up-sampling on the high-level feature map through three times of deconvolution operation to obtain a deconvolution image; the deconvolution operation comprises a deconvolution layer, a convolution layer and an activation layer;
the bottom layer feature map processing module is used for carrying out deconvolution operation and simultaneously enabling the bottom layer feature map to pass through a convolution layer and an activation layer to obtain a convolution image; the deconvolution image and the convolution image have the same length, width and channel number;
and the sub-fusion module is used for fusing the deconvolution image with the convolution image, and obtaining the fusion layer through a convolution layer and an activation layer.
7. The apparatus according to claim 5 or 6, wherein the fusion module employs a jump fusion mode that fuses the high-level feature map with a bottom-level feature map whose resolution is N times that of the high-level feature map, where N is greater than 2.
8. The apparatus of claim 5, wherein the prediction bounding box module further comprises a sub-bounding box prediction module for predicting the class confidence and coordinate offset of the bounding box on the plurality of high-level feature maps.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: performing the method of any one of claims 1 to 4.
10. A storage medium having computer-readable program instructions stored therein, which when executed by a processor implement the method of any one of claims 1 to 4.
CN201910947203.4A 2019-09-29 2019-09-29 Small target defect detection method and device, electronic equipment and storage medium Pending CN110796640A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910947203.4A CN110796640A (en) 2019-09-29 2019-09-29 Small target defect detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110796640A true CN110796640A (en) 2020-02-14

Family

ID=69438772

Country Status (1)

Country Link
CN (1) CN110796640A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764228A (en) * 2018-05-28 2018-11-06 嘉兴善索智能科技有限公司 Word object detection method in a kind of image
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110084292A (en) * 2019-04-18 2019-08-02 江南大学 Object detection method based on DenseNet and multi-scale feature fusion


Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401210A (en) * 2020-03-11 2020-07-10 北京航天自动控制研究所 Method for improving small target detection stability based on template frame augmentation
CN111401210B (en) * 2020-03-11 2023-08-04 北京航天自动控制研究所 Method for improving small target detection stability based on template frame augmentation
CN111401290A (en) * 2020-03-24 2020-07-10 杭州博雅鸿图视频技术有限公司 Face detection method and system and computer readable storage medium
CN111461145A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Method for detecting target based on convolutional neural network
CN112164034A (en) * 2020-09-15 2021-01-01 郑州金惠计算机系统工程有限公司 Workpiece surface defect detection method and device, electronic equipment and storage medium
CN112581470A (en) * 2020-09-15 2021-03-30 佛山中纺联检验技术服务有限公司 Small target object detection method
CN112164034B (en) * 2020-09-15 2023-04-28 郑州金惠计算机系统工程有限公司 Workpiece surface defect detection method and device, electronic equipment and storage medium
US20210357683A1 (en) * 2020-10-22 2021-11-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for determining target anchor, device and storage medium
US11915466B2 (en) * 2020-10-22 2024-02-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for determining target anchor, device and storage medium
CN112418271A (en) * 2020-10-28 2021-02-26 北京迈格威科技有限公司 Target detection method, device, system and storage medium
CN112580585A (en) * 2020-12-28 2021-03-30 深圳职业技术学院 Excavator target detection method and device based on stacked dense network
CN112800942A (en) * 2021-01-26 2021-05-14 泉州装备制造研究所 Pedestrian detection method based on self-calibration convolutional network
CN112800942B (en) * 2021-01-26 2024-02-13 泉州装备制造研究所 Pedestrian detection method based on self-calibration convolutional network
CN112906816B (en) * 2021-03-15 2021-11-09 锋睿领创(珠海)科技有限公司 Target detection method and device based on optical differential and two-channel neural network
CN112906816A (en) * 2021-03-15 2021-06-04 锋睿领创(珠海)科技有限公司 Target detection method and device based on optical differential and two-channel neural network
CN113177133A (en) * 2021-04-23 2021-07-27 深圳依时货拉拉科技有限公司 Image retrieval method, device, equipment and storage medium
CN113177133B (en) * 2021-04-23 2024-03-29 深圳依时货拉拉科技有限公司 Image retrieval method, device, equipment and storage medium
CN113469100A (en) * 2021-07-13 2021-10-01 北京航科威视光电信息技术有限公司 Method, device, equipment and medium for detecting target under complex background

Similar Documents

Publication Publication Date Title
CN110796640A (en) Small target defect detection method and device, electronic equipment and storage medium
JP6902611B2 (en) Object detection methods, neural network training methods, equipment and electronics
CN108009543B (en) License plate recognition method and device
CN110060237B (en) Fault detection method, device, equipment and system
CN109740617A (en) A kind of image detecting method and device
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN113421263A (en) Part defect detection method, device, medium and computer program product
CN112949661B (en) Detection frame self-adaptive external expansion method and device, electronic equipment and storage medium
CN111597941B (en) Target detection method for dam defect image
CN111241924A (en) Face detection and alignment method and device based on scale estimation and storage medium
CN115147745A (en) Small target detection method based on urban unmanned aerial vehicle image
CN115239642A (en) Detection method, detection device and equipment for hardware defects in power transmission line
CN115249237A (en) Defect detection method, defect detection apparatus, and computer-readable storage medium
CN113487610B (en) Herpes image recognition method and device, computer equipment and storage medium
CN116385430A (en) Machine vision flaw detection method, device, medium and equipment
CN113378969B (en) Fusion method, device, equipment and medium of target detection results
CN113743521B (en) Target detection method based on multi-scale context awareness
CN113538392B (en) Wafer detection method, wafer detection equipment and storage medium
US20240071111A1 (en) Method, apparatus and system for inspecting cell crush
CN109871903B (en) Target detection method based on end-to-end deep network and counterstudy
CN112464860A (en) Gesture recognition method and device, computer equipment and storage medium
CN116205883A (en) PCB surface defect detection method, system, electronic equipment and medium
CN113065379A (en) Image detection method and device fusing image quality and electronic equipment
CN114708214A (en) Cigarette case defect detection method, device, equipment and medium
CN115619698A (en) Method and device for detecting defects of circuit board and model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination