CN111079685A - 3D target detection method - Google Patents

3D target detection method

Info

Publication number
CN111079685A
Authority
CN
China
Prior art keywords
feature
target
network
feature map
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911354155.4A
Other languages
Chinese (zh)
Other versions
CN111079685B (en)
Inventor
王正宁
吕侠
何庆东
赵德明
张翔
蓝先迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911354155.4A priority Critical patent/CN111079685B/en
Publication of CN111079685A publication Critical patent/CN111079685A/en
Application granted granted Critical
Publication of CN111079685B publication Critical patent/CN111079685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a 3D target detection method. Features are first extracted from a point cloud bird's-eye view, the target image to be detected, and a point cloud front view. 3D target suggestion frames are obtained from the bird's-eye view and the target image to be detected, and the feature maps obtained after fusion within each of the three branches (bird's-eye view, target image to be detected and front view) are then fused into a single feature map by pixel-by-pixel addition and averaging to obtain the final feature map fusion result. The 3D target suggestion frames are projected onto the finally fused feature map to form 2D target suggestion frames, the ROI region features corresponding to the 2D target suggestion frames are obtained, and finally the ROI region features are classified and regressed to obtain the final 3D target detection candidate frames. The invention effectively improves the detection and localization performance of the detection network for different targets of interest in 3D space under different environments, and alleviates the poor detection of pedestrians and vehicles caused by point cloud sparsity in road environments.

Description

3D target detection method
Technical Field
The invention belongs to the field of image processing and computer vision, and particularly relates to a 3D target detection method.
Background
With the vigorous development of artificial intelligence technology, intelligent vehicles centered on advanced driver-assistance systems (ADAS) and autonomous driving technology have become the development direction of future automobiles, and 3D target detection, as one of the key technologies, has always been a research hotspot in this field.
For 3D target detection, there are currently three main approaches.
The first is 3D target detection based on monocular RGB images, such as the method proposed by X. Chen et al. (Chen X, Kundu K, Zhang Z, et al. Monocular 3D Object Detection for Autonomous Driving [C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016), which uses object shape priors, context features and instance segmentation of the monocular image to generate 3D object suggestion frames; because only a monocular image is used, accurate depth information is inevitably missing.
The second is 3D target detection based on binocular RGB images, such as the 3DOP detection method also proposed by X. Chen et al. (Chen X, Kundu K, Zhu Y, et al. 3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017: 1-1), which generates 3D object suggestion frames by encoding object prior sizes and depth information (e.g. free space, point cloud density) into an energy function, and then regresses the 3D suggestion frames with the R-CNN method. At present only a few methods use stereo vision for 3D object detection.
The third is 3D target detection based on LiDAR point clouds. Most state-of-the-art 3D target detection methods rely on LiDAR data to provide accurate 3D information, although the way the point cloud is processed differs between detection methods. The F-PointNet method proposed by C. R. Qi et al. (C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, "Frustum PointNets for 3D Object Detection from RGB-D Data," arXiv preprint arXiv:1711.08488, 2017) takes the raw point cloud as input and then localizes the 3D object based on the 2D detection and the frustum point cloud region predicted by a PointNet network. The Vote3Deep fast detection method proposed by M. Engelcke et al. (M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner, "Vote3Deep: Fast object detection in 3D point clouds using efficient convolutional neural networks," Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 1355-1361, IEEE, 2017) quantizes the raw point cloud into a structured voxel grid and then uses 2D or 3D CNNs to detect 3D objects. The MV3D target detection method proposed by X. Chen et al. (X. Chen, H. Ma, J. Wan, B. Li, and T. Xia, "Multi-view 3D object detection network for autonomous driving," IEEE CVPR, 2017) projects the point cloud into a 2D bird's-eye view and front view and then performs convolution processing with a convolutional neural network (CNN), also combining RGB image data in the process to obtain denser information. The 3D target detection method based on point cloud data proposed in Chinese patent application No. 201811371861.5 likewise projects the point cloud data onto a 2D bird's-eye view, extracts point cloud features through an ASPP network, and generates candidate target positions in 3D space.
Disclosure of Invention
In view of the above problems, the invention provides a 3D target detection method, which is realized by adopting a deep convolutional network and specifically comprises the following steps:
(1) preparing a data set to be processed, wherein the data set comprises an original road vehicle RGB image data set and a corresponding target point cloud data set;
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network, and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the candidate positions of the target in the bird's-eye view;
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network, and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the candidate positions of targets in the original road vehicle RGB image by the two-dimensional image target detection method;
(12) feeding the positions of the target candidate frames in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network, and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
(16) fusing the feature maps FB, FG and FF obtained in steps (7), (11) and (15) by the element average method to obtain a feature map F;
(17) projecting the 3D target suggestion frame obtained in step (12) onto the feature map F to form a 2D target suggestion frame, and obtaining the ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target;
thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later performance test of the selected optimal model or for practical application, thereby realizing the 3D target detection method based on data fusion.
In the 2D target detection, an SPP network is used instead of a simple pooling operation. The SPP network is more robust than simple pooling: because it processes feature regions of different aspect ratios and sizes, it improves the scale invariance of the images, converges more easily, and improves the accuracy of the experiments. In addition, the point cloud data (the point cloud bird's-eye view and the point cloud front view) and the RGB image data undergo multi-modal fusion, and the dense texture information of the RGB images is used to compensate for the sparsity of the point cloud data during localization. First, the 2D detection results of the point cloud bird's-eye view and the RGB image are spatially matched to obtain coarse 3D target candidate frames; then the feature information of the point cloud front view is extracted and multi-view feature fusion is performed to compensate for the missing spatial information and further refine the 3D target candidate frames, thereby achieving a good detection effect. With this method, the depth information (point cloud data) of targets in the scene can be acquired more accurately by the LiDAR, so that the spatial position of the targets of interest can be roughly obtained in three-dimensional space; dense texture information provided by the RGB image is then added for multi-modal data fusion. This effectively improves the detection and localization performance of the detection network for different targets of interest in 3D space under different environments, and alleviates the poor detection of pedestrians and vehicles caused by point cloud sparsity in road environments.
Drawings
FIG. 1 is a diagram of the deep convolutional network structure of the present invention
FIG. 2 is a network structure of a two-dimensional image target detection method of the present invention
FIG. 3 is a diagram of the SPP network structure of the present invention
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The invention first provides a feature extraction method, which is used to extract features from the point cloud bird's-eye view, the point cloud front view and the target image to be detected. The feature maps extracted from each view are then fed into a 3D candidate frame generation sub-network, and 3D candidate suggestion frames are generated by a spatial matching method. Finally, a data fusion scheme is provided: the multi-modal data are fused, and the target frames are classified and regressed. The network structure of the whole invention is shown in Fig. 1.
In the deep convolutional network structure shown in Fig. 1, there are three branches from top to bottom. The input image of the first-row branch is the bird's-eye view obtained by projecting the point cloud data onto a two-dimensional plane; the input of the second-row branch is the unprocessed original road vehicle RGB image; the input of the third-row branch is the front view obtained by projecting the point cloud data onto a cylindrical plane. The inputs of the three branches are all processed by a VGG16 feature extraction network. The VGG16 feature extraction network contains 13 convolutional layers and 4 pooling layers in total, but no fully connected layers, and the 4 pooling layers divide the convolutional layers of the VGG16 feature extraction network into 5 groups: group 1 contains 2 convolutional layers whose parameter representation is Conv3-64; group 2 contains 2 convolutional layers whose parameter representation is Conv3-128; group 3 contains 3 convolutional layers whose parameter representation is Conv3-256; group 4 contains 3 convolutional layers whose parameter representation is Conv3-512; group 5 contains 3 convolutional layers whose parameter representation is Conv3-512. After each group of convolutional layers, max pooling is performed with 1 identical pooling layer, using a 2 × 2 pooling kernel with stride 2. The parameter representation Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels; Conv3-128 indicates a 3 × 3 kernel with 128 output channels; Conv3-256 indicates a 3 × 3 kernel with 256 output channels; Conv3-512 indicates a 3 × 3 kernel with 512 output channels. All convolution kernels in the VGG16 feature extraction network are uniformly set to 3 × 3 with stride 1.
Block1 in the first-row branch of the network structure shown in Fig. 1 (identical to Block1 in the second- and third-row branches) is detailed in Table 1 below; Table 1 corresponds to Block1 in Fig. 1. Block1 contains 5 feature maps in total. Each feature map except the first is obtained from the previous one through a group of convolutional layers plus a pooling layer of VGG16 (through groups 2, 3, 4 and 5, respectively); the convolutional layers of VGG16 are grouped with the maxpool layers as boundaries, so that in Table 1 group 1 has 2 convolutional layers, group 2 has 2, group 3 has 3, group 4 has 3, and group 5 also has 3. The first feature map in Block1 is obtained by passing the input image through the 1st group of convolutional layers of VGG16 plus its pooling layer. The feature map obtained after the last group of convolutional layers is first subjected to a 2-fold upsampling deconvolution operation (the deconvolution operations in the second- and third-row branches of the network in Fig. 1 are identical to that in the first-row branch, with the same parameters), where the parameters of the deconvolution layer are kernel_size = 4, padding = 1 and strides = 2, designed in advance according to the upsampling multiple. The result is then feature-fused with the feature map obtained from the conv4_3 layer of the VGG16 network (i.e. the 10th convolutional layer in Table 1; with one group of VGG16 convolutional layers bounded by each maxpool layer, conv4_3 denotes the 3rd convolutional layer of the 4th group), thereby generating the candidate positions of the target in the bird's-eye view. VGG16 is not strictly required here: the feature extraction network need not be a VGG16 network but may be another network K(2-5), where K denotes a network satisfying the requirements and (2-5) denotes the number of convolutional layers; all that is needed is that the last two extracted feature maps have a size relationship of 2 times. "The last two feature maps" means that, if image data is input, N feature maps are generated in total through the whole feature extraction network (CNN), including intermediate-result feature maps and the final output feature map (one intermediate feature map can be extracted after each group of convolutional layers), and the last two feature maps are the (N-1)-th and the N-th ones.
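As a concrete illustration of the 2-fold upsampling deconvolution and the element-wise fusion described above, a minimal PyTorch-style sketch is given below. PyTorch is an assumed framework (the patent does not prescribe an implementation), and the feature map sizes and the 512-channel width are illustrative assumptions; only the deconvolution parameters kernel_size = 4, stride = 2, padding = 1 are taken from the text.

```python
import torch
import torch.nn as nn

# Illustrative feature maps from two adjacent VGG16 stages (assumed sizes):
# the deeper map is half the spatial size of the shallower one it is fused with.
deep_feat = torch.randn(1, 512, 19, 19)      # e.g. output of the last conv group
skip_feat = torch.randn(1, 512, 38, 38)      # e.g. conv4_3 feature map, 2x larger

# 2-fold upsampling by deconvolution with the parameters stated in the text:
# kernel_size=4, stride=2, padding=1 exactly doubles height and width.
deconv = nn.ConvTranspose2d(in_channels=512, out_channels=512,
                            kernel_size=4, stride=2, padding=1)

upsampled = deconv(deep_feat)                # -> (1, 512, 38, 38)

# Fusion by element-wise summation (FB = FB1 + FB3 in the patent's notation).
fused = skip_feat + upsampled
print(fused.shape)                           # torch.Size([1, 512, 38, 38])
```

With these parameters the deconvolution output size is (H - 1) * 2 - 2 + 4 = 2H, which is why the element-wise summation with the 2-times-larger feature map is well defined.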
Similarly, the second-row branch of the network in Fig. 1 generates the candidate positions of targets in the detection image (i.e. the original road vehicle RGB image). The candidate positions generated by the first-row and second-row branches are fed together into a 3D candidate frame generation sub-network, and spatial matching is performed to generate the target suggestion frame in 3D space; this is the function of the 3D Proposal module in Fig. 1.
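The patent does not spell out the spatial matching criterion used by the 3D candidate frame generation sub-network. The sketch below is only one plausible reading, assuming that bird's-eye-view candidates are projected into the image plane (via the LiDAR-camera calibration, which is omitted here) and kept when they overlap an image candidate; the function names and the IoU threshold are hypothetical.

```python
def iou_2d(a, b):
    """Axis-aligned IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_candidates(bev_boxes_in_image, rgb_boxes, iou_thresh=0.5):
    """Keep the indices of bird's-eye-view candidates whose image-plane
    projection overlaps at least one RGB-image candidate (assumed rule)."""
    matched = []
    for i, pb in enumerate(bev_boxes_in_image):
        if any(iou_2d(pb, rb) >= iou_thresh for rb in rgb_boxes):
            matched.append(i)
    return matched
```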
After feature extraction, the third-row branch uses the same feature fusion method as the first two branches. The fused feature maps of the three branches are then taken out, and the total feature map fusion is carried out at the M positions in the network by pixel-by-pixel addition and averaging to obtain the final feature map fusion result. The 3D target suggestion frame generated at the 3D Proposal stage is projected onto the finally fused feature map to form a 2D target suggestion frame (2D Proposal), the ROI region features corresponding to the 2D target suggestion frame are obtained, and finally the ROI region features are passed through two fully connected layers for classification and Bbox regression to obtain the final 3D target detection candidate frame.
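The total fusion and the classification/regression head can be sketched as follows. This again assumes PyTorch; the common 512 x 38 x 38 feature size, the flattened 512 x 7 x 7 ROI feature, the 4096-unit fully connected layers, the number of classes and the 7-parameter 3D box encoding are all illustrative assumptions, since the text states only that the three feature maps are averaged pixel by pixel and that the ROI features pass through two fully connected layers before classification and box regression.

```python
import torch
import torch.nn as nn

# Branch feature maps after their own two-scale fusion (pixel-wise averaging
# requires them to share the same shape; sizes here are assumed).
FB = torch.randn(1, 512, 38, 38)   # bird's-eye-view branch
FG = torch.randn(1, 512, 38, 38)   # RGB-image branch
FF = torch.randn(1, 512, 38, 38)   # front-view branch

# Total feature map fusion by pixel-wise addition and averaging.
F_fused = (FB + FG + FF) / 3.0

# Head for each ROI feature: two fully connected layers, then separate
# classification and box-regression outputs.
num_classes = 4                                  # assumed
roi_feature = torch.randn(1, 512 * 7 * 7)        # assumed flattened ROI feature

fc_head = nn.Sequential(
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
)
cls_layer = nn.Linear(4096, num_classes)         # target classification
reg_layer = nn.Linear(4096, 7)                   # assumed 3D box parameters
                                                 # (x, y, z, l, w, h, yaw)

h = fc_head(roi_feature)
scores, box_deltas = cls_layer(h), reg_layer(h)
```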
Table 1 further illustrates the parameters of the VGG16 feature extraction layers, where Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels.
TABLE 1 VGG16 network architecture (without fully connected layers)
Input image
Group 1: Conv3-64, Conv3-64, maxpool
Group 2: Conv3-128, Conv3-128, maxpool
Group 3: Conv3-256, Conv3-256, Conv3-256, maxpool
Group 4: Conv3-512, Conv3-512, Conv3-512, maxpool
Group 5: Conv3-512, Conv3-512, Conv3-512
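Since Table 1 matches the standard VGG16 convolutional trunk, it can be written down directly. The sketch below (PyTorch assumed) builds the 13 convolutional layers in 5 groups with 3 × 3 kernels and stride 1, and inserts the 4 max-pooling layers as Table 1 lists them (after groups 1 to 4), with no fully connected layers.

```python
import torch.nn as nn

def vgg16_trunk(in_channels=3):
    """VGG16 feature extractor of Table 1: 13 conv layers in 5 groups,
    3x3 kernels with stride 1 and padding 1, ReLU after each conv,
    2x2 max pooling (stride 2) after groups 1-4, no fully connected layers."""
    cfg = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
    layers, c_in = [], in_channels
    for group_idx, (c_out, n_convs) in enumerate(cfg, start=1):
        for _ in range(n_convs):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            c_in = c_out
        if group_idx <= 4:            # the 4 pooling layers of the description
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

backbone = vgg16_trunk()              # accepts any (N, 3, H, W) input
```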
The invention provides a 3D target detection method, which belongs to the field of artificial intelligence.
The invention provides a 3D target detection method, which is realized by adopting a deep convolutional network and specifically comprises the following steps:
(1) preparing a data set to be processed, comprising an original road vehicle RGB image data set and a corresponding target point cloud data set O3D, whose elements are the three-dimensional coordinates of the point cloud data in the target point cloud data set (formulas 1 and 2, rendered as images in the original document, define these coordinates);
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network (such as VGG16), and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
FB = FB1 + FB3 (formula 3)
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the candidate positions of the target in the bird's-eye view;
In step (8), a specific network implementation of the two-dimensional image target detection method is shown in Figs. 2 and 3. In Fig. 2, after an image is input, feature extraction is performed with the first 7 convolutional layers of the VGG16 network to generate a multi-channel feature map; that is, the feature map is obtained after the input image passes through the 7th convolutional layer of the VGG16 network, and this feature map is then sent to the RPN network. Compared with the traditional Selective Search, using an RPN network to extract the target candidate frames, i.e. the target candidate positions, on the feature map is less time-consuming, and most importantly the RPN can be integrated into a deep learning network to realize end-to-end operation. For each position on the feature map there are 9 possible candidate windows, which realizes coverage of targets of different scales: the candidate window sizes are {128, 256, 512} combined with three aspect ratios {1:1, 1:2, 2:1}. The RPN network therefore outputs feature region windows of different sizes as the input of the next layer. Since the size of the feature region map fed to the next layer is not fixed, the common practice is to crop and stretch it, but this disturbs the original information. Therefore an SPP network is added afterwards, as shown in Fig. 3. Three pooling templates are used in the SPP network: M1 (4 × 4), M2 (2 × 2) and M3 (1 × 1). The 4 × 4 template of M1 means that the ROI map is divided into 4 × 4 = 16 image blocks and the average value of each block is taken, i.e. a 16-dimensional vector is obtained; M2 and M3 are interpreted in the same way. After the feature region windows of different sizes are processed with these templates, results of sizes 4 × 4 × 512, 2 × 2 × 512 and 1 × 1 × 512 are obtained. The three results are straightened into sizes 16 × 512, 4 × 512 and 1 × 512 respectively and then concatenated along one dimension, giving a final length of 21 × 512. The purpose is to transform the feature maps of feature regions of different sizes into a fully connected layer of uniform length. The SPP network is more robust than simple pooling; because it handles feature regions of different aspect ratios and sizes, it improves the scale invariance of the images, converges more easily, and improves the accuracy of the experiments.
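The SPP pooling just described (averaging over 4 × 4, 2 × 2 and 1 × 1 blocks, then concatenation into a 21 × 512 vector) can be reproduced with adaptive average pooling. The sketch below assumes PyTorch and a 512-channel ROI feature map of arbitrary spatial size.

```python
import torch
import torch.nn.functional as F

def spp_pool(roi_feature_map):
    """Spatial pyramid pooling with templates M1 (4x4), M2 (2x2), M3 (1x1).

    roi_feature_map: tensor of shape (C, H, W); H and W may vary because the
    RPN outputs feature region windows of different sizes and aspect ratios.
    Each template splits the map into blocks and averages every block, giving
    16 + 4 + 1 = 21 values per channel, i.e. a vector of length 21 * C.
    """
    x = roi_feature_map.unsqueeze(0)                       # (1, C, H, W)
    pooled = [F.adaptive_avg_pool2d(x, output_size=s).flatten(start_dim=2)
              for s in (4, 2, 1)]                          # 16, 4 and 1 blocks
    return torch.cat(pooled, dim=2).flatten()              # length 21 * C

vector = spp_pool(torch.randn(512, 23, 37))                # arbitrary ROI size
print(vector.numel())                                      # 21 * 512 = 10752
```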
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network (such as VGG16), and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the candidate positions of targets in the original road vehicle RGB image by the two-dimensional image target detection method;
FG = FG1 + FG3 (formula 4)
(12) feeding the positions of the target candidate frames (i.e. the candidate positions of the target) in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network (such as VGG16), and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
FF = FF1 + FF3 (formula 5)
(16) Fusing the feature maps FB, FG and FF obtained in the steps (7), (11) and (15) by an element average value method to obtain a feature map F;
(17) removing the height information from the 3D target suggestion frame obtained in step (12) and projecting it onto the feature map F to form a 2D target suggestion frame (a top-down projection process), and obtaining the ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target.
Thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later performance test of the selected optimal model or for practical application.
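Steps (3) and (4) above project the LiDAR point cloud onto a two-dimensional plane (bird's-eye view) and onto a cylindrical plane (front view). The discretization details are not given in the text; the numpy sketch below uses an assumed grid range, cell resolution and angular resolution, and encodes one value per cell (maximum height for the bird's-eye view, range for the front view) purely for illustration.

```python
import numpy as np

def bird_eye_view(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.1):
    """Project (N, 3) LiDAR points onto the ground plane as a height map.
    Grid ranges and cell resolution (metres per cell) are assumed values."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = (x >= x_range[0]) & (x < x_range[1]) & \
           (y >= y_range[0]) & (y < y_range[1])
    x, y, z = x[keep], y[keep], z[keep]
    cols = ((x - x_range[0]) / res).astype(int)
    rows = ((y - y_range[0]) / res).astype(int)
    bev = np.zeros((int((y_range[1] - y_range[0]) / res),
                    int((x_range[1] - x_range[0]) / res)), dtype=np.float32)
    np.maximum.at(bev, (rows, cols), z)      # keep max height per occupied cell
    return bev

def front_view(points, h_res=np.radians(0.4), v_res=np.radians(0.4)):
    """Project (N, 3) LiDAR points onto a cylinder: columns indexed by azimuth,
    rows by elevation angle.  Angular resolutions are assumed values."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    azimuth = np.arctan2(y, x)
    elevation = np.arctan2(z, np.sqrt(x ** 2 + y ** 2))
    cols = (azimuth / h_res).astype(int)
    rows = (elevation / v_res).astype(int)
    cols -= cols.min()
    rows -= rows.min()
    fv = np.zeros((rows.max() + 1, cols.max() + 1), dtype=np.float32)
    fv[rows, cols] = np.sqrt(x ** 2 + y ** 2 + z ** 2)   # encode range per cell
    return fv
```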
Although illustrative embodiments of the present invention have been described above to facilitate the understanding of the present invention by those skilled in the art, it should be understood that the present invention is not limited in scope to the specific embodiments. Such variations are obvious and all the inventions utilizing the concepts of the present invention are intended to be protected.

Claims (4)

1. A3D target detection method is characterized in that the method is realized by adopting a deep convolutional network, and specifically comprises the following steps:
(1) preparing a data set to be processed, comprising an original road vehicle RGB image data set and a corresponding target point cloud data set O3D, whose elements are the three-dimensional coordinates of the point cloud data in the target point cloud data set (formulas 1 and 2, rendered as images in the original document, define these coordinates);
(2) dividing the original road vehicle RGB image data set and the corresponding target point cloud data set in the step (1) into a training set, a verification set and a test set, wherein the verification set and the test set are used for testing the detection performance of the deep convolutional network after the deep convolutional network is trained;
(3) projecting the point cloud data in the training set in the step (2) to a two-dimensional plane to obtain a bird's-eye view;
(4) projecting the point cloud data in the training set in the step (2) to a cylindrical plane to obtain a front view;
(5) performing feature extraction on the bird's-eye view in step (3) by using a feature extraction network, and obtaining feature maps FB1 and FB2 after the convolutional layers, where the size of FB1 is 2 times that of FB2;
(6) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FB2 obtained in step (5) to obtain a feature map FB3;
(7) summing the feature maps FB3 and FB1 obtained in steps (5) and (6) to perform fusion and obtain a feature map FB;
FB = FB1 + FB3 (formula 3)
(8) applying a two-dimensional image target detection method to the feature map FB to extract the target candidate positions in the bird's-eye view, thereby generating the target candidate positions in the bird's-eye view;
(9) performing feature extraction on the original road vehicle RGB images in the training set of step (2) by using a feature extraction network, and obtaining feature maps FG1 and FG2 after the convolutional layers, where the size of FG1 is 2 times that of FG2;
(10) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FG2 of step (9) to obtain a feature map FG3;
(11) summing the feature maps FG1 and FG3 obtained in steps (9) and (10) to perform fusion and obtain a feature map FG, and generating the target candidate positions in the original road vehicle RGB image by the two-dimensional image target detection method;
FG = FG1 + FG3 (formula 4)
(12) feeding the target candidate positions in the bird's-eye view and the original road vehicle RGB image obtained in steps (8) and (11) into a 3D candidate frame generation sub-network, and performing spatial matching and combination to generate a 3D target suggestion frame;
(13) performing feature extraction on the front view in step (4) by using a feature extraction network, and obtaining feature maps FF1 and FF2 after the convolutional layers, where the size of FF1 is 2 times that of FF2;
(14) performing 2-fold upsampling, i.e. a deconvolution operation, on the feature map FF2 obtained in step (13) to obtain a feature map FF3;
(15) summing the feature maps FF3 and FF1 obtained in steps (13) and (14) to perform fusion and obtain a feature map FF;
FF = FF1 + FF3 (formula 5)
(16) Fusing the feature maps FB, FG and FF obtained in the steps (7), (11) and (15) by an element average value method to obtain a feature map F;
(17) projecting the 3D target suggestion frame obtained in the step (12) on a feature map F to form a 2D target suggestion frame, and obtaining an ROI (region of interest) feature map corresponding to the 2D target suggestion frame;
(18) passing the ROI region feature map obtained in step (17) through two fully connected layers and then performing target classification and regression to obtain the final 3D target detection frame, i.e. the final result of the 3D target detection is a 3D frame drawn around the target;
thus the training of the deep convolutional network is completed with the training set, the optimal deep convolutional network training model is then selected with the verification set, and the test set is used for the later test of the selected optimal model or for practical application, thereby realizing the 3D target detection method.
2. The 3D object detection method according to claim 1, wherein the feature extraction network in the step (5), the step (9) and the step (13) is VGG 16.
3. The 3D object detection method according to claim 2, wherein the structure of the deep convolutional network comprises three branches from top to bottom, wherein the input image of the first-row branch is the bird's-eye view obtained by projecting point cloud data onto a two-dimensional plane; the input of the second-row branch is the unprocessed original road vehicle RGB image; the input of the third-row branch is the front view obtained by projecting point cloud data onto a cylindrical plane; the inputs of the three branches are all processed by a VGG16 feature extraction network, the VGG16 feature extraction network comprises 13 convolutional layers and 4 pooling layers in total but no fully connected layers, and the 4 pooling layers divide the convolutional layers of the VGG16 feature extraction network into 5 groups, wherein group 1 comprises 2 convolutional layers whose parameter representation is Conv3-64; group 2 comprises 2 convolutional layers whose parameter representation is Conv3-128; group 3 comprises 3 convolutional layers whose parameter representation is Conv3-256; group 4 comprises 3 convolutional layers whose parameter representation is Conv3-512; group 5 comprises 3 convolutional layers whose parameter representation is Conv3-512; after each group of convolutional layers, max pooling is performed with 1 pooling layer with the same parameters, the pooling kernel size being 2 × 2 and the stride being 2; the parameter representation Conv3-64 indicates that the convolution kernel size of the current layer is 3 × 3 and the output has 64 channels; Conv3-128 indicates a 3 × 3 kernel with 128 output channels; Conv3-256 indicates a 3 × 3 kernel with 256 output channels; Conv3-512 indicates a 3 × 3 kernel with 512 output channels; the convolution kernel sizes in the VGG16 feature extraction network are uniformly set to 3 × 3 with stride 1;
in the structure of the deep convolutional network, a total of 5 feature maps are contained in Block1 in the first-row branch; each feature map except the first is obtained by processing the previous feature map through the corresponding group of convolutional layers plus pooling layer in the VGG16 feature extraction network, and the first feature map in Block1 is obtained by processing the input image through the 1st group of convolutional layers plus pooling layer in the VGG16 feature extraction network; a deconvolution operation of 2-fold upsampling is performed on the feature map obtained by processing the last group of convolutional layers and pooling layer of the VGG16 feature extraction network, feature fusion is performed between this feature map and the feature map obtained by the 3rd convolutional layer of the 4th group of the VGG16 feature extraction network, and the target candidate positions in the bird's-eye view are thereby generated, wherein the parameters of the deconvolution layer performing the deconvolution operation are: kernel_size = 4, padding = 1 and strides = 2, which are preset according to the multiple of upsampling;
in the structure of the deep convolutional network, Block1 in the second-row branch performs the same operations as Block1 in the first-row branch, and the deconvolution operation in the second-row branch and its parameters are the same as those in the first-row branch, so as to obtain the target candidate positions in the original road vehicle RGB image output by the second-row branch;
after the third row of branches are subjected to feature extraction, the feature fusion method is the same as that of the first row of branches and that of the second row of branches, the feature maps of the three rows of branches after respective fusion are taken out, and the total feature map fusion is carried out at M positions in the network through pixel-by-pixel addition averaging to obtain a final feature map fusion result.
4. The 3D object detection method according to claim 3, wherein the two-dimensional image object detection method in the steps (8) and (11) is implemented by a network: after an image is input, a multi-channel feature map is generated by adopting a VGG16 network, the feature map is sent to an RPN network in the next step, the RPN network is used for extracting a target candidate frame on the feature map, namely a target candidate position, each position on the feature map has 9 possible candidate windows, the coverage detection of targets with different scales is realized, the size of the candidate windows is {128, 256, 512} multiplied by three proportions {1:1,1:2,2:1}, so that feature region windows with different sizes are output by the RPN network and are used as the input of the next layer; adding an SPP network after extracting target candidate frames on a feature map, wherein 3 pooling templates M1(4 × 4), M2(2 × 2) and M3(1 × 1) are used in the SPP network, wherein the interpretation of the 4 × 4 template in M1 is that an ROI map is divided into 4 × 4 image blocks, namely 16 blocks, then the average value of each block is taken to obtain a 16-dimensional vector, the interpretation of M2 and M3 is the same as that of M1, after template processing is carried out on feature region windows with different sizes, results with the sizes of 4 × 4 × 512,2 × 2 × 512 and 1 × 1 × 512 are respectively obtained, the three results are respectively straightened to be the sizes of 16 × 512,4 × 512 and 1 × 512, and then are connected in one dimension, and the finally formed length is 21 × 512, so that the feature maps of feature regions with different sizes are uniformly converted into a fully connected layer with uniform length.
CN201911354155.4A 2019-12-25 2019-12-25 3D target detection method Active CN111079685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911354155.4A CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911354155.4A CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Publications (2)

Publication Number Publication Date
CN111079685A true CN111079685A (en) 2020-04-28
CN111079685B CN111079685B (en) 2022-07-26

Family

ID=70317646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911354155.4A Active CN111079685B (en) 2019-12-25 2019-12-25 3D target detection method

Country Status (1)

Country Link
CN (1) CN111079685B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001226A (en) * 2020-07-07 2020-11-27 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method and device and storage medium
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112329678A (en) * 2020-11-12 2021-02-05 山东师范大学 Monocular pedestrian 3D positioning method based on information fusion
CN112488010A (en) * 2020-12-05 2021-03-12 武汉中海庭数据技术有限公司 High-precision target extraction method and system based on unmanned aerial vehicle point cloud data
CN112561997A (en) * 2020-12-10 2021-03-26 之江实验室 Robot-oriented pedestrian positioning method and device, electronic equipment and medium
CN112990229A (en) * 2021-03-11 2021-06-18 上海交通大学 Multi-modal 3D target detection method, system, terminal and medium
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN113158763A (en) * 2021-02-23 2021-07-23 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
WO2021226876A1 (en) * 2020-05-13 2021-11-18 华为技术有限公司 Target detection method and apparatus
CN114494248A (en) * 2022-04-01 2022-05-13 之江实验室 Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN114998856A (en) * 2022-06-17 2022-09-02 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium of multi-camera image
US11688177B2 (en) 2020-05-29 2023-06-27 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Obstacle detection method and device, apparatus, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018094373A1 (en) * 2016-11-21 2018-05-24 Nio Usa, Inc. Sensor surface object detection methods and systems
US20180150703A1 (en) * 2016-11-29 2018-05-31 Autoequips Tech Co., Ltd. Vehicle image processing method and system thereof
CN109597087A (en) * 2018-11-15 2019-04-09 天津大学 A kind of 3D object detection method based on point cloud data
CN109612406A (en) * 2018-12-14 2019-04-12 中铁隧道局集团有限公司 A kind of random detection method of shield tunnel segment assembly ring assembly quality
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018094373A1 (en) * 2016-11-21 2018-05-24 Nio Usa, Inc. Sensor surface object detection methods and systems
US20180150703A1 (en) * 2016-11-29 2018-05-31 Autoequips Tech Co., Ltd. Vehicle image processing method and system thereof
CN108121948A (en) * 2016-11-29 2018-06-05 帷享科技有限公司 Vehicle Image Processing Method and System
CN109597087A (en) * 2018-11-15 2019-04-09 天津大学 A kind of 3D object detection method based on point cloud data
CN109612406A (en) * 2018-12-14 2019-04-12 中铁隧道局集团有限公司 A kind of random detection method of shield tunnel segment assembly ring assembly quality
CN110543858A (en) * 2019-09-05 2019-12-06 西北工业大学 Multi-mode self-adaptive fusion three-dimensional target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN X ET AL: "Multi-view 3D object detection network for autonomous driving", IEEE *
刘俊生: "Research on vehicle detection methods based on the fusion of laser point cloud and image", China Master's Theses Full-text Database, Engineering Science and Technology II *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021226876A1 (en) * 2020-05-13 2021-11-18 华为技术有限公司 Target detection method and apparatus
US11688177B2 (en) 2020-05-29 2023-06-27 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Obstacle detection method and device, apparatus, and storage medium
CN112001226A (en) * 2020-07-07 2020-11-27 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method and device and storage medium
CN112001226B (en) * 2020-07-07 2024-05-28 中科曙光(南京)计算技术有限公司 Unmanned 3D target detection method, device and storage medium
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112052860B (en) * 2020-09-11 2023-12-01 中国人民解放军国防科技大学 Three-dimensional target detection method and system
CN112329678A (en) * 2020-11-12 2021-02-05 山东师范大学 Monocular pedestrian 3D positioning method based on information fusion
CN112488010A (en) * 2020-12-05 2021-03-12 武汉中海庭数据技术有限公司 High-precision target extraction method and system based on unmanned aerial vehicle point cloud data
CN112561997A (en) * 2020-12-10 2021-03-26 之江实验室 Robot-oriented pedestrian positioning method and device, electronic equipment and medium
CN113158763B (en) * 2021-02-23 2021-12-07 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
US11397242B1 (en) 2021-02-23 2022-07-26 Tsinghua University 3D object detection method based on multi-view feature fusion of 4D RaDAR and LiDAR point clouds
CN113158763A (en) * 2021-02-23 2021-07-23 清华大学 Three-dimensional target detection method based on multi-view feature fusion of 4D millimeter waves and laser point clouds
CN112990229A (en) * 2021-03-11 2021-06-18 上海交通大学 Multi-modal 3D target detection method, system, terminal and medium
CN113011317B (en) * 2021-03-16 2022-06-14 青岛科技大学 Three-dimensional target detection method and detection device
CN113011317A (en) * 2021-03-16 2021-06-22 青岛科技大学 Three-dimensional target detection method and detection device
CN114494248A (en) * 2022-04-01 2022-05-13 之江实验室 Three-dimensional target detection system and method based on point cloud and images under different visual angles
CN114998856A (en) * 2022-06-17 2022-09-02 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium of multi-camera image
CN114998856B (en) * 2022-06-17 2023-08-08 苏州浪潮智能科技有限公司 3D target detection method, device, equipment and medium for multi-camera image

Also Published As

Publication number Publication date
CN111079685B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111079685B (en) 3D target detection method
CN111160214B (en) 3D target detection method based on data fusion
Wang et al. Fusing bird’s eye view lidar point cloud and front view camera image for 3d object detection
CN109753885B (en) Target detection method and device and pedestrian detection method and system
US11182644B2 (en) Method and apparatus for pose planar constraining on the basis of planar feature extraction
CN101901343B (en) Remote sensing image road extracting method based on stereo constraint
CN113065546B (en) Target pose estimation method and system based on attention mechanism and Hough voting
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112529015A (en) Three-dimensional point cloud processing method, device and equipment based on geometric unwrapping
CN109285162A (en) A kind of image, semantic dividing method based on regional area conditional random field models
CN109919145B (en) Mine card detection method and system based on 3D point cloud deep learning
CN111046767B (en) 3D target detection method based on monocular image
CN106295613A (en) A kind of unmanned plane target localization method and system
CN110570440A (en) Image automatic segmentation method and device based on deep learning edge detection
CN112712546A (en) Target tracking method based on twin neural network
CN103955942A (en) SVM-based depth map extraction method of 2D image
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
Yang et al. Semantic segmentation in architectural floor plans for detecting walls and doors
CN102867171B (en) Label propagation and neighborhood preserving embedding-based facial expression recognition method
Li et al. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network
CN104463962B (en) Three-dimensional scene reconstruction method based on GPS information video
Wang et al. Instance segmentation of point cloud captured by RGB-D sensor based on deep learning
Huang et al. ES-Net: An efficient stereo matching network
CN114462486A (en) Training method of image processing model, image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant