CN113111722A - Automatic driving target identification method based on improved Mask R-CNN - Google Patents

Automatic driving target identification method based on improved Mask R-CNN Download PDF

Info

Publication number
CN113111722A
CN113111722A (application number CN202110287700.3A)
Authority
CN
China
Prior art keywords
frame
recommendation
automatic driving
cnn
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110287700.3A
Other languages
Chinese (zh)
Inventor
董恩增
杨启娟
佟吉钢
冯进峰
张祖锋
于航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110287700.3A priority Critical patent/CN113111722A/en
Publication of CN113111722A publication Critical patent/CN113111722A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of machine vision and relates to an automatic driving target identification method based on improved Mask R-CNN. S1, reading the picture information and preprocessing it to obtain a feature map of the picture; S2, inputting the feature map into the region proposal network module to obtain recommendation frames; S3, judging through a classification layer whether a target exists in each recommendation frame, distinguishing target from background within the recommendation frame, determining the target position by bounding-box regression, determining the region of interest (ROI) from the screened feature map, and removing redundant recommendation frames with the non-maximum suppression (NMS) algorithm to obtain accurate recommendation frames; S4, after the ROI is processed, the mask module segments each ROI with an FCN and outputs a feature map; S5, the classification and frame regression module collects the ROI area, computes the classification loss and the Kullback-Leibler-loss-based bounding-box regression loss within the module, and determines the accurate recommendation frame with the NMS method, thereby realizing identification and segmentation of the targets in the picture.

Description

Automatic driving target identification method based on improved Mask R-CNN
Technical Field
The invention belongs to the technical field of machine vision, and particularly relates to an automatic driving target identification method based on improved Mask R-CNN.
Background
Automobile automatic driving technology started with Google's driverless car project. In recent years, automatic driving has developed rapidly along with the continuous application of deep learning in image processing, making road environment perception methods based on deep learning feasible for automatic driving. However, in the field of automatic driving, road traffic conditions are complex, vehicle density varies widely, and vehicle speeds fluctuate considerably, so the requirements for identifying vehicles and their surrounding environment are high. Visual perception is the most important component of automatic driving, yet severe weather such as rain, snow and haze, complex road conditions, and dense flows of pedestrians and vehicles still pose challenges to visual perception algorithms.
Research shows that, compared with other deep-learning-based image segmentation methods such as SDS, CFM and MNC, the Mask R-CNN-based image segmentation method can detect and segment different individuals of the same category and greatly improves segmentation accuracy. Mask R-CNN is one of the mainstream frameworks for CNN (convolutional neural network)-based target identification and segmentation: a ResNet-101 + FPN feature extraction network extracts features from the input picture, 9 anchor boxes are then predicted for each pixel of the feature maps, and the 300 anchor boxes with the highest classification scores are selected as the final ROI regions. Finally, the ROI regions are sent to the Mask module, the Classification module and the Bounding box regression module to determine the target category and obtain the accurate target position. Mask R-CNN is a computer vision algorithm relatively close to real human visual perception and has high application value in the field of automatic driving. However, the Mask R-CNN algorithm suffers from poor edge segmentation and poor segmentation of small targets when target bounding boxes are ambiguous.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an automatic driving target identification method based on improved Mask R-CNN. By improving the Mask R-CNN algorithm, the method copes well with complex and changeable traffic environments and accurately detects and segments the vehicle and the road conditions around it, and the improved method has good applicability in automatic driving applications.
Automatic detection, segmentation and identification of large, medium and small targets, especially vehicles and pedestrians, is a key technology for visual perception in automatic driving. Aiming at complex road traffic conditions and the high-precision requirements that changing vehicle speeds place on target segmentation and identification, the invention provides an improved Mask R-CNN algorithm based on KL loss. A ResNet-101 + FPN feature extraction network is adopted to obtain better feature maps and make full use of the features extracted at each stage; KL loss, a network structure that estimates position confidence, is adopted for bounding-box regression, which greatly improves segmentation accuracy for targets with ambiguous bounding boxes while adding almost no extra computational cost. To improve the detection and segmentation of the algorithm in severe weather such as rain and snow, the model is trained on a combination of 8000 pictures from the automatic driving data set Cityscapes and 1942 pictures from the MS-COCO data set. Experimental results show that, compared with Mask R-CNN, the proposed algorithm markedly improves segmentation precision and recall and exhibits good generalization ability and practicability in automatic driving scenes.
In order to achieve the purpose, the invention adopts the following technical scheme:
the automatic driving target identification method based on the improved Mask R-CNN comprises the following steps,
S1, reading the picture information and preprocessing it to obtain a feature map of the picture;
S2, inputting the feature map into the region proposal network module to obtain recommendation frames;
S3, judging through a classification layer whether a target exists in each recommendation frame, distinguishing target from background within the recommendation frame, determining the target position by bounding-box regression, determining the region of interest (ROI) from the screened feature map, and removing redundant recommendation frames with the non-maximum suppression (NMS) algorithm to obtain accurate recommendation frames;
S4, after the ROI is processed, the mask module segments each ROI with the FCN and outputs a feature map;
S5, the classification and frame regression module collects the ROI area, computes the classification loss and the Kullback-Leibler-loss-based bounding-box regression loss within the module, and determines the accurate recommendation frame with the NMS method, realizing identification and segmentation of the targets in the picture.
In a further optimization of the present technical solution, in step S1 the picture is first scaled, then input into the residual network 101 + feature pyramid feature extraction network of the feature extraction network module, and the feature map of the picture is extracted after passing through the full convolution network.
In a further optimization of the present technical solution, in step S2 the region proposal network module traverses the feature map using a sliding window and predicts a plurality of anchor frames for each pixel to generate recommendation frames.
In a further optimization of the present technical solution, the size of the sliding window is 3 × 3.
In a further optimization of the technical scheme, each pixel predicts anchor frames at 6 scales {2, 4, 8, 16, 64, 256} and 9 aspect ratios {0.3:1, 0.5:1, 0.7:1, 0.9:1, 1:1, 1.5:1, 2:1, 2.5:1, 3:1}, for a total of 54 anchor frames.
In a further optimization of the technical scheme, the reference window of the anchor frame is set to 16 × 16, so the area S_k of the anchor frame is

S_k = (16 · 2^k)²,  k ∈ [1, 6]   (1)

With the anchor-frame aspect ratio a:1, the width W_k and height H_k of each anchor frame are

W_k = √(S_k / a)   (2)
H_k = √(S_k · a)   (3)
In a further optimization of the technical solution, the threshold screening formula of the NMS algorithm in step S3 is as follows,

s_i = { s_i,  IoU(M, b_i) < N_t ;  0,  IoU(M, b_i) ≥ N_t }   (4)

where B = {b_1, b_2, ..., b_n} is the series of initial detection frames, S = {s_1, s_2, ..., s_n} are their corresponding classification scores, M is the detection frame with the highest current score, and N_t is the overlap (IoU) threshold.
In a further optimization of the present technical solution, the processing of the ROI in step S4 is to perform a bilinear interpolation alignment (RoIAlign) operation on the ROI and fix the ROI to a uniform size.
In a further optimization of the present technical solution, in step S5 the classification result of the target object and the offsets of the boundary regression are obtained through the full connection layer, as shown in formulas (5) and (6),

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (5)

L_reg = D_KL(P_D(x) ∥ P_θ(x)) = (x_g − x_e)² / (2σ²) + (1/2) log σ² + (1/2) log 2π − H(P_D(x))   (6)

In formula (5), L_cls(p_i, p_i*) represents the classification loss, defined as L_cls(p_i, p_i*) = −log[p_i* p_i + (1 − p_i*)(1 − p_i)], where p_i is the probability that the region recommendation is predicted to be a target object, p_i* is the label of the real calibration frame, N_cls and N_reg are normalization terms, and λ is a balancing weight; L_reg(t_i, t_i*) represents the bounding-box regression loss, defined as smooth_L1(t − t*), where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise. In formula (6), x_g is the basic GT bounding-box position, x_e is the position of the bounding box to be estimated, D_KL is the KL distance, P_D is the basic GT Dirac function, P_θ is the predicted Gaussian distribution, and H(P) is the information entropy.
In a further optimization of the present technical solution, in step S5 the boundary regression loss is defined as the KL distance between the predicted distribution and the distribution of the real calibration frame, and the bounding-box position and the standard deviation of the bounding-box position are used together to evaluate the KL loss and to regress the bounding box.
Different from the prior art, the technical scheme has the following advantages:
A) The automatic driving technology imposes stricter requirements on the detection accuracy and miss rate for tiny targets and occluded objects. The feature extraction network is ResNet-101 + FPN: ResNet adopts cross-layer connections, which makes training easier, and FPN achieves better feature-map fusion. The method obtains feature maps through a bottom-up convolutional neural network and then fuses them via top-down and lateral connections, so that each fused network layer carries both deep and shallow features and the features of every stage are fully exploited.
B) The aspect ratios and scales of the anchor boxes in the Region Proposal Network module are modified to match the field-of-view requirements of automatic driving and the fact that target objects in automatic driving are often small and numerous. The modified anchor boxes improve the detection capability of the RPN and show a notably better recall rate in the detection and segmentation of small targets.
C) In its segmentation task the original Mask R-CNN algorithm depends on regional information, adopts an average cross-entropy loss function, and locates targets by bounding-box regression, so ambiguous bounding boxes degrade boundary segmentation accuracy, which in turn affects the vehicle's driving decisions and the timeliness of its actions during automatic driving. The method adopts KL loss, a network structure that estimates position confidence, to regress the bounding box. This loss function can greatly improve the accuracy of various frameworks at almost no extra computational cost and improves segmentation accuracy for small targets with ambiguous bounding boxes, so that timely and effective decisions can be made during automatic driving.
D) The network model is trained in a staged training mode, and parameter adjustment is performed according to each stage, so that invalid training is avoided, and video memory is saved.
E) The training set adopts a mixed data set of an automatic driving data set Cityscapes and MS-COCO, and the detection segmentation precision and the generalization capability of the model under various weather conditions and complex traffic environments are effectively improved.
Drawings
FIG. 1 is a diagram of an improved Mask R-CNN network;
FIG. 2 is a schematic diagram of a ResNet-101+ FPN feature extraction network;
FIG. 3 is a diagram of a network structure for estimating confidence KL loss of a bounding box position;
FIG. 4 is a graph of Precision-Recall relationship;
FIG. 5 is a graph of the effect of segmentation detection in snowy weather with air disturbances;
FIG. 6 is a graph of the effect of segmentation detected in rainy weather with air disturbances;
FIG. 7 is a diagram of segmentation effect on various vehicles and pedestrians in a complex road scene;
FIG. 8 is a segmentation effect diagram of the target vehicle with occlusion and truncation;
FIG. 9 is a segmentation effect diagram of a target vehicle under different degrees of shielding or smaller targets;
fig. 10 is a diagram showing the effect of segmentation at an intersection where vehicles and pedestrians are comparatively concentrated.
Detailed Description
To explain technical contents, structural features, and objects and effects of the technical solutions in detail, the following detailed description is given with reference to the accompanying drawings in conjunction with the embodiments.
The invention provides an automatic driving target identification method based on improved Mask R-CNN, which comprises the following steps:
S1, the picture information is read and preprocessed: the picture is scaled to 1600 × 700 and then enters the residual network 101 + feature pyramid (ResNet-101 + FPN) feature extraction network in the Feature extraction network module; fig. 1 is the improved Mask R-CNN network structure diagram. After passing through the 91-layer full convolution network formed by Conv1, Conv2_x, Conv3_x and Conv4_x of ResNet-101 + FPN, the feature maps of the picture are extracted.
S2, the feature map output by the feature extraction network module enters the Region Proposal Network module; fig. 2 is a schematic diagram of the ResNet-101 + FPN feature extraction network. The Region Proposal Network module traverses the feature maps with a 3 × 3 sliding window, predicts multiple anchor boxes for each pixel, and generates proposal boxes. So that the anchor boxes essentially cover the various scales and shapes of the target objects, after extensive experimental verification the invention sets 6 anchor scales {2, 4, 8, 16, 64, 256} and 9 aspect ratios {0.3:1, 0.5:1, 0.7:1, 0.9:1, 1:1, 1.5:1, 2:1, 2.5:1, 3:1} per pixel, giving 54 anchor boxes in total. The reference window of the anchor box is set to 16 × 16, so the area S_k of the anchor box is given by formula (1).
S_k = (16 · 2^k)²,  k ∈ [1, 6]   (1)

With the anchor-box aspect ratio a:1, the width W_k and height H_k of each anchor box are given by formulas (2) and (3).

W_k = √(S_k / a)   (2)
H_k = √(S_k · a)   (3)
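For illustration, the following is a minimal Python sketch of how the 54 anchor shapes per pixel could be enumerated from formulas (1)-(3); the function name and the printed check are illustrative and not part of the patented implementation.

```python
import numpy as np

def generate_anchor_shapes(base=16,
                           ratios=(0.3, 0.5, 0.7, 0.9, 1.0, 1.5, 2.0, 2.5, 3.0)):
    """Enumerate (width, height) for 6 scales x 9 aspect ratios = 54 anchors per pixel."""
    shapes = []
    for k in range(1, 7):                 # k in [1, 6]
        area = (base * 2 ** k) ** 2       # formula (1): S_k = (16 * 2^k)^2
        for a in ratios:                  # aspect ratio a : 1
            w = np.sqrt(area / a)         # formula (2)
            h = np.sqrt(area * a)         # formula (3)
            shapes.append((w, h))
    return np.array(shapes)               # shape (54, 2)

anchors = generate_anchor_shapes()
# The extreme shapes roughly reproduce the 56 x 16 and 1774 x 590 anchors quoted later.
print(anchors.min(axis=0), anchors.max(axis=0))
```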
S3, whether a target exists in each generated proposal box is judged through a binary classification (Softmax) layer, target and background are distinguished within the proposal boxes, the positions of the proposal boxes are determined by bounding-box regression (BBoxes regression), and the final Region of Interest (ROI) areas are determined from the roughly 300 screened proposal boxes. Redundant target boxes are removed with the Non-Maximum Suppression (NMS) algorithm to obtain accurate proposal boxes; the threshold screening method of the NMS algorithm is shown in formula (4).
s_i = { s_i,  IoU(M, b_i) < N_t ;  0,  IoU(M, b_i) ≥ N_t }   (4)

where B = {b_1, b_2, ..., b_n} is the series of initial detection boxes, S = {s_1, s_2, ..., s_n} are their corresponding classification scores, M is the detection box with the highest current score, and N_t is the overlap (IoU) threshold.
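A compact NumPy sketch of the hard-threshold NMS screening of formula (4) follows; the box format [x1, y1, x2, y2] and the default threshold value are assumptions chosen only for illustration.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all given as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, nt=0.7):
    """Greedy NMS per formula (4): drop b_i whenever IoU(M, b_i) >= N_t."""
    order = np.argsort(scores)[::-1]      # highest score first
    keep = []
    while order.size > 0:
        m = order[0]                      # M: current highest-scoring box
        keep.append(int(m))
        rest = order[1:]
        overlaps = box_iou(boxes[m], boxes[rest])
        order = rest[overlaps < nt]       # boxes with IoU >= N_t get score 0, i.e. removed
    return keep
```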
S4, a bilinear interpolation alignment (RoIAlign) operation is performed on the ROI regions, and the ROI regions are fixed to a uniform size.
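A sketch of this bilinear-interpolation alignment using TensorFlow's crop-and-resize primitive is shown below; using tf.image.crop_and_resize as a stand-in for RoIAlign is an approximation chosen for illustration, not the exact operator of the patent.

```python
import tensorflow as tf

def roi_align(feature_map, rois, output_size=(7, 7)):
    """Fix every ROI to a uniform size by bilinear interpolation.

    feature_map: [1, H, W, C] tensor from the backbone.
    rois: [N, 4] boxes given as normalised (y1, x1, y2, x2) coordinates.
    """
    box_indices = tf.zeros([tf.shape(rois)[0]], dtype=tf.int32)  # all ROIs come from image 0
    return tf.image.crop_and_resize(feature_map, rois, box_indices, output_size)
```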
S5, the Mask module segments each ROI region with the FCN network and outputs K m × m feature maps, i.e. m × m binary masks for K classes, which gives the m × m spatial layout and thereby the segmentation mask of the target.
S6, the Classification and Bounding box regression module collects the ROI areas; within the module the ROI areas are used to compute the classification loss and the Kullback-Leibler loss (KL loss)-based bounding-box regression loss; fig. 3 is the network structure diagram of the KL loss that estimates the confidence of the bounding-box position. The classification result of the target object and the offsets of the boundary regression are obtained through the fully connected layers (FC layers), as shown in formulas (5) and (6).
L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (5)

L_reg = D_KL(P_D(x) ∥ P_θ(x)) = (x_g − x_e)² / (2σ²) + (1/2) log σ² + (1/2) log 2π − H(P_D(x))   (6)

In formula (5), L_cls(p_i, p_i*) represents the classification loss, defined as L_cls(p_i, p_i*) = −log[p_i* p_i + (1 − p_i*)(1 − p_i)], where p_i is the probability that the region proposal is predicted to be a target object, p_i* is the ground-truth (GT) label, N_cls and N_reg are normalization terms, and λ is a balancing weight. L_reg(t_i, t_i*) represents the bounding-box regression loss, defined as smooth_L1(t − t*), where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise.

On this basis, the boundary regression loss is defined as the KL distance between the predicted distribution and the GT distribution, and the BBox position and the standard deviation of the BBox position are used together to estimate the KL loss and to regress the bounding box. In formula (6), x_g is the basic GT bounding-box position; x_e is the bounding-box position to be estimated; D_KL is the KL distance; P_D is the basic GT Dirac function, P_θ is the predicted Gaussian distribution, and H(P) is the information entropy, which is typically small and fixed.
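As a concrete illustration of formula (6), the following TensorFlow sketch computes the Gaussian-form KL regression loss from the predicted box offset and a predicted log-variance alpha = log(σ²); the constant terms are dropped and the variable names are illustrative assumptions, not the patent's own code.

```python
import tensorflow as tf

def kl_bbox_loss(x_g, x_e, alpha):
    """KL-divergence bounding-box regression loss, formula (6) up to constants.

    x_g: ground-truth box offsets, x_e: predicted box offsets,
    alpha: predicted log-variance log(sigma^2) from the extra std-dev branch.
    The terms log(2*pi)/2 and H(P_D(x)) are constant and omitted because they
    do not influence the gradients.
    """
    sigma_sq = tf.exp(alpha)
    loss = 0.5 * tf.square(x_g - x_e) / sigma_sq + 0.5 * alpha
    return tf.reduce_mean(loss)

# Example usage: loss for a single coordinate with error 0.4 and log-variance -2.
print(kl_bbox_loss(tf.constant([1.0]), tf.constant([0.6]), tf.constant([-2.0])))
```

In the published KL-loss formulation the quadratic term is replaced by a smooth-L1-style term for large errors; this sketch keeps the plain Gaussian form of formula (6) for clarity.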
Establishing the training data set: 8000 training images from the automatic driving data set Cityscapes are selected as the training data set; they contain real image data of street scenes under different driving conditions in different cities. To improve the target segmentation precision of the trained model under severe weather and complex traffic conditions, 1942 training pictures from the MS-COCO data set are added, covering weather conditions such as "snow", "rain" and "sunny" and complex traffic conditions such as "heavy traffic flow" and "traffic congestion". In the experiments, so that the mixed data set could be used by the improved algorithm, it was converted to the MS-COCO data-set format.
Training the network model: ResNet-101 + FPN in the Feature extraction network module is pre-trained on ImageNet, the resulting network model is used as the pre-trained model, and fine-tuning is performed on the hybrid automatic driving data set. Training is divided into three phases: the first phase trains the network heads; the second phase fine-tunes ResNet stage 4 and up; the third phase fine-tunes all layers. The learning rate of the algorithm is set to 0.01, and the parameters are set so that the learning rate decays during the iterations. During initialization, the weights of the fully connected layers are initialized with a random Gaussian, with the standard deviation and mean set to 0.0001 and 0 respectively, so that the KL loss is similar to the standard smooth L1 loss in the initial training phase.
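The optimizer and initializer settings described above can be expressed, for example, with standard TensorFlow/Keras utilities; the decay boundaries, momentum value and layer width below are assumptions used only to make the sketch runnable.

```python
import tensorflow as tf

# Learning rate starts at 0.01 and decays stepwise during the iterations.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=10000, decay_rate=0.9, staircase=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# Fully connected layers are initialised with a random Gaussian
# (mean 0, standard deviation 0.0001) so that the KL loss behaves like the
# standard smooth L1 loss at the start of training.
fc_init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.0001)
fc_layer = tf.keras.layers.Dense(1024, kernel_initializer=fc_init)
```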
The purpose of the algorithm improvement of the invention is to remedy the deficiencies of the Mask R-CNN algorithm so that it meets the technical indicators of the automatic driving task: the aspect ratios and scales of the anchor boxes in the Region Proposal Network module suit medium and large targets but fall short in the detection of small targets; incomplete features extracted by the feature extraction network cause missed detections; and the model generalizes poorly in automatic driving scenes.
In a preferred embodiment of the invention, the automatic driving target identification method based on the improved Mask R-CNN comprises the following steps,
(1) The input picture is first scaled to 1600 × 700 and then passes through Conv1, Conv2_x, Conv3_x and Conv4_x of the ResNet-101 + FPN feature extraction network to obtain the picture feature maps; fig. 2 is a schematic diagram of the ResNet-101 + FPN feature extraction network.
(2) Each pixel of the feature maps obtained in (1) is traversed with its center point as the anchor reference; each anchor point predicts anchor boxes at 6 scales {2, 4, 8, 16, 64, 256} and 9 ratios {0.3:1, 0.5:1, 0.7:1, 0.9:1, 1:1, 1.5:1, 2:1, 2.5:1, 3:1}, and these 54 anchor boxes serve as the initial detection boxes. The largest anchor box is 1774 × 590 and the smallest is 56 × 16, so the 54 anchor boxes essentially cover the various scales and shapes of the target objects. Whether a target exists inside is judged through the Softmax layer, target and background are distinguished within the proposal boxes, their positions are determined by BBoxes regression, and the final ROI areas are then determined from the roughly 300 screened proposal boxes. Finally, redundant target boxes are removed with the NMS algorithm to obtain accurate proposal boxes.
(3) The feature maps obtained in (1) and the proposal boxes obtained in (2) are combined and sent to the Mask module and the Classification & Bounding box regression module: the Mask module segments each ROI area with the FCN to obtain the segmentation mask of the target; the Classification & Bounding box regression module calculates the bounding-box regression loss and classification loss through the KL loss and the standard smooth L1 loss. At the same time, the accurate proposal boxes are determined with the NMS method, completing the identification and segmentation of the targets in one picture.
(4) Establishing the training data set: 8000 training images from the automatic driving data set Cityscapes are selected as the training data set; they contain real image data of street scenes under different driving conditions in different cities. To improve the target segmentation precision of the trained model under severe weather and complex traffic conditions, 1942 training pictures from the MS-COCO data set are added. In the experiments, so that the mixed data set could be used by the improved algorithm, it was converted to the MS-COCO data-set format.
(5) Training the network model: ResNet-101 + FPN in the Feature extraction network module is pre-trained on ImageNet, the resulting network model is used as the pre-trained model, and fine-tuning is performed on the hybrid automatic driving data set. Training is divided into three phases: the first phase trains the network heads; the second phase fine-tunes ResNet stage 4 and up; the third phase fine-tunes all layers. The learning rate of the algorithm is set to 0.01, and the parameters are set so that the learning rate decays during the iterations. During initialization, the weights of the fully connected layers are initialized with a random Gaussian, with the standard deviation and mean set to 0.0001 and 0 respectively.
Results and analysis of the experiments
Experimental environment and parameters
The experiments of the invention were carried out under the Ubuntu 16.04 operating system; the server uses an Intel Xeon Silver 4110 2.10 GHz 8-core CPU and is equipped with 2 Hynix 64 GB DDR4-2666 MHz memory modules and 2 GTX 2080 Ti 11 GB graphics cards. A TensorFlow deep learning framework was built on this basis, and training and testing of the network were implemented in the Python language.
Qualitative and quantitative analysis model accuracy
Accepted evaluation indicators in the target detection and identification task are the Precision-Recall relationship curve, the AP (interpolated Average Precision) value and the mAP (mean Average Precision) value.
The Precision-Recall relationship curve plots Precision on the vertical axis against Recall on the horizontal axis; by adjusting the threshold and observing how the curve changes, the quality of the system's classification of each class of object is evaluated qualitatively.
Precision in the Precision-Recall relationship curve reflects the proportion of true positives among the targets identified as positive; the calculation formula is shown in formula (7),

Precision = TP / (TP + FP)   (7)

where TP denotes true positives and FP denotes false positives.
Recall reflects the proportion of a class of objects that is correctly identified; the calculation formula is shown in formula (8).

Recall = TP / (TP + FN)   (8)

where TP denotes true positives and FN denotes false negatives.
FIG. 4 shows the Precision-Recall curve used for qualitative analysis of the present algorithm. The curves of the various objects bulge toward the upper-right corner of the Precision-Recall graph, which shows that the algorithm identifies targets well and with high accuracy.
The invention uses the AP (interpolated Average Precision) values of the various object classes for quantitative analysis of model precision, and the mAP (mean Average Precision) value to evaluate the detection and identification effect of the algorithm on the data set. The AP value is the area under the Precision-Recall relationship curve and is used to quantify model precision. To avoid a low AP value caused by the instability of the PR curve, the invention uses the interpolated Average Precision calculation: at each recall threshold, the maximum precision at that recall level or beyond is multiplied by the recall increment, and the products obtained over all thresholds are accumulated, as shown in formula (9).

AP = Σ_n (R_n − R_{n−1}) · max_{R ≥ R_n} P(R)   (9)
Wherein P is Precision. R is Recall.
In multi-target detection and identification of pictures, the mAP value is used to measure the quality of the model in the classification task across all object classes. The mAP is the average of the AP values of the several object classes; the larger the value, the higher the detection precision and the better the algorithm's performance.
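A short Python sketch of the interpolated-AP computation of formulas (7)-(9) and of the mAP average follows; it assumes precision and recall arrays have already been computed by sweeping the score threshold, and the toy values at the end are illustrative only.

```python
import numpy as np

def interpolated_ap(recall, precision):
    """Area under the interpolated Precision-Recall curve, per formula (9)."""
    r = np.concatenate(([0.0], np.asarray(recall, float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision, float), [0.0]))
    # At each recall level keep the maximum precision at that recall or beyond.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Accumulate precision x recall-increment wherever recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_ap(ap_per_class):
    """mAP: the average of the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))

print(interpolated_ap([0.2, 0.5, 1.0], [1.0, 0.8, 0.5]))   # toy example
```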
The data set is the hybrid automatic driving data set, and the measurement indicators are AP_small (target pixel area less than 32²), AP_medium (target pixel area greater than 32² and less than 96²) and AP_large (target pixel area greater than 96²). The experimental results are shown in Table 1.
TABLE 1 AP value comparison for large, medium and small targets

                         | mAP  | AP_s | AP_m | AP_l
BBox reg                 | 23.6 | 33.2 | 32.1 | 37.1
BBox reg + BBox reg Std  | 30.9 | 34.9 | 33.5 | 40.2
Table 2 shows the comparison of the maps of the algorithm of the present invention under different network structures.
Table 2 mAP value comparison under different network architectures
The following conclusions can be drawn from the experimental results: compared with the original method without the FPN network, adding the FPN network to the basic ResNet-101 network structure gives the algorithm a clear improvement in mAP value; although the ResNeXt-101 + FPN combination can raise the mAP value further, its larger number of network parameters limits inference speed and cannot meet the quick-response requirement of automatic driving.
Results of the experiment
The test results of the algorithm of the invention after training on the automatic driving hybrid data set are shown in figs. 5-10. Figs. 5 and 6 show the detection and segmentation effect of the trained network model under airborne interference in snowy and rainy weather. Fig. 7 shows that the algorithm segments various vehicles and pedestrians well in a complex road scene; in fig. 8 the leftmost target vehicle is occluded, but the algorithm of the invention can still lock onto and segment it accurately; for the target vehicles in fig. 9 that are blurred by tree occlusion on the left, the segmentation algorithm can still overcome the occlusion and accurately detect and identify them. Fig. 10 shows that at an intersection where vehicles and pedestrians are concentrated, the precision of the segmentation algorithm of the invention does not drop, and bicycles and pedestrians are segmented accurately.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrases "comprising … …" or "comprising … …" does not exclude the presence of additional elements in a process, method, article, or terminal that comprises the element. Further, herein, "greater than," "less than," "more than," and the like are understood to exclude the present numbers; the terms "above", "below", "within" and the like are to be understood as including the number.
Although the embodiments have been described, once the basic inventive concept is obtained, other variations and modifications of these embodiments can be made by those skilled in the art, so that the above embodiments are only examples of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes using the contents of the present specification and drawings, or any other related technical fields, which are directly or indirectly applied thereto, are included in the scope of the present invention.

Claims (10)

1. The automatic driving target identification method based on the improved Mask R-CNN is characterized by comprising the following steps,
S1, reading the picture information and preprocessing it to obtain a feature map of the picture;
S2, inputting the feature map into the region proposal network module to obtain recommendation frames;
S3, judging through a classification layer whether a target exists in each recommendation frame, distinguishing target from background within the recommendation frame, determining the target position by bounding-box regression, determining the region of interest (ROI) from the screened feature map, and removing redundant recommendation frames with the non-maximum suppression (NMS) algorithm to obtain accurate recommendation frames;
S4, after the ROI is processed, the mask module segments each ROI with the FCN and outputs a feature map;
S5, the classification and frame regression module collects the ROI area, computes the classification loss and the Kullback-Leibler-loss-based bounding-box regression loss within the module, and determines the accurate recommendation frame with the NMS method, realizing identification and segmentation of the targets in the picture.
2. The automatic driving target identification method based on improved Mask R-CNN as claimed in claim 1, wherein in step S1 the picture is first scaled, then input into the residual network 101 + feature pyramid feature extraction network of the feature extraction network module, and the feature map of the picture is extracted after passing through the full convolution network.
3. The automatic driving target identification method based on improved Mask R-CNN as claimed in claim 1, wherein in step S2 the region proposal network module traverses the feature map using a sliding window and predicts a plurality of anchor frames for each pixel to generate recommendation frames.
4. The improved Mask R-CNN based automatic driving target recognition method according to claim 3, wherein the size of the sliding window is 3 x 3.
5. The automatic driving target identification method based on improved Mask R-CNN as claimed in claim 3, wherein each pixel predicts anchor frames at 6 scales {2, 4, 8, 16, 64, 256} and 9 aspect ratios {0.3:1, 0.5:1, 0.7:1, 0.9:1, 1:1, 1.5:1, 2:1, 2.5:1, 3:1}, for a total of 54 anchor frames.
6. The automatic driving target identification method based on improved Mask R-CNN as claimed in claim 3, wherein the reference window of the anchor frame is set to 16 × 16, so the area S_k of the anchor frame is

S_k = (16 · 2^k)²,  k ∈ [1, 6]   (1)

and, with the anchor-frame aspect ratio a:1, the width W_k and height H_k of each anchor frame are

W_k = √(S_k / a)   (2)
H_k = √(S_k · a)   (3)
7. The automatic driving target identification method based on improved Mask R-CNN as claimed in claim 1, wherein the threshold screening formula of the NMS algorithm in step S3 is as follows,

s_i = { s_i,  IoU(M, b_i) < N_t ;  0,  IoU(M, b_i) ≥ N_t }   (4)

where B = {b_1, b_2, ..., b_n} is the series of initial detection frames, S = {s_1, s_2, ..., s_n} are their corresponding classification scores, M is the detection frame with the highest current score, and N_t is the overlap threshold.
8. The improved Mask R-CNN based automatic driving target recognition method according to claim 1, wherein the processing of the ROI in step S4 is to perform bilinear interpolation alignment operation on the ROI, and fix the size of the ROI to a uniform size.
9. The automatic driving target identification method based on improved Mask R-CNN according to claim 1, wherein in step S5 the classification result of the target object and the offsets of the boundary regression are obtained through the fully connected layers, as shown in formulas (5) and (6),

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   (5)

L_reg = D_KL(P_D(x) ∥ P_θ(x)) = (x_g − x_e)² / (2σ²) + (1/2) log σ² + (1/2) log 2π − H(P_D(x))   (6)

wherein in formula (5) L_cls(p_i, p_i*) represents the classification loss, defined as L_cls(p_i, p_i*) = −log[p_i* p_i + (1 − p_i*)(1 − p_i)], where p_i is the probability that the region recommendation is predicted to be a target object, p_i* is the label of the real calibration frame, N_cls and N_reg are normalization terms, and λ is a balancing weight; L_reg(t_i, t_i*) represents the bounding-box regression loss, defined as smooth_L1(t − t*), where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise; and in formula (6) x_g is the basic GT bounding-box position, x_e is the position of the bounding box to be estimated, D_KL is the KL distance, P_D is the basic GT Dirac function, P_θ is the predicted Gaussian distribution, and H(P) is the information entropy.
10. The automatic driving target identification method based on improved Mask R-CNN according to claim 1, wherein in step S5 the boundary regression loss is defined as the KL distance between the predicted distribution and the distribution of the real calibration frame, and the standard deviation of the bounding-box position and the bounding-box position are used together to evaluate the KL loss and to regress the bounding box.
CN202110287700.3A 2021-03-17 2021-03-17 Automatic driving target identification method based on improved Mask R-CNN Pending CN113111722A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110287700.3A CN113111722A (en) 2021-03-17 2021-03-17 Automatic driving target identification method based on improved Mask R-CNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110287700.3A CN113111722A (en) 2021-03-17 2021-03-17 Automatic driving target identification method based on improved Mask R-CNN

Publications (1)

Publication Number Publication Date
CN113111722A true CN113111722A (en) 2021-07-13

Family

ID=76711910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110287700.3A Pending CN113111722A (en) 2021-03-17 2021-03-17 Automatic driving target identification method based on improved Mask R-CNN

Country Status (1)

Country Link
CN (1) CN113111722A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657174A (en) * 2021-07-21 2021-11-16 北京中科慧眼科技有限公司 Vehicle pseudo-3D information detection method and device and automatic driving system
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN115063594A (en) * 2022-08-19 2022-09-16 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN116106899A (en) * 2023-04-14 2023-05-12 青岛杰瑞工控技术有限公司 Port channel small target identification method based on machine learning
CN116469014A (en) * 2023-01-10 2023-07-21 南京航空航天大学 Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN111862119A (en) * 2020-07-21 2020-10-30 武汉科技大学 Semantic information extraction method based on Mask-RCNN
CN112241950A (en) * 2020-10-19 2021-01-19 福州大学 Detection method of tower crane crack image
CN112508168A (en) * 2020-09-25 2021-03-16 上海海事大学 Frame regression neural network construction method based on automatic correction of prediction frame

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447018A (en) * 2018-11-08 2019-03-08 天津理工大学 A kind of road environment visual perception method based on improvement Faster R-CNN
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN111862119A (en) * 2020-07-21 2020-10-30 武汉科技大学 Semantic information extraction method based on Mask-RCNN
CN112508168A (en) * 2020-09-25 2021-03-16 上海海事大学 Frame regression neural network construction method based on automatic correction of prediction frame
CN112241950A (en) * 2020-10-19 2021-01-19 福州大学 Detection method of tower crane crack image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Q. Yang et al.: "An Instance Segmentation Algorithm Based on Improved Mask R-CNN", 2020 Chinese Automation Congress (CAC), pages 4804-4809 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657174A (en) * 2021-07-21 2021-11-16 北京中科慧眼科技有限公司 Vehicle pseudo-3D information detection method and device and automatic driving system
CN113705387A (en) * 2021-08-13 2021-11-26 国网江苏省电力有限公司电力科学研究院 Method for detecting and tracking interferent for removing foreign matters on overhead line by laser
CN113705387B (en) * 2021-08-13 2023-11-17 国网江苏省电力有限公司电力科学研究院 Interference object detection and tracking method for removing overhead line foreign matters by laser
CN115063594A (en) * 2022-08-19 2022-09-16 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN115063594B (en) * 2022-08-19 2022-12-13 清驰(济南)智能科技有限公司 Feature extraction method and device based on automatic driving
CN116469014A (en) * 2023-01-10 2023-07-21 南京航空航天大学 Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN
CN116469014B (en) * 2023-01-10 2024-04-30 南京航空航天大学 Small sample satellite radar image sailboard identification and segmentation method based on optimized Mask R-CNN
CN116106899A (en) * 2023-04-14 2023-05-12 青岛杰瑞工控技术有限公司 Port channel small target identification method based on machine learning
CN116106899B (en) * 2023-04-14 2023-06-23 青岛杰瑞工控技术有限公司 Port channel small target identification method based on machine learning

Similar Documents

Publication Publication Date Title
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN109977812B (en) Vehicle-mounted video target detection method based on deep learning
CN109447018B (en) Road environment visual perception method based on improved Faster R-CNN
CN113111722A (en) Automatic driving target identification method based on improved Mask R-CNN
CN109033950B (en) Vehicle illegal parking detection method based on multi-feature fusion cascade depth model
CN108596055B (en) Airport target detection method of high-resolution remote sensing image under complex background
CN108305260B (en) Method, device and equipment for detecting angular points in image
CN111709416A (en) License plate positioning method, device and system and storage medium
CN111461039B (en) Landmark identification method based on multi-scale feature fusion
CN108804992B (en) Crowd counting method based on deep learning
CN114973002A (en) Improved YOLOv 5-based ear detection method
CN106778633B (en) Pedestrian identification method based on region segmentation
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN117094914B (en) Smart city road monitoring system based on computer vision
CN112861970B (en) Fine-grained image classification method based on feature fusion
KR101941043B1 (en) Method for Object Detection Using High-resolusion Aerial Image
CN113343985B (en) License plate recognition method and device
CN111582339A (en) Vehicle detection and identification method based on deep learning
CN104766065A (en) Robustness prospect detection method based on multi-view learning
CN112634368A (en) Method and device for generating space and OR graph model of scene target and electronic equipment
CN111274964B (en) Detection method for analyzing water surface pollutants based on visual saliency of unmanned aerial vehicle
CN112766273A (en) License plate recognition method
CN110176022B (en) Tunnel panoramic monitoring system and method based on video detection
CN111553337A (en) Hyperspectral multi-target detection method based on improved anchor frame
CN110751670B (en) Target tracking method based on fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination