CN113673444B - Intersection multi-view target detection method and system based on angular point pooling - Google Patents
- Publication number: CN113673444B (application CN202110971811.6A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a method and a system for detecting intersection multi-view targets based on angular point pooling, wherein the method comprises the following steps: preprocessing images of the intersection multi-view cameras collected in real time; inputting the preprocessed images of the multi-view camera into a pre-established and trained intersection multi-view target detection model, and outputting a target prediction result; the multi-view target detection model is used for extracting the features of the image of the multi-view camera after preprocessing, performing feature projection, feature fusion and corner pooling on the extracted features, predicting the target position through a ground plane rectangular feature map after corner pooling, performing single-view detection and result projection on the extracted features, correcting the target position through a single-view target position mapping map and outputting a target prediction result.
Description
Technical Field
The invention belongs to the field of target detection, and particularly relates to a corner pooling-based intersection multi-view target detection method and system.
Background
With the rapid development of unmanned driving and smart cities, the vehicle detection technology of a single sensor is relatively mature, however, in intersections with complex traffic conditions, factors such as high difficulty in dense detection caused by vehicle congestion, shielding problems caused by bulky vehicles, uncertainty of the single sensor and the like seriously restrict the accuracy of vehicle detection, and potential safety hazards also exist in the complex intersections. With the introduction of a multi-view detection method, the detection performance of vehicles at the intersection in crowded or sheltered scenes is remarkably improved, and the method has a great promotion effect on the safety of unmanned driving. However, the multi-view-based vehicle detection method is often accompanied by the fusion of multi-sensor data, and the integration of the multi-view data together to realize vehicle detection can be realized through multi-view result level fusion and multi-view feature level fusion, but they have the following problems respectively:
1. Multi-view result level fusion: the data of each view requires a separate computing unit, which inevitably incurs a large computational overhead. When the detection results of all views are projected together, errors of the perspective transformation and edge distortion during image stitching often make the results for targets in the view-overlap region inconsistent across views, which causes a "ghost" phenomenon in the vehicle detection results and brings great uncertainty to the decisions of unmanned driving.
2. Multi-view feature level fusion: after the features of the multi-view data are extracted, all subsequent computation is completed on a single computing unit to reduce computational redundancy. However, feature fusion only reduces the amount of computation; it does not substantially resolve the "ghost" phenomenon, and may even detect two ghosts as one larger target, which also interferes with the final decision.
Disclosure of Invention
The invention aims to overcome the above technical defects and provides an intersection multi-view target detection method based on corner pooling. In addition, because the corner information of the vehicle features is enhanced, the intersection multi-view vehicle detection based on corner pooling effectively improves the detection precision and the robustness of the model.
In order to achieve the above object, the present invention provides a method for detecting intersection multi-view objects based on angular point pooling, which comprises:
preprocessing images of the intersection multi-view cameras collected in real time;
inputting the preprocessed images of the multi-view camera into a pre-established and trained intersection multi-view target detection model, and outputting a target prediction result; the multi-view target detection model is used for extracting the features of the image of the multi-view camera after preprocessing, performing feature projection, feature fusion and corner pooling on the extracted features, predicting the target position through a ground plane rectangular feature map after corner pooling, performing single-view detection and result projection on the extracted features, correcting the target position through a single-view target position mapping map and outputting a target prediction result.
Further, the intersection multi-view target detection model comprises: the system comprises a feature extraction module, a multi-view feature projection module, a feature fusion module, a feature map corner pooling module, a single-view detection module and a prediction module;
the characteristic extraction module is used for extracting characteristics of the images of the cameras with the multiple visual angles to obtain characteristic graphs of the multiple visual angles;
the multi-view characteristic projection module is used for projecting the characteristic diagrams of a plurality of views onto a bird's-eye view plane based on perspective transformation by utilizing the calibration file of each camera to obtain a cascade projection characteristic diagram of the plurality of cameras;
the feature fusion module is used for fusing the cascade projection feature maps of the cameras with the camera coordinate feature map of the 2 channels and outputting a ground plane rectangular feature map of one (NxC +2) channel, wherein N is the number of the cameras, and C is the number of feature channels extracted from the image of each camera;
the feature map corner pooling module is used for performing corner pooling on the ground plane rectangular feature map and outputting the ground plane rectangular feature map after the corner pooling;
the single-view detection module is used for performing corner pooling on the feature map of each view to obtain a plurality of single-view target detection results, projecting the single-view target detection results onto a bird's-eye view plane and outputting a single-view target position mapping map;
and the prediction module is used for predicting the target position by using the ground plane rectangular characteristic map subjected to corner pooling, correcting the target position by using the single-view detection result of the single-view target position mapping map, and outputting a target prediction result.
Further, the feature extraction module uses a ResNet50 network, including: one 1x1 convolutional layer for dimensionality reduction, one 3x3 convolutional layer, and one 1x1 convolutional layer for recovery dimensions.
Further, the specific implementation process of the multi-view feature projection module is as follows:
projecting the feature map of each view onto a bird's eye view plane:
s · [u, v, 1]^T = A · [R | t] · [x, y, z, 1]^T
wherein s is a real number scale factor, u and v are the coordinates before projection, and x, y and z are the coordinates after projection; A is the 3×3 camera intrinsic parameter matrix; [R | t] is the 3×4 joint rotation-translation matrix, where R represents rotation and t represents translation; for each camera calibration file, the ground plane position is quantized into a grid of size H×W, where H and W are the length and the width of the finally generated bird's-eye view; the image is projected according to the perspective transformation onto the ground plane z = 0, and the ground plane positions outside the field of view are filled with zeros.
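The perspective transformation s · [u, v, 1]^T = A · [R | t] · [x, y, z, 1]^T can be sketched numerically; the intrinsic and extrinsic values below are hypothetical placeholders, not parameters from the patent:

```python
import numpy as np

# Hypothetical intrinsics A (focal length 500 px, principal point (320, 240))
# and extrinsics [R | t] (identity rotation, camera 5 units above the ground).
A = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.0], [0.0], [5.0]])])  # 3x4 [R | t]
P = A @ Rt  # 3x4 projection matrix

def ground_to_pixel(x, y):
    """Project a ground-plane point (x, y, z=0) to pixel coordinates (u, v)."""
    s_uv = P @ np.array([x, y, 0.0, 1.0])  # s * [u, v, 1]^T
    return s_uv[:2] / s_uv[2]              # divide out the scale factor s

u, v = ground_to_pixel(1.0, 2.0)
```

Projecting a feature map onto the bird's-eye view inverts this mapping on the plane z = 0 for every grid cell, filling cells outside the field of view with zeros.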
Further, the specific implementation process of the feature map corner pooling module is as follows:
copying 3 parts of the fused ground plane rectangular feature maps, and performing maximum pooling of all feature vectors of 4 identical ground plane rectangular feature maps to the left, the right, the upward and the downward respectively;
in the pooling process along a given direction, the first feature value at the edge of each feature vector is first taken as the current maximum; each subsequent feature value smaller than the current maximum is replaced by the maximum during pooling, and whenever a larger feature value is encountered it replaces the maximum, and pooling continues backwards with the new maximum until the feature vector is fully pooled in that direction;
adding the maximum pooling results of left pooling and upward pooling, wherein the added result is the upper left corner pooling;
adding the maximum pooling results of right pooling and downward pooling, wherein the added result is the lower right corner pooling;
and cascading the pooling results of the corner points at the upper left corner and the corner points at the lower right corner to obtain a ground plane rectangular feature map after corner pooling.
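The directional scan described above can be sketched for a single feature vector; this is an illustrative pure-Python sketch of the running-maximum procedure, not the patent's implementation:

```python
def directional_max_pool(vec):
    """Scan a feature vector from one edge, carrying a running maximum:
    values smaller than the current maximum are replaced by it, and a
    larger value becomes the new maximum for the rest of the scan."""
    out, current_max = [], float("-inf")
    for value in vec:
        current_max = max(current_max, value)
        out.append(current_max)
    return out

# Scanning [2, 1, 3, 0, 2] from the left edge gives [2, 2, 3, 3, 3];
# reversing the input (and the output) scans from the opposite edge.
pooled = directional_max_pool([2, 1, 3, 0, 2])
```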
Further, the single-view detection module comprises: a single-view feature map corner pooling unit and a single-view detection unit;
the single-view feature map corner pooling unit: used for respectively performing upper left corner pooling and lower right corner pooling on the feature maps of all the views and outputting the pooled feature maps to the single-view detection unit;
for each pooling vector, the angular point pooling mode is maximal pooling in a certain direction, adaptive attenuation optimization is adopted for the maximal pooling, and an attenuation formula is as follows:
wherein w is the attenuated feature value after corner pooling with adaptive attenuation, λ is the attenuation coefficient, step represents the distance from the current maximum feature value, and w0 is the current maximum feature value;
the single-view detection unit: used for respectively performing single-view target detection on the output results of the single-view feature map corner pooling unit, and projecting the multiple single-view target detection results onto the bird's-eye view according to the projection transformation formula to form a single-view target position mapping map.
Further, the method further comprises: the step of training the intersection multi-view target detection model specifically comprises the following steps:
establishing a data set for training a model; the data set includes: the system comprises a label file set, an image data set and a calibration file set, wherein the label file set comprises a plurality of json files, the image file set comprises a plurality of preprocessed RGB (red, green and blue) images, the json files and the RGB images are in one-to-one correspondence, and the calibration file set comprises an internal reference file, an external reference file and a calibration file relative to a ground plane of each intersection camera;
in the intersection multi-view target detection model, each corner pooling feature layer is provided with corner pooling results of a plurality of targets, in order to establish a relation between each corner pooling result of the targets in different corner pooling feature layers, the pooling results are grouped by using a Pull loss function, and the upper left corner and the lower right corner of each target are in a group; separating the angular points by using a Push loss function due to the independence of each characteristic layer;
the Pull loss function is as follows:
L_pull = (1/K) · Σ_{k=1}^{K} [ (e_tk − e_k)² + (e_bk − e_k)² ]
the Push loss function is as follows:
L_push = (1/(K(K−1))) · Σ_{k=1}^{K} Σ_{j=1, j≠k}^{K} max(0, Δ − |e_k − e_j|)
wherein e_tk and e_bk are the embedding vectors of the upper left corner and the lower right corner of the kth target, respectively, e_k is the mean of e_tk and e_bk, K is the number of targets, and Δ is set to 1;
in the angular point pooling of the single-view feature map, the results of the left upper angular point pooling and the right lower angular point pooling are used as a group of angular points, and the angular points are used as one-dimensional embedded vectors to be added into the training in the network training;
setting the encoder and decoder sizes, the batch size, the number of training epochs and the learning rate of each epoch for model training, inputting the data set into the intersection multi-view target detection model, and training the model to obtain the trained intersection multi-view target detection model.
The invention also provides an intersection multi-view target detection system based on angular point pooling, which comprises: an intersection multi-view target detection model, a data preprocessing module and a target detection module,
the data preprocessing module: used for preprocessing the images of the intersection multi-view cameras collected in real time;
the target detection module is used for inputting the preprocessed multi-view camera data into the intersection multi-view target detection model and outputting a target prediction result; the multi-view target detection model is used for extracting the features of the image of the multi-view camera after preprocessing, inputting the extracted features into one path for feature projection, feature fusion and corner pooling, predicting the target position through a ground plane rectangular feature map after corner pooling, inputting the extracted features into the other path for single-view detection and result projection, correcting the target position through a single-view target position mapping map and outputting a target prediction result.
Compared with the prior art, the invention has the advantages that:
1. the method does not need additional post-processing operation, and can accurately complete the multi-view target detection on the premise of ensuring the timeliness;
2. the detection method based on angular point pooling can greatly improve the accuracy of intersection multi-view target detection and better solve the 'ghost' phenomenon from the algorithm level;
3. aiming at the possible congestion or large number of target vehicles at the intersection, the invention improves corner pooling, and the activation-value-attenuation corner pooling mode improves the detection precision when the intersection is congested or contains too many target vehicles.
Drawings
In order to illustrate the invention more clearly, the drawings that are needed for the invention will be briefly described below, it being apparent that the drawings in the following description are some embodiments of the invention, for which other drawings may be derived by those skilled in the art without inventive effort.
FIG. 1 is a schematic diagram of corner pooling of the present invention, wherein the upper left corner pooling is shown;
FIG. 2 is a flow chart of the intersection multi-view target detection method based on corner pooling of the present invention;
FIG. 3 is a simulation diagram of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It is to be understood that the described embodiments are only a few embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Before describing the embodiments of the present invention, the related terms related to the embodiments of the present invention are first explained as follows:
a multi-view camera: a plurality of monocular cameras placed at the intersection and distributed on the road side; the total field angle of the multi-view cameras can cover the whole intersection.
Multi-view image: a color image acquired by a multi-view camera; the color image is a three-channel image.
Label: a label used for the supervised training of the target detection neural network, annotating the category and the position of each target in the multi-view image.
The embodiment 1 of the invention provides a corner pooling-based intersection multi-view target detection method, wherein a target is a vehicle, and the method comprises the following specific implementation steps:
step 1) establishing and training a multi-view target detection model of the intersection;
step 101) establishing a multi-view target detection model of the intersection;
the intersection multi-view target detection model comprises: the system comprises a feature extraction module, a multi-view feature projection module, a feature fusion module, a feature map corner pooling module, a single-view detection module and a prediction module;
the characteristic extraction module is used for extracting the characteristics of the multi-view image;
In use, considering the light weight and real-time requirements of intersection multi-view detection, a "bottleneck" structure is adopted for ResNet50: each pair of 3x3 convolution layers is replaced by 1x1+3x3+1x1 convolution layers, where the middle 3x3 convolution layer has its dimensions reduced by one 1x1 convolution layer beforehand and restored by another 1x1 convolution layer afterwards, so that accuracy is maintained while the amount of computation is reduced. The first 1x1 convolution reduces the 256-dimensional channels to 64 dimensions, and the final 1x1 convolution restores them. The number of parameters is thereby reduced, yielding a lighter intersection multi-view target detection model.
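The saving from the bottleneck design can be checked with a quick parameter count (bias terms ignored; a sketch of the channel arithmetic, not the patent's exact layer configuration):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (bias terms ignored)."""
    return k * k * c_in * c_out

# Two plain 3x3 convolutions at 256 channels:
plain = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduce 256 -> 64, 3x3 at 64 channels, 1x1 restore 64 -> 256:
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))
```

With these channel counts the bottleneck needs roughly 6% of the weights of the plain pair, which is the "maintain accuracy while reducing computation" trade-off described above.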
Multi-view feature projection module:
projecting the feature map of each view onto the bird's-eye view plane by using the calibration files of the multiple cameras and the perspective transformation principle; the transformation process is as follows:
s · [u, v, 1]^T = P_θ · [x, y, z, 1]^T,  P_θ = A · [R | t]
wherein s is a real number scale factor, u and v are the coordinates before projection, and x, y and z are the coordinates after projection; P_θ is a 3×4 perspective transformation matrix; A is a 3×3 intrinsic parameter matrix; [R | t] is a 3×4 joint rotation-translation matrix, i.e. the extrinsic parameter matrix in the extrinsic reference file, where R denotes rotation and t denotes translation. For each camera n ∈ {1, …, N} and its calibration file, the image is projected according to the perspective transformation onto the ground plane z = 0 through a custom sampling grid of shape [H, W]. Ground plane positions outside the field of view are filled with zeros. The feature maps of the N cameras are projected sequentially according to the perspective transformation formula.
A feature fusion module:
the ground plane location is quantized into a grid of size H W, where H and W specify the length and width of the final generated bird's-eye view. In addition, a 2-channel graph is used to specify the X-Y coordinates of the ground plane location. The projection feature maps of the N cameras output by the multi-view feature projection module are cascaded, and are added with the coordinate feature maps from the 2 channels to obtain a (NxC +2) channel ground plane rectangular feature map, which is also a bird's-eye view feature map at the intersection, wherein the shape of the feature map is [ H, W ], and C is the number of feature channels extracted from the image of each camera.
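The channel bookkeeping of this fusion step can be sketched as follows; the N, C, H, W values are illustrative, not the patent's configuration:

```python
import numpy as np

N, C, H, W = 4, 16, 120, 360  # illustrative camera / feature / grid sizes

# Projected feature maps of the N cameras, each with C channels on the
# H x W ground-plane grid (zeros stand in for real projected features).
projected = [np.zeros((C, H, W)) for _ in range(N)]

# 2-channel map holding the X-Y coordinates of each ground-plane cell.
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
coord_map = np.stack([xs, ys]).astype(np.float64)

# Cascade the N projected maps and append the coordinate channels,
# giving the (N*C + 2)-channel ground plane rectangular feature map.
ground_plane = np.concatenate(projected + [coord_map], axis=0)
```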
A feature corner pooling module:
in object detection, the corner points of the bounding box are usually located outside the object, in which case the corner points cannot be located according to local or edge features of the object. From the viewpoint of observing the target by eyes, in order to determine whether the upper-left corner point of the target detection frame exists at a certain pixel position, the topmost boundary of the target needs to be horizontally viewed to the right, and the leftmost boundary of the target needs to be vertically viewed to the bottom, wherein the observation mode is applied to the operation of fusing the feature maps after multi-view projection.
Three copies of the fused feature map are made, and the four identical maps are maximum-pooled towards the left, the right, upwards and downwards respectively. Specifically, in the pooling process along a given direction, the first feature value at the edge of each feature vector is first taken as the current maximum; each subsequent feature value smaller than the current maximum is replaced by the maximum, and whenever a larger feature value is encountered it replaces the maximum, and pooling continues backwards with the new maximum until the feature vector is fully pooled in that direction.
The left pooling and upward pooling results are added to obtain the upper left corner pooling, and the right pooling and downward pooling results are added to obtain the lower right corner pooling. Fig. 1 shows a schematic diagram of the upper left corner pooling process. The pooling results of the upper left corner and the lower right corner are cascaded as the output of the fused feature map after multi-view projection.
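Treating each directional pooling as a running maximum across the feature map, the full corner pooling step can be sketched with NumPy (an illustrative sketch, not the patent's implementation):

```python
import numpy as np

def corner_pool(feat):
    """feat: (C, H, W) feature map. Returns the cascaded upper-left and
    lower-right corner pooling, shape (2*C, H, W)."""
    # Running max from the right edge leftwards ("left pooling": each cell
    # sees the maximum of itself and everything to its right) ...
    left = np.flip(np.maximum.accumulate(np.flip(feat, -1), -1), -1)
    # ... and from the bottom edge upwards ("upward pooling").
    up = np.flip(np.maximum.accumulate(np.flip(feat, -2), -2), -2)
    top_left = left + up
    # Opposite scan directions give the lower-right corner pooling.
    right = np.maximum.accumulate(feat, -1)
    down = np.maximum.accumulate(feat, -2)
    bottom_right = right + down
    return np.concatenate([top_left, bottom_right], axis=0)

pooled = corner_pool(np.array([[[1.0, 2.0],
                                [3.0, 0.0]]]))  # one channel, 2x2 grid
```

On the 2x2 example the upper-left channel at (0, 0) is max-to-the-right (2) plus max-below (3), i.e. 5, matching the "look right and look down" reading above.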
The single-view detection module comprises: a single-view feature map corner pooling unit and a single-view detection unit;
single-view feature map point pooling unit:
the feature graph used for single-view detection comes from the shared features extracted from the multi-view features in the feature extraction module. In order to more accurately extract the vehicle characteristics under the intersection, angular point pooling is also adopted for the single-view characteristic map:
copying 3 parts of the feature map of each single visual angle, and performing maximum pooling of all feature vectors towards the left, the right, the up and the down respectively; and adding the pooling results of left pooling and upward pooling to obtain a pooling result of the corner points at the upper left corner, and adding the pooling results of right pooling and downward pooling to obtain a pooling result of the corner points at the lower right corner.
Furthermore, adaptive attenuation optimization is carried out on the corner pooling mode. In the angular point pooling, for each pooled vector, the angular point pooling mode is maximal pooling in a certain direction. The pooling mode improves the detection of the target corner points, but can cause feature confusion for a plurality of targets, so that the maximum pooling is optimized by self-adaptive attenuation, the maximum pooling value of the current target is prevented from causing interference to a gap between the two targets, and the attenuation formula is as follows:
wherein w is the attenuated feature value after corner pooling with adaptive attenuation, λ is the attenuation coefficient, step represents the distance from the current maximum feature value, and w0 is the current maximum feature value. Research shows that adaptive attenuation corner pooling not only maintains the multi-view vehicle detection performance at the intersection, but also effectively reduces false detections and detection errors in the gaps between vehicles.
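The attenuation formula itself is not reproduced in this text; one plausible reading, assumed here purely for illustration, is an exponential decay of the carried maximum with distance, w = w0 · e^(−λ·step). Under that assumption (and non-negative activations) the idea can be sketched as:

```python
import math

def attenuated_pool(vec, lam=0.5):
    """Directional running-max pooling in which the carried maximum w0
    decays with its distance `step` from where it was found, using the
    ASSUMED form w = w0 * exp(-lam * step). Expects non-negative inputs."""
    out, w0, step = [], 0.0, 0
    for value in vec:
        step += 1
        decayed = w0 * math.exp(-lam * step)
        if value >= decayed:       # a stronger activation takes over
            w0, step = value, 0
            decayed = value
        out.append(decayed)
    return out

pooled = attenuated_pool([4.0, 0.0, 0.0, 3.0])
```

The propagated maximum fades across the gap between two targets instead of flooding it at full strength, which is the interference the adaptive attenuation is said to prevent.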
The single-view detection unit: performs single-view detection on the output result of the single-view feature map corner pooling unit, and projects the single-view detection results onto the bird's-eye view according to the projection transformation formula as part of the supervision information for intersection multi-view target detection, thereby helping to better obtain the position distribution information of vehicles at the intersection and further improving detection performance.
The prediction module is used for predicting the vehicle position information at the intersection by using the corner-pooled ground plane rectangular feature map, correcting the position information by using the detection result of the single-view detection unit, and finally outputting accurate information of the vehicle positions projected from the multi-view images onto the bird's-eye view at the current intersection.
Step 102) training a multi-view vehicle detection model at the intersection;
establishing a data set for training a model; the data set includes: the system comprises a label file set, an image data set and a calibration file set, wherein the label file set comprises a plurality of json files, the image file set comprises a plurality of RGB images, the json files and the RGB images are in one-to-one correspondence, and the calibration file set comprises internal reference files and external reference files of each data acquisition camera and calibration files relative to a ground plane;
the three-channel RGB images are preprocessed and used as the input of the neural network model;
in the training process of the intersection multi-view target detection model, each corner pooling feature layer is provided with corner pooling results of a plurality of targets, in order to establish a relation between each corner pooling result of the targets in different corner pooling feature layers, a Pull loss function is used for grouping the pooling results, and the upper left corner and the lower right corner of each target form a group; separating the angular points by using a Push loss function due to the independence of each characteristic layer;
the Pull loss function is as follows:
L_pull = (1/K) · Σ_{k=1}^{K} [ (e_tk − e_k)² + (e_bk − e_k)² ]
the Push loss function is as follows:
L_push = (1/(K(K−1))) · Σ_{k=1}^{K} Σ_{j=1, j≠k}^{K} max(0, Δ − |e_k − e_j|)
wherein e_tk and e_bk are the embedding vectors of the upper left corner and the lower right corner of the kth target, respectively, e_k is the mean of e_tk and e_bk, K is the number of targets, and Δ is set to 1;
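Assuming the standard associative-embedding form of these losses (the patent's exact formulas are not reproduced in this text): pull draws each target's two corner embeddings toward their mean, while push separates the means of different targets by a margin Δ. A sketch with scalar embeddings and illustrative values:

```python
def pull_push_losses(top_left, bottom_right, delta=1.0):
    """top_left, bottom_right: per-target corner embeddings (scalars).
    Returns (pull, push) under the assumed associative-embedding form."""
    K = len(top_left)
    # e_k: mean of each target's two corner embeddings.
    centers = [(t + b) / 2 for t, b in zip(top_left, bottom_right)]
    # Pull: squared distance of each corner embedding from its center.
    pull = sum((t - e) ** 2 + (b - e) ** 2
               for t, b, e in zip(top_left, bottom_right, centers)) / K
    # Push: hinge penalty when two target centers are closer than delta.
    push = sum(max(0.0, delta - abs(centers[k] - centers[j]))
               for k in range(K) for j in range(K) if j != k)
    push /= K * (K - 1) if K > 1 else 1
    return pull, push

# Two targets whose corner embeddings agree within each target (small pull)
# and whose centers are more than delta apart (zero push).
pull, push = pull_push_losses([1.0, 3.0], [1.2, 2.8])
```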
different from the corner pooling of the multi-view fused feature map, in the corner pooling of the single-view feature map, the results of the left-upper corner pooling and the right-lower corner pooling are used as a group of corners, and the corners are used as one-dimensional embedded vectors to be added into the training in the network training.
Setting the encoder and decoder sizes, the batch size, the number of training epochs and the learning rate of each epoch for training the intersection multi-view target detection model, and training the model to obtain the trained intersection multi-view target detection model.
Step 2) preprocessing the multi-view camera raw data collected in real time, including whitening, denoising and other operations;
and 3) inputting the preprocessed multi-view camera data into a trained intersection multi-view target detection model, firstly extracting features, performing feature projection, feature fusion and corner pooling on the extracted features, outputting vehicle position information, performing single-view detection and result projection at the same time, outputting correction information, correcting the vehicle position information, and outputting an accurate vehicle position prediction result, as shown in FIG. 2.
Vehicle position prediction of multi-view camera data is performed using the method of the present invention, as shown in fig. 3.
Embodiment 2 of the invention provides an intersection multi-view target detection system based on corner pooling, which comprises an intersection multi-view target detection model, a data preprocessing module and a target detection module;
the data preprocessing module: used for preprocessing the images of the intersection multi-view cameras collected in real time;
the target detection module is used for inputting the preprocessed multi-view camera data into the intersection multi-view target detection model and outputting a target prediction result; the multi-view target detection model is used for extracting the features of the image of the multi-view camera after preprocessing, inputting the extracted features into one path for feature projection, feature fusion and corner pooling, predicting the target position through a ground plane rectangular feature map after corner pooling, inputting the extracted features into the other path for single-view detection and result projection, correcting the target position through a single-view target position mapping map and outputting a target prediction result.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (6)
1. An intersection multi-view target detection method based on corner pooling, the method comprising:
preprocessing images of the intersection multi-view cameras collected in real time;
inputting the preprocessed images of the multi-view camera into a pre-established and trained intersection multi-view target detection model, and outputting a target prediction result; the multi-view target detection model is used for extracting the features of the image of the pre-processed multi-view camera, performing feature projection, feature fusion and corner pooling on the extracted features, predicting the target position through a ground plane rectangular feature map after the corner pooling is performed, performing single-view detection and result projection on the extracted features, correcting the target position through a single-view target position mapping map and outputting a target prediction result;
the intersection multi-view target detection model comprises: the system comprises a feature extraction module, a multi-view feature projection module, a feature fusion module, a feature map corner pooling module, a single-view detection module and a prediction module;
the feature extraction module is used for extracting features from the images of the multi-view cameras to obtain feature maps for the multiple views;
the multi-view feature projection module is used for projecting the feature maps of the multiple views onto a bird's-eye view plane based on perspective transformation, using the calibration file of each camera, to obtain a cascaded projection feature map of the multiple cameras;
the feature fusion module is used for fusing the cascaded projection feature maps of the multiple cameras with a 2-channel camera coordinate feature map and outputting a ground plane rectangular feature map of (N×C+2) channels, where N is the number of cameras and C is the number of feature channels extracted from the image of each camera;
the feature map corner pooling module is used for performing corner pooling on the ground plane rectangular feature map and outputting the ground plane rectangular feature map after corner pooling;
the single-view detection module is used for performing corner pooling on the feature map of each view to obtain a plurality of single-view target detection results, projecting the single-view target detection results onto the bird's-eye view plane and outputting a single-view target position mapping map;
the prediction module is used for predicting the target position using the ground plane rectangular feature map after corner pooling, correcting the target position using the single-view detection results in the single-view target position mapping map, and outputting the target prediction result;
the specific implementation process of the feature map corner pooling module is as follows:
making 3 copies of the fused ground plane rectangular feature map, and performing maximum pooling of all feature vectors of the 4 identical ground plane rectangular feature maps toward the left, toward the right, upward and downward, respectively;
during pooling in a given direction, the first feature value at the edge of each feature vector is first taken as the maximum value; each subsequent feature value smaller than the current maximum is replaced by that maximum during pooling, and whenever a larger feature value is encountered it becomes the new maximum, pooling then continuing backward with the new maximum until the feature vector has been pooled in that direction;
adding the maximum pooling results of the leftward pooling and the upward pooling, the sum being the top-left corner pooling;
adding the maximum pooling results of the rightward pooling and the downward pooling, the sum being the bottom-right corner pooling;
and cascading the top-left corner pooling and bottom-right corner pooling results to obtain the ground plane rectangular feature map after corner pooling.
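The directional running-maximum described in the steps above matches the corner pooling operation popularized by CornerNet. A minimal NumPy sketch (function names are illustrative and not part of the claim) might look like:

```python
import numpy as np

def pool_toward(feat, direction):
    """Running (cumulative) max of a (C, H, W) feature map toward one edge,
    propagating the current maximum until a larger value replaces it."""
    out = feat.copy()
    h, w = out.shape[-2:]
    if direction == 'left':            # propagate maxima from the right edge leftward
        for j in range(w - 2, -1, -1):
            out[..., :, j] = np.maximum(out[..., :, j], out[..., :, j + 1])
    elif direction == 'right':         # propagate maxima from the left edge rightward
        for j in range(1, w):
            out[..., :, j] = np.maximum(out[..., :, j], out[..., :, j - 1])
    elif direction == 'up':            # propagate maxima from the bottom edge upward
        for i in range(h - 2, -1, -1):
            out[..., i, :] = np.maximum(out[..., i, :], out[..., i + 1, :])
    elif direction == 'down':          # propagate maxima from the top edge downward
        for i in range(1, h):
            out[..., i, :] = np.maximum(out[..., i, :], out[..., i - 1, :])
    return out

def corner_pool(feat):
    """Top-left = left + up pooling; bottom-right = right + down pooling;
    the two results are cascaded along the channel axis."""
    top_left = pool_toward(feat, 'left') + pool_toward(feat, 'up')
    bottom_right = pool_toward(feat, 'right') + pool_toward(feat, 'down')
    return np.concatenate([top_left, bottom_right], axis=0)
```

For an input of shape (C, H, W) the cascaded output has shape (2C, H, W), matching the claim's concatenation of the two corner pooling results.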
2. The intersection multi-view target detection method based on corner pooling of claim 1, wherein the feature extraction module uses a ResNet50 network whose bottleneck block comprises: a 1×1 convolutional layer for dimensionality reduction, a 3×3 convolutional layer, and a 1×1 convolutional layer for restoring the dimensions.
3. The intersection multi-view target detection method based on corner pooling of claim 1, wherein the specific implementation process of the multi-view feature projection module is as follows:
projecting the feature map of each view onto the bird's-eye view plane according to the perspective transformation:
s·[u, v, 1]^T = A·[R | t]·[x, y, z, 1]^T
wherein s is a real-valued scale factor, u and v are the coordinates before projection, and x, y and z are the coordinates after projection; A is the 3×3 camera intrinsic parameter matrix; [R | t] is the 3×4 joint rotation-translation matrix, where R denotes rotation and t denotes translation; for each camera calibration file, the ground plane position is quantized into a grid of size H×W, where H and W are the length and width of the finally generated bird's-eye view; the image is projected onto the ground plane z = 0 according to the perspective transformation, and ground plane positions outside the field of view are filled with zeros.
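Under the z = 0 ground-plane assumption stated in the claim, the 3×4 matrix A·[R|t] collapses to a 3×3 homography over (x, y), so the projection can be sketched as below. Function names and the nearest-neighbor sampling are illustrative assumptions; a production system would typically use an optimized warp such as OpenCV's:

```python
import numpy as np

def ground_plane_homography(A, R, t):
    """3x3 homography mapping ground-plane coords (x, y, 1) to pixels (u, v, 1).
    With z = 0, s*[u,v,1]^T = A [R|t] [x,y,0,1]^T keeps only the first two
    columns of R plus the translation t."""
    return A @ np.column_stack([R[:, 0], R[:, 1], t])

def project_features(feat, A, R, t, H, W):
    """Warp an image-plane feature map (C, h, w) onto an H x W ground grid;
    positions outside the camera field of view are left filled with zeros."""
    Hmat = ground_plane_homography(A, R, t)
    C, h, w = feat.shape
    out = np.zeros((C, H, W), dtype=feat.dtype)
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing='ij')
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # ground coords
    uvw = Hmat @ pts
    u = uvw[0] / uvw[2]
    v = uvw[1] / uvw[2]
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (uvw[2] > 0)
    out[:, ys.ravel()[valid], xs.ravel()[valid]] = \
        feat[:, v[valid].astype(int), u[valid].astype(int)]
    return out
```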
4. The intersection multi-view target detection method based on corner pooling of claim 1, wherein the single-view detection module comprises: a single-view feature map corner pooling unit and a single-view detection unit;
the single-view feature map corner pooling unit is used for performing top-left corner pooling and bottom-right corner pooling on the feature map of each view, respectively, and outputting the results to the single-view detection unit;
for each pooling vector, the corner pooling is a maximum pooling in a given direction, and adaptive attenuation optimization is applied to the maximum pooling; the attenuation formula is:
w = w0 · λ^step
wherein w is the attenuated feature value after corner pooling, λ is the attenuation coefficient, step denotes the distance from the position of the current maximum feature value, and w0 is the current maximum feature value;
the single-view detection unit is used for performing single-view target detection on the output results of the single-view feature map corner pooling unit, respectively, and projecting the plurality of single-view target detection results onto the bird's-eye view according to the projection transformation formula to form the single-view target position mapping map.
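The attenuation formula itself is not reproduced in this text (the formula image was dropped), so the sketch below assumes a geometric decay w = w0·λ^step over the variables the claim lists; the function name and the choice to compare new values against the attenuated maximum are likewise assumptions:

```python
import numpy as np

def attenuated_pool_1d(vec, lam=0.9):
    """Running max along a 1-D pooling vector with adaptive attenuation.

    Each position receives the current maximum w0 decayed by lam per step of
    distance from where that maximum was found (assumed: w = w0 * lam**step);
    a value exceeding the attenuated maximum becomes the new maximum.
    """
    out = np.empty_like(vec, dtype=float)
    w0, pos = vec[0], 0            # edge value starts as the maximum
    out[0] = w0
    for i in range(1, len(vec)):
        if vec[i] >= w0 * lam ** (i - pos):
            w0, pos = vec[i], i    # a larger feature value replaces the maximum
        out[i] = w0 * lam ** (i - pos)
    return out
```

Compared with plain running-max pooling, the attenuation limits how far a single strong response can dominate the vector.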
5. The intersection multi-view target detection method based on corner pooling of claim 1, further comprising: the step of training the intersection multi-view target detection model specifically comprises the following steps:
establishing a data set for training the model; the data set includes: a label file set, an image data set and a calibration file set, wherein the label file set comprises a plurality of json files, the image data set comprises a plurality of preprocessed RGB images in one-to-one correspondence with the json files, and the calibration file set comprises, for each intersection camera, an intrinsic parameter file, an extrinsic parameter file and a calibration file relative to the ground plane;
in the intersection multi-view target detection model, each corner pooling feature layer contains the corner pooling results of a plurality of targets; in order to associate the corner pooling results of the same target across different corner pooling feature layers, the pooling results are grouped using a Pull loss function, with the top-left corner and bottom-right corner of each target forming one group; owing to the independence of each feature layer, corners of different targets are separated using a Push loss function;
pull penalty function LpullThe following were used:
push loss function LpushThe following were used:
wherein the content of the first and second substances,andas embedding vectors for the top left corner and bottom right corner of the kth target, respectively, ekIs thatAndΔ is 1, which corresponds to the offset loss;
in the corner pooling of the single-view feature maps, the top-left corner pooling result and the bottom-right corner pooling result are taken as one group of corner points, which are added to the network training as one-dimensional embedding vectors;
setting the encoder and decoder sizes, the batch size, the number of training epochs and the learning rate of each epoch for model training, inputting the data set into the intersection multi-view target detection model, and training the model to obtain the trained intersection multi-view target detection model.
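The grouping described in the training step follows the associative-embedding form of pull/push losses; a minimal sketch with scalar (one-dimensional) embeddings, as the claim specifies, could be (variable names are illustrative):

```python
import numpy as np

def pull_push_loss(e_tl, e_br, delta=1.0):
    """Pull corners of the same target together; push different targets apart.

    e_tl, e_br: arrays of shape (K,) holding the 1-D embeddings of the
    top-left and bottom-right corners of the K targets.
    """
    K = len(e_tl)
    e_k = (e_tl + e_br) / 2.0                   # per-target mean embedding
    pull = np.mean((e_tl - e_k) ** 2 + (e_br - e_k) ** 2)
    diff = np.abs(e_k[:, None] - e_k[None, :])  # pairwise embedding distances
    hinge = np.maximum(0.0, delta - diff)       # penalize targets closer than delta
    np.fill_diagonal(hinge, 0.0)                # exclude the j == k terms
    push = hinge.sum() / (K * (K - 1)) if K > 1 else 0.0
    return pull, push
```

Minimizing the pull term drives each corner pair toward its mean, while the push term (a hinge with margin Δ) keeps distinct targets at least Δ apart in embedding space.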
6. An intersection multi-view target detection system based on corner pooling, the system comprising: an intersection multi-view target detection model, a data preprocessing module and a target detection module,
the data preprocessing module is used for preprocessing the images of the intersection multi-view cameras collected in real time and outputting the preprocessed images to the target detection module;
the target detection module is used for inputting the preprocessed multi-view camera data into the intersection multi-view target detection model and outputting a target prediction result; the multi-view target detection model is used for extracting features from the preprocessed images of the multi-view cameras, feeding the extracted features into one branch for feature projection, feature fusion and corner pooling to predict the target position through the ground plane rectangular feature map after corner pooling, feeding the extracted features into another branch for single-view detection and result projection, correcting the target position through the single-view target position mapping map, and outputting the target prediction result;
the intersection multi-view target detection model comprises: the system comprises a feature extraction module, a multi-view feature projection module, a feature fusion module, a feature map corner pooling module, a single-view detection module and a prediction module;
the feature extraction module is used for extracting features from the images of the multi-view cameras to obtain feature maps for the multiple views;
the multi-view feature projection module is used for projecting the feature maps of the multiple views onto a bird's-eye view plane based on perspective transformation, using the calibration file of each camera, to obtain a cascaded projection feature map of the multiple cameras;
the feature fusion module is used for fusing the cascaded projection feature maps of the multiple cameras with a 2-channel camera coordinate feature map and outputting a ground plane rectangular feature map of (N×C+2) channels, where N is the number of cameras and C is the number of feature channels extracted from the image of each camera;
the feature map corner pooling module is used for performing corner pooling on the ground plane rectangular feature map and outputting the ground plane rectangular feature map after corner pooling;
the single-view detection module is used for performing corner pooling on the feature map of each view to obtain a plurality of single-view target detection results, projecting the single-view target detection results onto the bird's-eye view plane and outputting a single-view target position mapping map;
the prediction module is used for predicting the target position using the ground plane rectangular feature map after corner pooling, correcting the target position using the single-view detection results in the single-view target position mapping map, and outputting the target prediction result;
the specific implementation process of the feature map corner pooling module is as follows:
making 3 copies of the fused ground plane rectangular feature map, and performing maximum pooling of all feature vectors of the 4 identical ground plane rectangular feature maps toward the left, toward the right, upward and downward, respectively;
during pooling in a given direction, the first feature value at the edge of each feature vector is first taken as the maximum value; each subsequent feature value smaller than the current maximum is replaced by that maximum during pooling, and whenever a larger feature value is encountered it becomes the new maximum, pooling then continuing backward with the new maximum until the feature vector has been pooled in that direction;
adding the maximum pooling results of the leftward pooling and the upward pooling, the sum being the top-left corner pooling;
adding the maximum pooling results of the rightward pooling and the downward pooling, the sum being the bottom-right corner pooling;
and cascading the top-left corner pooling and bottom-right corner pooling results to obtain the ground plane rectangular feature map after corner pooling.
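The fusion step shared by the method and system claims above — concatenating the N projected C-channel feature maps with a 2-channel coordinate map into an (N×C+2)-channel ground plane feature map — can be sketched as follows (the function name and the normalized [−1, 1] coordinate convention are illustrative assumptions):

```python
import numpy as np

def fuse_projected_features(projected, H, W):
    """Fuse N projected feature maps with a 2-channel camera coordinate map.

    projected: list of N arrays of shape (C, H, W) on the ground-plane grid.
    Returns an (N*C + 2, H, W) ground plane rectangular feature map.
    """
    # 2-channel coordinate map: x and y position of every ground-plane cell
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W),
                         indexing='ij')
    coord = np.stack([xs, ys]).astype(projected[0].dtype)  # (2, H, W)
    return np.concatenate(projected + [coord], axis=0)     # channel cascade
```

Appending explicit coordinates lets subsequent convolutions reason about absolute ground-plane position, which plain convolution is otherwise translation-invariant to.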
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971811.6A CN113673444B (en) | 2021-08-19 | 2021-08-19 | Intersection multi-view target detection method and system based on angular point pooling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110971811.6A CN113673444B (en) | 2021-08-19 | 2021-08-19 | Intersection multi-view target detection method and system based on angular point pooling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113673444A CN113673444A (en) | 2021-11-19 |
CN113673444B true CN113673444B (en) | 2022-03-11 |
Family
ID=78545259
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110971811.6A Active CN113673444B (en) | 2021-08-19 | 2021-08-19 | Intersection multi-view target detection method and system based on angular point pooling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113673444B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114898585B (en) * | 2022-04-20 | 2023-04-14 | 清华大学 | Intersection multi-view-angle-based vehicle track prediction planning method and system |
CN115578702B (en) * | 2022-09-26 | 2023-12-05 | 北京百度网讯科技有限公司 | Road element extraction method and device, electronic equipment, storage medium and vehicle |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177247A (en) * | 2013-04-09 | 2013-06-26 | 天津大学 | Target detection method fused with multi-angle information |
CN111429514A (en) * | 2020-03-11 | 2020-07-17 | 浙江大学 | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103729620B (en) * | 2013-12-12 | 2017-11-03 | 北京大学 | A kind of multi-view pedestrian detection method based on multi-view Bayesian network |
US10540545B2 (en) * | 2017-11-22 | 2020-01-21 | Intel Corporation | Age classification of humans based on image depth and human pose |
US10452959B1 (en) * | 2018-07-20 | 2019-10-22 | Synapse Technology Corporation | Multi-perspective detection of objects |
CN111222387B (en) * | 2018-11-27 | 2023-03-03 | 北京嘀嘀无限科技发展有限公司 | System and method for object detection |
CN110363815A (en) * | 2019-05-05 | 2019-10-22 | 东南大学 | The robot that Case-based Reasoning is divided under a kind of haplopia angle point cloud grabs detection method |
CN110084222B (en) * | 2019-05-08 | 2022-10-21 | 大连海事大学 | Vehicle detection method based on multi-target angular point pooling neural network |
CN110246141B (en) * | 2019-06-13 | 2022-10-21 | 大连海事大学 | Vehicle image segmentation method based on joint corner pooling under complex traffic scene |
CN111523553B (en) * | 2020-04-03 | 2023-04-18 | 中国计量大学 | Central point network multi-target detection method based on similarity matrix |
CN112329662A (en) * | 2020-11-10 | 2021-02-05 | 西北工业大学 | Multi-view saliency estimation method based on unsupervised learning |
CN112365581B (en) * | 2020-11-17 | 2024-04-09 | 北京工业大学 | Single-view and multi-view three-dimensional reconstruction method and device based on RGB data |
CN112488066A (en) * | 2020-12-18 | 2021-03-12 | 航天时代飞鸿技术有限公司 | Real-time target detection method under unmanned aerial vehicle multi-machine cooperative reconnaissance |
CN112581503B (en) * | 2020-12-25 | 2022-11-11 | 清华大学 | Multi-target detection and tracking method under multiple visual angles |
CN112966736B (en) * | 2021-03-03 | 2022-11-11 | 北京航空航天大学 | Vehicle re-identification method based on multi-view matching and local feature fusion |
CN113096058B (en) * | 2021-04-23 | 2022-04-12 | 哈尔滨工业大学 | Spatial target multi-source data parametric simulation and MixCenterNet fusion detection method |
CN113673425B (en) * | 2021-08-19 | 2022-03-15 | 清华大学 | Multi-view target detection method and system based on Transformer |
- 2021-08-19: CN application CN202110971811.6A, patent CN113673444B, status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103177247A (en) * | 2013-04-09 | 2013-06-26 | 天津大学 | Target detection method fused with multi-angle information |
CN111429514A (en) * | 2020-03-11 | 2020-07-17 | 浙江大学 | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds |
Also Published As
Publication number | Publication date |
---|---|
CN113673444A (en) | 2021-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111583337B (en) | Omnibearing obstacle detection method based on multi-sensor fusion | |
CN109740465B (en) | Lane line detection algorithm based on example segmentation neural network framework | |
CN109948661B (en) | 3D vehicle detection method based on multi-sensor fusion | |
CN109034018B (en) | Low-altitude small unmanned aerial vehicle obstacle sensing method based on binocular vision | |
CN115439424B (en) | Intelligent detection method for aerial video images of unmanned aerial vehicle | |
CN108694386B (en) | Lane line detection method based on parallel convolution neural network | |
CN104848851B (en) | Intelligent Mobile Robot and its method based on Fusion composition | |
WO2022141910A1 (en) | Vehicle-road laser radar point cloud dynamic segmentation and fusion method based on driving safety risk field | |
CN111079556A (en) | Multi-temporal unmanned aerial vehicle video image change area detection and classification method | |
CN113673444B (en) | Intersection multi-view target detection method and system based on angular point pooling | |
CN108648194B (en) | Three-dimensional target identification segmentation and pose measurement method and device based on CAD model | |
CN113158768B (en) | Intelligent vehicle lane line detection method based on ResNeSt and self-attention distillation | |
CN111914795A (en) | Method for detecting rotating target in aerial image | |
CN111401150A (en) | Multi-lane line detection method based on example segmentation and adaptive transformation algorithm | |
CN113129449B (en) | Vehicle pavement feature recognition and three-dimensional reconstruction method based on binocular vision | |
CN110009675A (en) | Generate method, apparatus, medium and the equipment of disparity map | |
CN114140672A (en) | Target detection network system and method applied to multi-sensor data fusion in rainy and snowy weather scene | |
CN115019043A (en) | Image point cloud fusion three-dimensional target detection method based on cross attention mechanism | |
CN114898353B (en) | License plate recognition method based on video sequence image characteristics and information | |
CN116434088A (en) | Lane line detection and lane auxiliary keeping method based on unmanned aerial vehicle aerial image | |
CN114372919B (en) | Method and system for splicing panoramic all-around images of double-trailer train | |
CN114445442A (en) | Multispectral image semantic segmentation method based on asymmetric cross fusion | |
CN110415299B (en) | Vehicle position estimation method based on set guideboard under motion constraint | |
CN108460348A (en) | Road target detection method based on threedimensional model | |
CN117115690A (en) | Unmanned aerial vehicle traffic target detection method and system based on deep learning and shallow feature enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||