CN117670938A

CN117670938A - Multi-target space-time tracking method based on super-treatment robot

Info

Publication number: CN117670938A
Application number: CN202410125476.1A
Authority: CN
Inventors: 罗江; 喻恺; 陈震; 陈昭彰; 吴传洁; 王新官; 陈鹏; 黄涛; 李欣; 于大龙
Original assignee: Jiangxi Fangxing Technology Co ltd
Current assignee: Jiangxi Fangxing Technology Co ltd
Priority date: 2024-01-30
Filing date: 2024-01-30
Publication date: 2024-03-08
Anticipated expiration: 2044-01-30

Abstract

The invention discloses a multi-target space-time tracking method based on a super-robot, which is characterized in that an integrated multi-level description network is constructed, vehicle targets with different scales are well described in static characteristics under a large-scale traffic scene, the static attributes such as the position, the category, the size and the like of each traffic target are efficiently obtained by combining a traffic scene target characteristic set and a key frame screening network, a model belonging to the traffic target is constructed, a certain traffic target is uniquely described, a CARLA simulation platform is adopted for a traffic monitoring camera, a large-scale traffic monitoring scene data set covering multiple scenes and the traffic targets is generated, complete target and scene characteristic information is provided for subsequent modeling, a multi-target tracking network is designed for video segments, a track model is established, a time-space diagram under a single camera is generated and updated, the fact that the camera cannot fully cover road sections is considered, and the traffic time-space diagram under multiple cameras is fused by utilizing the characteristic models of single vehicles and vehicle topologies under the covered road sections.

Description

Multi-target space-time tracking method based on super-treatment robot

Technical Field

The invention relates to the technical field of highways, in particular to a multi-target space-time tracking method based on a super-treatment robot.

Background

The overload control method is used for controlling the phenomenon of illegal overload of vehicles, the illegal overload transportation of the vehicles induces a large number of road traffic safety accidents, and the illegal overload not only has damage to roads, damage to transportation vehicles, damage to drivers and damage to normal competing commercial environments of transportation markets, but also has the defects of incomplete feature extraction, high false detection rate, poor robustness and the like when the traditional method mainly depends on the characteristics of manual design, has simple algorithm and small calculated amount and faces the factors such as camera shake, target shielding, illumination change, weather change and the like.

Therefore, a deep learning method is required to be introduced, the deep learning method has super-strong self-help learning capability and modeling capability of complex tasks, compared with the traditional method, the deep learning method has a large number of model parameters which can approach complex nonlinear relations, so that the model has stronger expression capability and higher accuracy of an algorithm, and meanwhile, the running speed of the algorithm is higher due to distributed storage and parallel computing technology, but the deep learning method still has the following problems.

(1) A camera full-automatic calibration method based on deep learning. The existing method is often focused on calibration under a general scene, and the specificity of the traffic scene is not considered, for example, the traffic scene comprises lane lines, vehicles and the like, and meanwhile, a data set for automatic calibration of a camera aiming at the traffic scene is still blank.

(2) Convolutional neural networks have the problems that convolutional features are sensitive to scale changes, interesting area pooling damages the feature structures of small objects, reverse propagation errors accumulate in the network training process and the like, and final feature extraction is incomplete due to the problems. Continuous target detection, feature extraction and modeling in traffic scenes still need to be further studied, and especially the detection precision of small targets is improved.

(3) Existing multi-target tracking methods are often divided into three phases, namely: the detection, feature modeling and association matching have serious dependency relationship in the three stages, and few methods are used for realizing the aim of integrated multi-target tracking by predicting the motion parameters of targets.

(4) The target detection network is often focused on positioning and identification, and most of the tasks cannot be completed at the same time, and the traffic static information description not only comprises the positions and classifications of the vehicle targets, but also comprises axles, vehicle colors and the like, so that the static information extraction based on the multi-task network is urgently needed to be studied.

Based on the reasons, the invention discloses a multi-target space-time tracking method based on a super-robot.

Disclosure of Invention

The invention aims to provide a multi-target space-time tracking method based on a super-robot.

The invention aims to solve the problems that: and obtaining more accurate camera calibration parameters. And then redesigning the deep neural network structure, improving the scale sensitivity of the deep neural network structure, and constructing a two-dimensional and three-dimensional detection integrated multi-level description network, so that the vehicle targets with different scales can be better described in static characteristics in a large-scale traffic scene. Meanwhile, combining a traffic scene target feature set and a key frame screening network, efficiently obtaining the static attributes of the position, the category, the size, the vehicle axle number, the license plate, the vehicle color and the like of each traffic target, constructing a feature model belonging to the traffic target, and describing a certain traffic target by using the feature model. Meanwhile, aiming at the traffic monitoring camera, a large-scale traffic monitoring scene data set covering various climatic conditions, traffic scenes and traffic targets is generated through the CARLA simulation platform, and complete target and scene characteristic information is provided for development of subsequent researches. And then, aiming at the video fragment, designing a multi-target tracking algorithm based on a graph network, establishing a track characteristic model, and generating and updating a space-time diagram under a single camera. And finally, considering that the cameras cannot fully cover road sections, and fusing traffic space diagrams under multiple cameras by utilizing the feature models of single vehicles and vehicle topologies under the covered road sections.

A multi-target space-time tracking method based on a super-robot comprises the following steps:

s1, establishing a traffic scene target feature set;

s2, screening key frames by a key frame screening network;

s3, constructing an integrated detection network, and integrally describing two-dimensional and three-dimensional attributes of traffic targets in multiple layers;

s4, modeling traffic target characteristics;

s5, constructing a multi-target tracking network to track the multi-target motion parameters;

s6, modeling the dynamic information characteristics of the vehicle.

Further, the traffic scene target feature set established in the step S1 includes an actual scene image feature set and a virtual traffic scene feature set, wherein the actual scene image is obtained by collecting high-definition images of actual road-bridge tunneling monitoring cameras at home and abroad, the virtual traffic scene feature set is simulated by a unmanned and automatic driving platform calla, and a virtual camera is placed in a virtual traffic scene, so that a virtual traffic scene video is generated by recording the virtual camera.

Further, the screening key frame network in S2 constructs a large-scale video frame screening image library, and sends the images to be screened and the previous frame image into the video frame screening image library to a convolution network to obtain a one-dimensional vector, the one-dimensional vector contains the features of two images, then the full-connection layer combines the image features and classifies the image features by softmax, and then a key frame result is output, and the video frame screening image library covers the two types of images by manual labeling: the system comprises useless frames and key frames, wherein the useless frames comprise a useless frame tag 0 and a key frame tag 1, the useless frames comprise frames with fewer traffic targets, frames without traffic targets and frames with road surface areas with lower occupation in images, and the key frames comprise frames with a large number of traffic targets in images and frames with road surface areas with higher occupation in images.

Further, the integrated detection network in S3 includes:

s31, inputting a traffic scene video frame, and generating a convolution feature map through a convolution layer;

s32, generating a group of suggestions based on the convolution feature map by using the regional proposal network RPN;

s33, predicting a boundary box possibly containing an object by using an area proposal network RPN, pooling by utilizing deconvolution and bilinear kernels, expanding small proposal area characteristics, avoiding the problem of insensitivity of a small target caused by representing the small traffic target by repeated values, and applying pooling operation to serially fusing pooling elements positioned at different convolution layers in a plurality of layers of a convolution neural network to fuse low-layer detailed information and high-layer semantic information;

s34, dividing the network into a plurality of branches according to the size of the proposal area, reducing the training burden of traffic targets with different scales and dimensions, and improving the detection precision of large objects and small objects;

s35, after combining the characteristics of the ROI area, predicting targets with different scales by adopting three prediction branches, wherein the three prediction branches respectively correspond to a small target, a medium target and a large target;

s36, adding a full-connection layer into the three prediction branches, adding a three-dimensional prediction branch into the multi-level prediction branch, which is responsible for detecting key points of a vehicle, adding a pyramid mechanism to adapt to a multi-scale vehicle target, and simultaneously introducing the multi-level prediction branches of the license plate, the vehicle color and the vehicle axle number of the detection target to obtain different vehicle static characteristics;

s37, fusing all prediction results of the branches to obtain a final detection result.

Further, the loss function of the integrated detection network is as follows:

；

wherein the overall loss functionConsists of four parts, namely->Loss of standard softmax, +.>For the proposal in the batch, propose +.when the proposal is positive>，/>To smooth L1 loss, in +.>In (I)>Regularization constants for category and regression, +.>The regression vector is a two-dimensional target rectangular frame regression vector obtained by integral network prediction and a two-dimensional target frame true regression vector; at->In (I)>Regularized constant for three-dimensional detection branch, +.>Respectively obtaining three-dimensional regression vectors and true three-dimensional regression vectors by integral network prediction; at->In (I)>Regularization parameters for template similarity +.>Respectively obtaining a template vector and a real template vector by integrated network prediction; at->In (I)>Regularization parameters for component detection, +.>And respectively predicting the obtained target component vector and the actual target component vector position for the integrated network.

Further, the modeling of the traffic target feature model in the step S4 is based on the target static attribute acquired in the step S4, a feature model belonging to a corresponding target is constructed, and the acquired target static attribute is stored in a unified information coding format as a binary format.

Further, the multi-target tracking network is constructed based on a deep motion modeling network DMM-Net, and the specific structure is as follows: inputting a time stampTo->Several frames of images in between, a video frame sequence using oneThe feature extractors are convolved, then 6 layers of features are selected from the rest feature extractors and respectively input into a motion information network, a classification network and a visualization network, and the loss functions of the motion information network, the classification network and the visualization network are respectively used->、/>And->The loss function of the entire network is represented by the following formula:

wherein N is the number of positive anchor pipes; alpha and beta are super parameters, and are used for manually adjusting the learning rate.

Further, in the step S6, the modeling of the dynamic information features of the vehicle constructs an information feature model based on the three-dimensional motion trail of the video vehicle and the timestamp information acquired by the multi-target tracking network, and the information features include a timestamp, an image position, a real position, an average speed, a category and a color.

The invention has the beneficial effects that: the invention utilizes a three-dimensional machine vision technology to extract wagon bottom information in the process of overload treatment, and based on a constructed two-dimensional and three-dimensional detection integrated multi-level description network of a traffic target, static attribute information such as the position, the category, the size, the number of vehicle axles, license plates, vehicle colors and the like of the target is obtained, a characteristic model belonging to the vehicle target is constructed, and the obtained static attribute of the target is uniformly coded; the method comprises the steps of constructing an information characteristic model by using three-dimensional motion trail and time stamp information of a vehicle target in a video;

the application innovation of multi-source data fusion is realized, and traffic volume information such as vehicle speed, vehicle type, flow, vehicle head distance, vehicle head time interval, vehicle following percentage and the like is analyzed and obtained on the basis of vehicle basic information acquisition. When the vehicles are queued to pass through the detection area, the car queue is inserted, no license plate or license plate identification error exists, the vehicles are backed up and the like, the acquired positions and the traffic flow states of the vehicles ensure that the license plate, weight, wheel axle, outline, photo, video and other data of the same vehicle are matched and summarized into a driving record, and the problems of multi-source data matching and vehicle queue information uploading are solved;

the whole process monitoring of truck weighing is realized, the information including the speed of the truck at the place of loading, the weight balance, the sliding edge, the abnormal acceleration, the dynamic vehicle separation and the like can be monitored and provided, the single truck weighing data strip is realized, the data and the weighing interference are avoided, and more complete evidence is provided for the super business.

Drawings

FIG. 1 is a schematic diagram of a key frame screening network architecture according to the present invention;

FIG. 2 is a schematic diagram of an integrated detection network architecture according to the present invention;

FIG. 3 is a schematic diagram of a modeling and encoding flow of a traffic target feature model according to the present invention;

FIG. 4 is a schematic diagram of a DMM-Net network architecture according to the present invention.

Detailed Description

The present invention will be further described more fully hereinafter, but the scope of the invention is not limited thereto.

s1, establishing a traffic scene target feature set;

s2, screening key frames by a key frame screening network;

s4, modeling traffic target characteristics;

s6, modeling the dynamic information characteristics of the vehicle.

Further, the large-scale target data sets presently disclosed, such as COCO, pascalVoc data sets, BIT vehicle data sets, which include numerous common item features, are not satisfactory for traffic scenarios and the present research problem. Because a large-scale traffic scene needs to be considered in the research problem, the shooting range of a camera is wider, a target on a road can generate severe deformation when the target is driven to the camera and driven from the camera, and meanwhile, in order to meet the diversity of the traffic scene and consider the traffic target condition under a complex traffic environment, a plurality of data sets facing the traffic scene need to be constructed, so that the traffic scene target feature set established in the S1 needs to comprise an actual scene image feature set and a virtual traffic scene feature set, wherein the actual scene image covers traffic target features with various scale changes and shape changes by collecting high-definition images of actual road bridge tunneling monitoring cameras at home and abroad, and considers the problem of insufficient light caused by severe weather conditions such as overcast and rainy days, and the like, the virtual traffic scene feature set generates virtual traffic scene videos by recording the virtual cameras in the virtual traffic scene through unmanned and automatic driving platform CARLA simulation, and the virtual traffic scene feature set can acquire a plurality of scene description information through videos: the position angle of the camera, the internal and external parameter matrix of the camera, the weather in the scene, the crowding degree of the vehicles in the scene and the like; numerous traffic objective description information may also be obtained: the position, speed, type, number and the like of the traffic targets collect 300 more traffic scenes in the simulated traffic scene target feature set, and 4000 more traffic videos comprise 2000 tens of thousands of video frames. The actual scene is complementary with the simulated scene target feature set in advantage, so that the real state of the traffic target in operation is considered, the diversity of the traffic scene is enriched, and complete target and scene feature information is provided for development of subsequent research.

Further, in traffic video sequences, there are often a large number of redundant video frames, and if each video frame is analyzed, the calculation speed is seriously affected. Therefore, for the original video sequence, firstly, a key frame screening network is sent to provide valuable video frames in the original video sequence, and subsequent analysis is performed, so that the processing efficiency is improved, as shown in fig. 1, in the step S2, the key frame screening network builds a large-scale video frame screening image library, and sends images to be screened and previous frame images in the video frame screening image library into a convolution network to obtain a one-dimensional vector, wherein the one-dimensional vector contains features of two images, then, the full-connection layer combines the image features and classifies the image features by softmax, and then, a key frame result is output, and the video frame screening image library covers the two types of images through manual labeling: the system comprises useless frames and key frames, wherein the useless frames comprise a useless frame tag 0 and a key frame tag 1, the useless frames comprise frames with fewer traffic targets, frames without traffic targets and frames with road surface areas with lower occupation in images, and the key frames comprise frames with a large number of traffic targets in images and frames with road surface areas with higher occupation in images.

Further, as shown in fig. 2, the integrated detection network in S3 includes:

s33, regional proposal network RPN, namely regional generation network prediction, possibly contains a boundary box of an object, and pools by utilizing deconvolution and bilinear kernels, so that small proposal area characteristics are enlarged, the problem of insensitivity of a small target caused by representing a small traffic target by a repeated value is avoided, and pooling operation is applied to serially fusing pooling elements positioned in different convolution layers in a plurality of layers of a convolution neural network with low-layer detailed information and high-layer semantic information;

s35, after combining the characteristics of the ROI region, namely the region of interest, for targets with different scales, predicting by adopting three prediction branches, wherein the three prediction branches respectively correspond to a small target, a medium target and a large target;

Further, the loss function of the integrated detection network is as follows:

；

wherein the overall loss functionConsists of four parts, namely->Loss of standard softmax, +.>For the proposal in the batch, propose +.when the proposal is positive>，/>To smooth L1 loss, in +.>In (I)>Regularization constants for category and regression, +.>The regression vector is a two-dimensional target rectangular frame regression vector obtained by integral network prediction and a two-dimensional target frame true regression vector; at->In (I)>Is three in threeRegularization constant of the dimension detection branch, +.>Respectively obtaining three-dimensional regression vectors and true three-dimensional regression vectors by integral network prediction; at->In (I)>Regularization parameters for template similarity +.>Respectively obtaining a template vector and a real template vector by integrated network prediction; at->In (I)>Regularization parameters for component detection, +.>And respectively predicting the obtained target component vector and the actual target component vector position for the integrated network.

Further, the modeling of the traffic target feature model in S4 is based on the target static attribute acquired in S4, a feature model belonging to the corresponding target is constructed, the acquired target static attribute is stored in a unified information coding format as a binary format, and a specific coding format is shown in fig. 3. As can be seen from fig. 3, the traffic target to be modeled is a red vehicle, the image position, the category, the size, the number of axles, the license plate and the color of the traffic target are obtained by using a two-dimensional and three-dimensional detection integrated multi-level description network of the traffic target, the information is converted by using an encoder and stored, the modeling of the traffic target feature model is completed, and when the specific traffic information of the target is needed, the specific information of the target can be recovered by using a decoder. The traffic target feature model can uniquely describe the static attribute of the target, so that the traffic target feature model can be used for uniquely judging the target in the multi-camera linkage traffic scene, and meanwhile, the target information recorded by the traffic target feature model can be used in various environments through encoding and decoding.

Further, as shown in fig. 4, the multi-target tracking network is constructed based on a deep motion modeling network DMM-Net, and the specific structure is as follows: inputting a time stampTo->The video frame sequence is convolved by a feature extractor, then 6 layers of features are selected from the rest feature extractors and respectively input into a motion information network, a classification network and a visualization network, and the loss functions of the motion information network, the classification network and the visualization network are respectively equal to +.>、And->The loss function of the entire network is represented by the following formula:

wherein N is the number of positive anchor pipes; alpha and beta are super parameters, are used for manually adjusting the learning rate, and realize the multi-target tracking task in the video by constructing the multi-target tracking network.

In the step S6, the vehicle dynamic information feature modeling is based on the three-dimensional motion trail and time stamp information of the video vehicle acquired by the multi-target tracking network, an information feature model is constructed, the information features comprise time stamps, image positions, real positions, average speeds, categories and colors, and the dynamic information of the vehicle in the video scene can be comprehensively and detailed expressed through the feature vectors.

The embodiments of the present invention are disclosed as preferred embodiments, but not limited thereto, and those skilled in the art will readily appreciate from the foregoing description that various extensions and modifications can be made without departing from the spirit of the present invention.

Claims

1. A multi-target space-time tracking method based on a super-robot is characterized by comprising the following steps:

s1, establishing a traffic scene target feature set;

s2, screening key frames by a key frame screening network;

s4, modeling traffic target characteristics;

s6, modeling the dynamic information characteristics of the vehicle.

2. The multi-objective space-time tracking method based on the super-robot treatment system according to claim 1, wherein the traffic scene objective feature set established in the step S1 comprises an actual scene image feature set and a virtual traffic scene feature set, wherein the actual scene image is obtained by collecting high-definition images of actual road-bridge tunnel monitoring cameras at home and abroad, the virtual traffic scene feature set is obtained by simulating a unmanned and automatic driving platform calla, and a virtual camera is placed in a virtual traffic scene, so that virtual traffic scene videos are generated by recording the virtual camera.

3. The multi-objective spatio-temporal tracking method based on super-robot according to claim 1, wherein: the screening key frame network in the S2 is used for constructing a large-scale video frame screening image library, sending images to be screened and the previous frame image into a convolution network to obtain a one-dimensional vector, wherein the one-dimensional vector comprises the characteristics of two images, combining the image characteristics through a full connection layer and classifying by using softmax, and outputting a key frame result, and the video frame screening image library covers the two types of images through manual labeling: the system comprises useless frames and key frames, wherein the useless frames comprise a useless frame tag 0 and a key frame tag 1, the useless frames comprise frames with fewer traffic targets, frames without traffic targets and frames with road surface areas with lower occupation in images, and the key frames comprise frames with a large number of traffic targets in images and frames with road surface areas with higher occupation in images.

4. The multi-objective spatio-temporal tracking method based on super robot as claimed in claim 1, wherein said integrated detection network in S3 comprises:

5. The multi-objective space-time tracking method based on super robot as claimed in claim 4, wherein the loss function of the integrated detection network is:

；

wherein the overall loss functionConsists of four parts, namely->Loss of standard softmax, +.>For the proposal in the batch, propose +.when the proposal is positive>，/>To smooth L1 loss, in +.>In (I)>Regularization constants for category and regression, +.>The regression vector is a two-dimensional target rectangular frame regression vector obtained by integral network prediction and a two-dimensional target frame true regression vector;at->In (I)>Regularized constant for three-dimensional detection branch, +.>Respectively obtaining three-dimensional regression vectors and true three-dimensional regression vectors by integral network prediction; at->In (I)>Regularization parameters for template similarity +.>Respectively obtaining a template vector and a real template vector by integrated network prediction; at->In (I)>Regularization parameters for component detection, +.>And respectively predicting the obtained target component vector and the actual target component vector position for the integrated network.

6. The multi-target space-time tracking method based on the super-robot treatment system according to claim 1, wherein the modeling of the traffic target feature model in the step S4 is based on the target static attribute acquired in the step S4, the feature model belonging to the corresponding target is constructed, and the acquired target static attribute is stored in a unified information coding format as a binary format.

7. The multi-target space-time tracking method based on the super-robot as claimed in claim 1, wherein the multi-target tracking network is constructed based on a deep motion modeling network DMM-Net, and the specific structure is as follows:

inputting a time stampTo->The video frame sequence is convolved by a feature extractor, then 6 layers of features are selected from the rest feature extractors and respectively input into a motion information network, a classification network and a visualization network, and the loss functions of the motion information network, the classification network and the visualization network are respectively equal to +.>、/>And->The loss function of the entire network is represented by the following formula:

；

8. The multi-target space-time tracking method based on the super-robot according to claim 1, wherein the modeling of the dynamic information features of the vehicle in S6 constructs an information feature model based on the three-dimensional motion track of the video vehicle and the timestamp information acquired by the multi-target tracking network, and the information features include a timestamp, an image position, a real position, an average speed, a category and a color.