CN114550023A - Traffic target static information extraction device - Google Patents

Traffic target static information extraction device

Info

Publication number
CN114550023A
Authority
CN
China
Prior art keywords
traffic
target
static information
dimensional
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111675958.7A
Other languages
Chinese (zh)
Inventor
闵泉
郭志杰
王恩师
肖嵩松
杨涛
邓文慧
刘卓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Cccc Traffic Engineering Co ltd
Original Assignee
Wuhan Cccc Traffic Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Cccc Traffic Engineering Co ltd filed Critical Wuhan Cccc Traffic Engineering Co ltd
Priority to CN202111675958.7A priority Critical patent/CN114550023A/en
Publication of CN114550023A publication Critical patent/CN114550023A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to a traffic target static information extraction device, comprising: an image acquisition module for acquiring video images of traffic targets in a traffic scene; a static information extraction module for inputting key frames of the video images into a trained detection network to obtain a plurality of items of static information of the traffic target, wherein the detection network comprises a feature extraction unit for extracting features from the key frame to obtain a plurality of ROI regions, a combination unit for combining the ROI regions to obtain combined features, and a multi-branch prediction unit for analyzing the combined features to obtain the static information of the traffic target; and a storage module for storing the static information. The invention can acquire rich static information of the traffic target and complete the task of describing the traffic target at every level of vehicle attributes.

Description

Traffic target static information extraction device
Technical Field
The invention relates to the technical field of traffic target information extraction, in particular to a traffic target static information extraction device.
Background
The key to vehicle static attribute description is to detect the target and, from the detection result, acquire its position and category information from the two-dimensional image, providing a data basis for establishing the feature model. Existing target detection methods fall mainly into two categories: traditional methods and deep learning methods. Traditional methods rely on manually designed features; their algorithms are simple and computationally light, but under the influence of factors such as camera shake, target occlusion, illumination change and weather change they generally suffer from incomplete feature extraction, high false detection rates and poor robustness. With deep learning technology, target detection algorithms have made significant breakthroughs. Deep learning methods can be divided into two framework types according to the algorithm process: two-stage and one-stage. The representative two-stage algorithm is Faster R-CNN, which introduces an RPN network to replace the original sliding-window mechanism, combines anchors to coarsely locate the target region, and then refines the target position through a bounding-box regression and classification network, greatly improving performance. The representative one-stage algorithm is SSD, which combines the regression idea with the anchors of Faster R-CNN and performs regression and classification on the multi-scale feature maps at each position after convolving the original image, improving the localization precision of the target region while maintaining detection speed and adapting better to small targets.
To avoid presetting anchors, CenterNet treats bounding-box detection as keypoint detection, using the center point together with the upper-left and lower-right corner points as keypoints, fully exploiting the internal features of the target and improving algorithm accuracy. Current deep learning target detection algorithms are almost all built on these two types of frameworks.
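The keypoint-based idea behind CenterNet can be illustrated with a toy heatmap decoder: object locations are read off as local maxima of a predicted keypoint heatmap rather than from preset anchors. The function below is a simplified sketch of that idea in Python/NumPy; it is not CenterNet's actual decoding, which additionally regresses offsets and box sizes.

```python
import numpy as np

def heatmap_peaks(heatmap, thresh=0.5):
    """Toy keypoint decoding: return (y, x, score) for every cell that
    is above `thresh` and is a local maximum in its 3x3 neighbourhood.
    Simplified illustration of anchor-free, keypoint-style detection."""
    h, w = heatmap.shape
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y, x]
            if v < thresh:
                continue
            # keep only local maxima in a 3x3 neighbourhood
            nb = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if v >= nb.max():
                peaks.append((y, x, float(v)))
    return peaks
```

In a full detector, center-point peaks and corner-point peaks found this way would be paired up to recover bounding boxes.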
For the multi-task recognition problem, the tasks to be completed include license plate extraction and recognition, key frame grabbing, and target detection and tracking. Most existing methods adopt a deep learning network, but each such network can only solve a single task; multiple tasks cannot be solved through one network. How to design a multi-task deep network that directly solves license plate extraction and recognition, key frame grabbing, and target detection and tracking is therefore an urgent problem to be solved.
Disclosure of Invention
The invention aims to provide a traffic target static information extraction device that can acquire rich traffic target static information and complete the task of describing traffic targets at every level of vehicle attributes.
The technical scheme adopted by the invention for solving the technical problems is as follows: provided is a traffic target static information extraction device including:
the image acquisition module is used for acquiring video images of traffic targets in a traffic scene;
the static information extraction module is used for inputting the key frames of the video images into the trained detection network to obtain a plurality of items of static information of the traffic target; wherein the detection network comprises: a feature extraction unit for extracting features from the key frame to obtain a plurality of ROI regions; a combination unit for combining the ROI regions to obtain combined features; and a multi-branch prediction unit for analyzing the combined features to obtain the static information of the traffic target;
and the storage module is used for storing the static information.
The traffic scene target feature set used by the detection network during training comprises an actual scene target feature set and a simulated scene target feature set; the actual scene target feature set consists of traffic scene video collected from actual surveillance video, and the simulated scene target feature set consists of virtual traffic scene video generated by placing a virtual camera in a virtual traffic scene and recording with it.
The device for extracting the static information of the traffic target further comprises a key frame extraction module, wherein the key frame extraction module is used for extracting valuable video frames from the video images to serve as key frames.
The key frame extraction module comprises a convolution network layer, a full connection layer and a classification layer which are sequentially connected, wherein the input of the convolution network layer is a video frame to be screened and a previous frame image of the video frame, and the output of the convolution network layer is a one-dimensional feature vector; the full connection layer is used for combining the one-dimensional feature vectors; and the classification layer is used for analyzing the combined one-dimensional characteristic vector to finish the classification of the video frames to be screened.
The feature extraction unit performs convolution processing on an input key frame to generate a set of convolution feature maps, then obtains a set of proposals based on the convolution feature maps using a region proposal network, and obtains a plurality of ROI regions according to the sizes of the proposals; deconvolution with a bilinear kernel is used for pooling in the region proposal network.
The multi-branch prediction unit comprises a fully connected layer, behind which a three-dimensional prediction branch and a multi-level prediction branch are arranged; the three-dimensional prediction branch is used for detecting key points of the vehicle, with a pyramid mechanism added to adapt to multi-scale vehicle targets; the multi-level prediction branch is used for detecting the target's license plate, vehicle color and number of axles to obtain different static characteristics of the vehicle.
The loss function of the detection network is:

$$\mathrm{loss}_{total} = L_1 + L_2 + L_3 + L_4$$
$$L_1 = \lambda_{cls}\,P(c,c^*) + \lambda_{reg}\,[c^*=1]\,R(d^*-d)$$
$$L_2 = \lambda_{3d}\,[c^*=1]\,R(v^*-v)$$
$$L_3 = \lambda_{sim}\,[c^*=1]\,R(t^*-t)$$
$$L_4 = \lambda_{part}\,[c^*=1]\,R(s^*-s)$$

wherein loss_total is the overall loss function and L_1, L_2, L_3, L_4 are its four parts; P is the standard softmax loss; c^* is the label of a proposal in the batch, with c^* = 1 for a positive example; R is the smooth L1 loss; λ_cls and λ_reg are the regularization constants for classification and regression, and d^*, d are the two-dimensional target rectangular-frame regression vector predicted by the detection network and the true two-dimensional target regression vector, respectively; λ_3d is the regularization constant of the three-dimensional prediction branch, and v^*, v are the three-dimensional regression vector predicted by the detection network and the true three-dimensional regression vector, respectively; λ_sim is the regularization parameter of template similarity, and t^*, t are the template vector predicted by the detection network and the true template vector, respectively; λ_part is the regularization parameter of component detection, and s^*, s are the target component vector predicted by the detection network and the true target component vector, respectively.
The storage module also encodes the static information before storing it.
Advantageous effects
Due to the adoption of the above technical scheme, compared with the prior art, the invention has the following advantages and positive effects: through the designed integrated multi-task deep detection network, the detection network can acquire rich static information of the traffic target and complete the task of describing the traffic target at every level of vehicle attributes. To improve the calculation speed of the whole network, the invention also judges in advance whether a video frame is a key frame through the key frame screening network. The invention further encodes the extracted target information so that it can be used in various environments, making the implementation more flexible.
Drawings
FIG. 1 is a schematic structural diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of a key frame extraction module according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a detection network according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of memory module encoding according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
An embodiment of the present invention relates to a traffic target static information extraction device, as shown in fig. 1, including: the image acquisition module is used for acquiring video images of traffic targets in a traffic scene; the static information extraction module is used for inputting the key frames of the video images into the trained detection network to obtain a plurality of static information of the traffic target; and the storage module is used for storing the static information.
In this embodiment, the apparatus for extracting static information of a traffic target further includes a key frame extraction module, where the key frame extraction module is configured to extract a valuable video frame from the video image as a key frame. As shown in fig. 2, the key frame extraction module includes a convolution network layer, a full connection layer and a classification layer, which are connected in sequence, where the input of the convolution network layer is a video frame to be screened and a previous frame image of the video frame, and the output is a one-dimensional feature vector; the full connection layer is used for combining the one-dimensional feature vectors; and the classification layer is used for analyzing the combined one-dimensional characteristic vector to finish the classification of the video frames to be screened.
In a traffic video sequence there are often a large number of redundant video frames, and analyzing every frame would seriously slow down computation. Therefore, the original video sequence is first sent to a key frame screening network (i.e., the key frame extraction module) to extract its valuable video frames for subsequent analysis, improving processing efficiency. First, a large-scale video frame screening image library is constructed; through manual labeling it covers two types of images: useless frames (label 0), i.e., frames with few or no traffic objects in the scene or with a low road surface area ratio, and key frames (label 1), i.e., frames with a large number of traffic objects and a high road surface area ratio in the image. Second, the key frame screening network is designed as follows: the frame image to be screened and its previous frame image (as a comparison reference) are sent into a convolution network to obtain a one-dimensional vector containing the features of both images, the image features are then combined using a fully connected layer, and classification of the image frame is completed using softmax. The final output is the classification result, i.e., whether the image is a key frame.
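As a rough illustration of the screening network's structure (per-frame feature extraction, fully connected combination of the two frames' features, softmax classification), here is a minimal NumPy stand-in. A random linear projection replaces the convolution layers, and all weights are hypothetical placeholders rather than values from the patent:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

class KeyFrameScreener:
    """Toy stand-in for the key-frame screening network: the frame to be
    screened and its previous frame are each mapped to a one-dimensional
    feature vector, the two vectors are combined by a fully connected
    layer, and softmax classifies the frame as useless (0) or key (1).
    A real implementation would use convolution layers; the random
    projection here is only a structural placeholder."""

    def __init__(self, frame_pixels, feat_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.w_feat = rng.normal(scale=0.01, size=(frame_pixels, feat_dim))
        self.w_fc = rng.normal(scale=0.01, size=(2 * feat_dim, 2))

    def __call__(self, frame, prev_frame):
        f1 = frame.ravel() @ self.w_feat       # feature of current frame
        f0 = prev_frame.ravel() @ self.w_feat  # feature of previous (reference) frame
        combined = np.concatenate([f1, f0])    # fully connected combination input
        probs = softmax(combined @ self.w_fc)  # 2-way softmax classification
        return int(probs.argmax()), probs
```

Trained on the labeled image library described above, the real network's output label 1 would mark the frame as a key frame to be passed on to the detection network.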
The detection network in the present embodiment includes: a feature extraction unit for extracting features from the key frame to obtain a plurality of ROI regions; a combination unit for combining the ROI regions to obtain combined features; and a multi-branch prediction unit for analyzing the combined features to obtain a plurality of items of static information of the traffic target.
The feature extraction unit is used for performing convolution processing on an input key frame to generate a set of convolution feature maps, then obtaining a set of proposals based on the convolution feature maps using a region proposal network, and obtaining a plurality of ROI regions according to the sizes of the proposals; deconvolution with a bilinear kernel is used for pooling in the region proposal network.
The multi-branch prediction unit comprises a fully connected layer, behind which a three-dimensional prediction branch and a multi-level prediction branch are arranged; the three-dimensional prediction branch is used for detecting key points of the vehicle, with a pyramid mechanism added to adapt to multi-scale vehicle targets; the multi-level prediction branch is used for detecting the target's license plate, vehicle color and number of axles to obtain different static characteristics of the vehicle.
For a traffic target, its static attribute information, such as two-dimensional and three-dimensional attribute information in the traffic scene, needs to be sensed so as to provide a rich and complete target description in a wide-range, multi-camera-linked traffic scene. The two- and three-dimensional static information of a traffic target includes the target position, target category, target size, number of vehicle axles, license plate, vehicle color, and other such information. The key frame obtained from the key frame screening network is input into the detection network, which outputs the target static information through network operation; the design of the detection network is shown in fig. 3. First, a traffic scene video frame is input and a set of convolution feature maps is generated through the convolution layers, and a Region Proposal Network (RPN) obtains a set of proposals based on the feature maps. The RPN predicts bounding boxes likely to contain an object. Deconvolution with a bilinear kernel is used for pooling to enlarge the features of small proposal regions, avoiding the insensitivity to small-target detection that arises when a small traffic target is represented by repeated values. The pooling operation is applied at multiple tiers of the convolutional neural network, and the pooled elements of these different convolution layers are concatenated together to fuse low-level detail information with high-level semantic information. The network is then divided into multiple branches according to proposal region size, reducing the training burden for traffic targets of different scales and dimensions and improving detection accuracy for both large and small objects.
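The enlargement of small proposal features can be sketched with plain bilinear interpolation, which is the effect a deconvolution with a bilinear kernel achieves: interpolated values vary smoothly instead of repeating, so a small proposal is not represented by blocks of identical numbers. A simplified NumPy version, not the patent's exact operator:

```python
import numpy as np

def bilinear_upsample(feat, out_h, out_w):
    """Enlarge a small 2-D ROI feature map to (out_h, out_w) using
    bilinear interpolation. Corner values are preserved and interior
    values are smoothly interpolated rather than duplicated."""
    in_h, in_w = feat.shape
    ys = np.linspace(0, in_h - 1, out_h)   # sample positions in input coords
    xs = np.linspace(0, in_w - 1, out_w)
    out = np.empty((out_h, out_w))
    for i, y in enumerate(ys):
        y0 = int(np.floor(y)); y1 = min(y0 + 1, in_h - 1); wy = y - y0
        for j, x in enumerate(xs):
            x0 = int(np.floor(x)); x1 = min(x0 + 1, in_w - 1); wx = x - x0
            top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
            bot = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
            out[i, j] = (1 - wy) * top + wy * bot
    return out
```

In the network itself this operation is learned as a deconvolution layer initialized with a bilinear kernel, applied per channel of the proposal's feature map.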
On this basis, after combining the ROI features, different branches are used to predict targets of different scales: the detection network is divided into three prediction branches, corresponding to small, medium and large targets. A fully connected layer is added to each of the three prediction branches; behind it, a three-dimensional prediction branch responsible for detecting the vehicle key points (i.e., vertex positions) is added, together with a pyramid mechanism to adapt to multi-scale vehicle targets; multi-level prediction branches for detecting the target's license plate, color and number of axles are also added to obtain the different static characteristics of the vehicle. Finally, all predicted results from the multiple branches are fused into the final detection result. The loss function of the detection network in this embodiment is as follows:
$$\mathrm{loss}_{total} = L_1 + L_2 + L_3 + L_4$$
$$L_1 = \lambda_{cls}\,P(c,c^*) + \lambda_{reg}\,[c^*=1]\,R(d^*-d)$$
$$L_2 = \lambda_{3d}\,[c^*=1]\,R(v^*-v)$$
$$L_3 = \lambda_{sim}\,[c^*=1]\,R(t^*-t)$$
$$L_4 = \lambda_{part}\,[c^*=1]\,R(s^*-s)$$

wherein the overall loss function loss_total is composed of four parts, L_1, L_2, L_3 and L_4. P is the standard softmax loss, c^* is the label of a proposal in the batch (c^* = 1 for a positive example), and R is the smooth L1 loss. In L_1, λ_cls and λ_reg are the regularization constants for classification and regression, and d^*, d are the two-dimensional target rectangular-frame regression vector predicted by the overall network and the true two-dimensional box regression vector, respectively; in L_2, λ_3d is the regularization constant of the three-dimensional detection branch, and v^*, v are the predicted and true three-dimensional regression vectors; in L_3, λ_sim is the regularization parameter of template similarity, and t^*, t are the predicted and true template vectors; in L_4, λ_part is the regularization parameter of component detection, and s^*, s are the predicted and true target component vectors. Through the propagation calculation of this loss function, the static attributes of the target in each traffic scene (position, category, size, number of axles, license plate, vehicle color, etc.) are obtained, completing the two- and three-dimensional detection-integrated multi-level description network design for the traffic target. The network can acquire rich static information of the traffic target and complete the task of describing the traffic target at every level of vehicle attributes.
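The four loss parts can be sketched in NumPy for a single proposal. The gating of all regression terms on a positive proposal (c* = 1) and the grouping into L1..L4 follow the text; the exact weighting and normalization used in the patent are not disclosed, so the default coefficients here are assumptions:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss R, summed over the elements of a residual vector."""
    ax = np.abs(x)
    return float(np.where(ax < 1, 0.5 * x * x, ax - 0.5).sum())

def multitask_loss(cls_logits, label, d_pred, d_true, v_pred, v_true,
                   t_pred, t_true, s_pred, s_true,
                   lam_cls=1.0, lam_reg=1.0, lam_3d=1.0,
                   lam_sim=1.0, lam_part=1.0):
    """Illustrative four-part loss for one proposal: L1 couples softmax
    classification with 2-D box regression; L2, L3, L4 add the 3-D,
    template-similarity and component terms, each active only for a
    positive proposal (label == 1)."""
    # L1 classification term: standard softmax cross-entropy P(c, c*)
    z = cls_logits - cls_logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    P = -log_probs[label]
    pos = 1.0 if label == 1 else 0.0           # regression only for positives
    L1 = lam_cls * P + lam_reg * pos * smooth_l1(d_pred - d_true)
    L2 = lam_3d * pos * smooth_l1(v_pred - v_true)    # 3-D branch
    L3 = lam_sim * pos * smooth_l1(t_pred - t_true)   # template similarity
    L4 = lam_part * pos * smooth_l1(s_pred - s_true)  # component detection
    return L1 + L2 + L3 + L4
```

For a negative proposal only the classification term survives, matching the indicator factors in the formula above.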
It is worth mentioning that, to describe the static characteristics of targets, a large number of target samples need to be collected. In a wide-range traffic scene with multi-camera linkage, traffic targets are rich in number and variety. Although large-scale public object datasets such as COCO, Pascal VOC and the BIT vehicle dataset include many common object features, they are not sufficient for traffic scenarios. This embodiment must consider a wide-range traffic scene with a wide camera shooting range, where a target on the road deforms severely when driving toward or away from the camera; at the same time, to cover the diversity of traffic scenes and account for traffic target conditions in complex traffic environments, multiple traffic-scene-oriented datasets need to be constructed when training the detection network. The traffic scene target feature set constructed in this embodiment is divided into two parts: an actual scene target feature set and a simulated scene target feature set. The actual traffic scene target feature set collects high-definition images from actual road, bridge and tunnel monitoring cameras at home and abroad, covering traffic target features with various scale and shape changes and accounting for insufficient light caused by adverse weather such as cloudy and rainy days. The simulated traffic scene target feature set uses real scenes simulated by the unmanned/autonomous driving platform CARLA: a virtual camera is placed in a virtual traffic scene, and its recording generates the virtual traffic scene video.
From such video, much scene description information can be acquired: the camera's position and angle, its intrinsic and extrinsic parameter matrices, the weather in the scene, the degree of vehicle crowding in the scene, and so on; traffic target description information can also be obtained: the location, speed, category and number of traffic targets, etc. This embodiment collects more than 300 traffic scenes in the simulated traffic scene target feature set; the more than 4000 traffic videos comprise more than 20 million video frames. The actual and simulated scene target feature sets have complementary advantages: they capture the real state of traffic targets in operation while enriching the diversity of traffic scenes, providing complete target and scene feature information for the detection network.
The storage module in this embodiment also encodes the static information before storing it. Specifically, since the constructed detection network yields static information of the traffic target such as the target position, category, size, number of vehicle axles, license plate and vehicle color, a feature model belonging to the target is built on this basis. The acquired target static attribute information is encoded in a unified format and stored in binary form; the specific encoding format is shown in fig. 4. As can be seen from fig. 4, the traffic target to be modeled is a red vehicle; its image position, category, size, number of axles, license plate and color are obtained using the two- and three-dimensional detection-integrated multi-level description network, and this information is converted by an encoder and stored, completing the modeling of the traffic target feature model. When the specific traffic information of the target is needed, it can be recovered using a decoder. The traffic target feature model can uniquely describe the static attributes of the target, so it can be used to uniquely identify a target in a multi-camera-linked traffic scene; at the same time, through encoding and decoding, the target information it records can be used in various environments.
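Since the concrete encoding of fig. 4 is not reproduced here, the following sketch shows one plausible fixed binary layout for the listed static attributes, using Python's standard struct module. Every field width, ordering and the color-as-ID choice is a hypothetical assumption, not the patent's actual format:

```python
import struct

# Hypothetical layout (little-endian, 40 bytes):
# bbox x, y, w, h (4 float32) | category (u8) | axle count (u8) |
# physical size l, w, h (3 float32) | colour id (u8) | plate (9 ASCII bytes)
_FMT = "<4fBB3fB9s"

def encode_target(bbox, category, axles, size, colour, plate):
    """Pack one target's static attributes into a fixed binary record."""
    return struct.pack(_FMT, *bbox, category, axles, *size, colour,
                       plate.encode("ascii")[:9])  # struct null-pads 's'

def decode_target(blob):
    """Recover the static attributes from a binary record."""
    vals = struct.unpack(_FMT, blob)
    bbox, category, axles = vals[0:4], vals[4], vals[5]
    size, colour = vals[6:9], vals[9]
    plate = vals[10].rstrip(b"\x00").decode("ascii")
    return bbox, category, axles, size, colour, plate
```

A fixed-width record like this is easy to store, index and transmit between environments, which matches the motivation given for encoding the feature model.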
In summary, the invention designs an integrated multi-task deep detection network that can acquire rich static information of the traffic target and complete the task of describing the traffic target at every level of vehicle attributes. To improve the calculation speed of the whole network, the invention also judges in advance whether a video frame is a key frame through the key frame screening network. The invention further encodes the extracted target information so that it can be used in various environments, making the implementation more flexible.

Claims (8)

1. A traffic target static information extraction device, characterized by comprising:
the image acquisition module is used for acquiring video images of traffic targets in a traffic scene;
the static information extraction module is used for inputting the key frames of the video images into the trained detection network to obtain a plurality of items of static information of the traffic target; wherein the detection network comprises: a feature extraction unit for extracting features from the key frame to obtain a plurality of ROI regions; a combination unit for combining the ROI regions to obtain combined features; and a multi-branch prediction unit for analyzing the combined features to obtain the static information of the traffic target;
and the storage module is used for storing the static information.
2. The traffic target static information extraction device according to claim 1, wherein the traffic scene target feature set used by the detection network during training includes an actual scene target feature set and a simulated scene target feature set; the actual scene target feature set consists of traffic scene video collected from actual surveillance video, and the simulated scene target feature set consists of virtual traffic scene video generated by placing a virtual camera in a virtual traffic scene and recording with it.
3. The traffic target static information extraction device according to claim 1, further comprising a key frame extraction module for extracting valuable video frames from the video image as key frames.
4. The traffic target static information extraction device according to claim 3, wherein the key frame extraction module comprises a convolution network layer, a full connection layer and a classification layer which are connected in sequence, the input of the convolution network layer is a video frame to be screened and a previous frame image of the video frame, and the output of the convolution network layer is a one-dimensional feature vector; the full connection layer is used for combining the one-dimensional feature vectors; and the classification layer is used for analyzing the combined one-dimensional characteristic vector to finish the classification of the video frames to be screened.
5. The traffic target static information extraction device according to claim 1, wherein the feature extraction unit performs convolution processing on an input key frame to generate a set of convolution feature maps, obtains a set of proposals based on the convolution feature maps using a region proposal network, and obtains a plurality of ROI regions according to the sizes of the proposals; wherein deconvolution with a bilinear kernel is used for pooling in the region proposal network.
6. The traffic target static information extraction device according to claim 1, wherein the multi-branch prediction unit comprises a fully connected layer, behind which a three-dimensional prediction branch and a multi-level prediction branch are arranged; the three-dimensional prediction branch is used for detecting key points of the vehicle, with a pyramid mechanism added to adapt to multi-scale vehicle targets; and the multi-level prediction branch is used for detecting the target's license plate, vehicle color and number of axles to obtain different static characteristics of the vehicle.
7. The traffic target static information extraction device according to claim 1, wherein the loss function of the detection network is:
Figure FDA0003451961880000021
therein, losstotalAs a function of the overall loss, L1,L2,L3,L4For four loss fractions, P is the standard softmax loss, C is the proposal in the batch, C ═ 1 for the positive case, R is the smooth L1 loss, λclsregThe two-dimensional target rectangular frame regression vectors and the two-dimensional target real regression vectors are obtained by predicting the detection network; lambda [ alpha ]3dThe regularization constants of the three-dimensional prediction branches are v, v are three-dimensional regression vectors obtained by prediction of the detection network and real three-dimensional regression vectors respectively; lambda [ alpha ]simRegularization parameters of template similarity degree, wherein t, t are a template vector predicted by the detection network and a real template vector respectively; lambda [ alpha ]partAnd respectively obtaining a target component vector predicted by the detection network and a real target component vector position for the regularization parameter of component detection.
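The four-part loss of claim 7 can be sketched numerically. Below, `softmax_ce` implements the standard softmax (cross-entropy) loss P and `smooth_l1` the smooth L1 loss R; the exact weighting inside the patent's original formula image is an assumption, so the λ weights are exposed as parameters with default 1.0.

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 loss R: quadratic below 1, linear above."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))

def softmax_ce(logits, label):
    """Standard softmax (cross-entropy) loss P for one proposal."""
    z = np.asarray(logits, float)
    z = z - z.max()                          # numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    return float(-log_probs[label])

def total_loss(logits, label, v2d, v2d_gt, v3d, v3d_gt, t, t_gt, p, p_gt,
               lam_cls=1.0, lam_reg=1.0, lam_3d=1.0, lam_sim=1.0,
               lam_part=1.0):
    """loss_total = L1 + L2 + L3 + L4, combining the four branch losses."""
    l1 = lam_cls * softmax_ce(logits, label) + lam_reg * smooth_l1(v2d, v2d_gt)
    l2 = lam_3d * smooth_l1(v3d, v3d_gt)    # three-dimensional branch
    l3 = lam_sim * smooth_l1(t, t_gt)       # template similarity
    l4 = lam_part * smooth_l1(p, p_gt)      # component detection
    return l1 + l2 + l3 + l4
```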
8. The traffic target static information extraction device according to claim 1, wherein the storage module encodes the static information before storing it.
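Claim 8 does not name a codec for the pre-storage encoding; as one illustrative choice (an assumption, not the patent's scheme), the static information could be serialized to JSON and base64-encoded so it stores safely as a flat byte string:

```python
import base64
import json

def encode_static_info(info: dict) -> bytes:
    """Encode static information before storage (JSON + base64 sketch)."""
    return base64.b64encode(json.dumps(info, sort_keys=True).encode("utf-8"))

def decode_static_info(blob: bytes) -> dict:
    """Inverse transform for reading stored records back."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))
```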
CN202111675958.7A 2021-12-31 2021-12-31 Traffic target static information extraction device Pending CN114550023A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111675958.7A CN114550023A (en) 2021-12-31 2021-12-31 Traffic target static information extraction device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111675958.7A CN114550023A (en) 2021-12-31 2021-12-31 Traffic target static information extraction device

Publications (1)

Publication Number Publication Date
CN114550023A true CN114550023A (en) 2022-05-27

Family

ID=81669387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111675958.7A Pending CN114550023A (en) 2021-12-31 2021-12-31 Traffic target static information extraction device

Country Status (1)

Country Link
CN (1) CN114550023A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524474A (en) * 2023-07-04 2023-08-01 武汉大学 Vehicle target detection method and system based on artificial intelligence
CN116524474B (en) * 2023-07-04 2023-09-15 武汉大学 Vehicle target detection method and system based on artificial intelligence
CN117670938A (en) * 2024-01-30 2024-03-08 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot
CN117670938B (en) * 2024-01-30 2024-05-10 江西方兴科技股份有限公司 Multi-target space-time tracking method based on super-treatment robot

Similar Documents

Publication Publication Date Title
CN108694386B (en) Lane line detection method based on parallel convolution neural network
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN110781262B (en) Semantic map construction method based on visual SLAM
CN114550023A (en) Traffic target static information extraction device
WO2023030182A1 (en) Image generation method and apparatus
CN112766136A (en) Space parking space detection method based on deep learning
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN114049572A (en) Detection method for identifying small target
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN113160283A (en) Target tracking method based on SIFT under multi-camera scene
Petrovai et al. Semantic cameras for 360-degree environment perception in automated urban driving
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN115937736A (en) Small target detection method based on attention and context awareness
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN115147644A (en) Method, system, device and storage medium for training and describing image description model
CN112785610B (en) Lane line semantic segmentation method integrating low-level features
CN114170422A (en) Coal mine underground image semantic segmentation method
CN114049541A (en) Visual scene recognition method based on structural information characteristic decoupling and knowledge migration
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN114782915B (en) Intelligent automobile end-to-end lane line detection system and equipment based on auxiliary supervision and knowledge distillation
CN116310970A (en) Automatic driving scene classification algorithm based on deep learning
CN115909276A (en) Improved YOLOv 5-based small traffic sign target detection method in complex weather
Kheder et al. Transfer Learning Based Traffic Light Detection and Recognition Using CNN Inception-V3 Model
CN115439926A (en) Small sample abnormal behavior identification method based on key region and scene depth
CN115240163A (en) Traffic sign detection method and system based on one-stage detection network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination