CN114898243A - Traffic scene analysis method and device based on video stream - Google Patents

Traffic scene analysis method and device based on video stream

Info

Publication number
CN114898243A
CN114898243A (Application CN202210291408.3A)
Authority
CN
China
Prior art keywords
key frame
feature
frame image
video stream
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210291408.3A
Other languages
Chinese (zh)
Inventor
闫军
丁丽珠
王艳清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Vision Technology Co Ltd
Original Assignee
Super Vision Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Super Vision Technology Co Ltd filed Critical Super Vision Technology Co Ltd
Priority to CN202210291408.3A priority Critical patent/CN114898243A/en
Publication of CN114898243A publication Critical patent/CN114898243A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a traffic scene analysis method and device based on a video stream. The method comprises the following steps: inputting a plurality of key frame images into a backbone network for key frame feature extraction to obtain a plurality of key frame feature maps; inputting the key frame image temporally closest to each non-key frame image, the corresponding key frame feature map, and each non-key frame image into a feature mapping network for feature mapping to obtain the non-key frame mapping feature map corresponding to each non-key frame image; inputting the key frame feature maps and the non-key frame mapping feature maps into a semantic segmentation branch network and an instance segmentation branch network for prediction to obtain a predicted panoramic segmentation result of the video stream; and constructing a spatial feature constraint loss function according to the predicted panoramic segmentation result and the real panoramic segmentation result, and constructing a temporal feature constraint loss function according to each non-key frame feature map and the corresponding non-key frame mapping feature map.

Description

Traffic scene analysis method and device based on video stream
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a traffic scene parsing method and apparatus based on a video stream.
Background
With the development of high-position video technology, data are acquired by installing high-mounted video cameras at the roadside. Using vision algorithms, the acquired data can support tasks such as vehicle detection, license plate detection, pedestrian detection, lane line detection, parking line detection, passable region segmentation, and roadside greenery segmentation, enabling comprehensive traffic scene analysis of the entire area monitored by the high-mounted camera. Analyzing the traffic scene helps achieve finer and more accurate management of roadside parking and can assist in providing additional evidence of vehicle violations, irregular driving, and the like. Traffic scene analysis therefore enables more intelligent application and management of traffic flow and has a positive effect on urban traffic management, driving safety, and other aspects.
In traditional traffic scene analysis methods, scene analysis is performed on individual single-frame images and the analysis results of multiple single-frame images are fused. For a video stream, however, analyzing and fusing single frames one by one can make the final traffic scene analysis result incoherent and discards feature information, so the traffic scene cannot be analyzed accurately and comprehensively.
Summary of the Application
The present application aims to solve the technical problems that, with traditional methods, traffic scene analysis results are incoherent and the traffic scene cannot be analyzed accurately and comprehensively. To achieve the above object, the present application provides a traffic scene parsing method and apparatus based on a video stream.
The application provides a traffic scene analysis method based on a video stream, which comprises the following steps:
acquiring a video stream, and carrying out panoramic segmentation and labeling on the video stream to obtain a real panoramic segmentation result of the video stream;
extracting key frames of the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panoramic segmentation model for key frame feature extraction to obtain a plurality of key frame feature maps;
inputting the key frame image with the closest time distance to each non-key frame image, the key frame feature map corresponding to the key frame image and each non-key frame image into a feature mapping network of the panoramic segmentation model for feature mapping to obtain a non-key frame mapping feature map corresponding to each non-key frame image;
inputting the key frame feature map corresponding to the key frame image with the closest time distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panoramic segmentation model for prediction to obtain a predicted panoramic segmentation result of the video stream;
constructing a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and constructing a temporal feature constraint loss function according to a non-key frame feature map corresponding to each non-key frame image and the non-key frame mapping feature map;
and training and optimizing the panoramic segmentation model according to the spatial feature constraint loss function and the temporal feature constraint loss function to obtain a trained panoramic segmentation model, and analyzing a traffic scene of a video stream to be tested according to the trained panoramic segmentation model.
In one embodiment, the inputting the key frame image with the closest temporal distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panorama segmentation model for feature mapping to obtain a non-key frame mapping feature map corresponding to each non-key frame image includes:
inputting the key frame image with the closest time distance to each non-key frame image and each non-key frame image into an optical flow neural network to obtain a feature optical flow map;
and performing feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
In an embodiment, in the performing of feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image, the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented by f_i(p) = G(q, p + δp) · f_k(q);
wherein f_i(p) denotes the feature at position p in the i-th non-key frame image, f_k(q) denotes the feature at position q in the k-th key frame image with the closest time distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_i→k(p) denotes the positional offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_i→k = f(I_k, I_i) denotes the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k denotes the k-th key frame image, I_i denotes the i-th non-key frame image, and f is the optical flow neural network.
In an embodiment, in the constructing of a temporal feature constraint loss function according to the non-key frame feature map and the non-key frame mapping feature map corresponding to each non-key frame image, the temporal feature constraint loss function is:
L_temporal = (1/N) · Σ_i ||ŷ_i − y_i||²
wherein ŷ_i denotes the non-key frame mapping feature map, y_i denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.
In one embodiment, the inputting of the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panorama segmentation model for prediction to obtain a predicted panorama segmentation result of the video stream includes:
inputting each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panoramic segmentation model for semantic prediction to obtain a plurality of predicted uncountable target categories of the video stream;
inputting each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panoramic segmentation model for instance prediction to obtain a plurality of predicted countable target detection box classes, a plurality of predicted countable target detection box positions, and a plurality of predicted countable target detection box binarization masks of the video stream;
and performing fusion processing on the plurality of predicted uncountable target categories, the plurality of predicted countable target detection box classes, the plurality of predicted countable target detection box positions, and the plurality of predicted countable target detection box binarization masks to obtain the predicted panorama segmentation result of the video stream.
In one embodiment, the extracting key frames from the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panorama segmentation model to perform key frame feature extraction to obtain a plurality of key frame feature maps includes:
extracting every K frames in the video stream to obtain a plurality of key frame images and a plurality of non-key frame images;
inputting the plurality of key frame images into the backbone network for feature extraction to obtain a plurality of key frame feature maps;
Wherein K is a positive integer in the range of 3 to 8.
In one embodiment, the present application provides a traffic scene parsing apparatus based on a video stream, including:
the data acquisition module is used for acquiring a video stream, and performing panoramic segmentation and labeling on the video stream to obtain a real panoramic segmentation result of the video stream;
the feature extraction module is used for extracting key frames of the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panoramic segmentation model to extract key frame features to obtain a plurality of key frame feature maps;
a feature mapping module, configured to input the key frame image with the closest temporal distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panoramic segmentation model for feature mapping, so as to obtain a non-key frame mapping feature map corresponding to each non-key frame image;
a prediction module, configured to input the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panorama segmentation model for prediction, so as to obtain a predicted panorama segmentation result of the video stream;
A loss function construction module, configured to construct a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and construct a temporal feature constraint loss function according to a non-key frame feature map corresponding to each non-key frame image and the non-key frame mapping feature map;
and the model generation module is used for training and optimizing the panoramic segmentation model according to the spatial characteristic constraint loss function and the temporal characteristic constraint loss function to obtain a trained panoramic segmentation model, and analyzing a traffic scene of the video stream to be tested according to the trained panoramic segmentation model.
In one embodiment, the feature mapping module comprises:
the feature optical flow map acquisition module is used for inputting the key frame image with the closest time distance to each non-key frame image and each non-key frame image into an optical flow neural network to obtain a feature optical flow map;
a non-key frame mapping feature map obtaining module, configured to perform feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance, so as to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
In one embodiment, in the non-key frame mapping feature map obtaining module, the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented by f_i(p) = G(q, p + δp) · f_k(q);
wherein f_i(p) denotes the feature at position p in the i-th non-key frame image, f_k(q) denotes the feature at position q in the k-th key frame image with the closest time distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_i→k(p) denotes the positional offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_i→k = f(I_k, I_i) denotes the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k denotes the k-th key frame image, I_i denotes the i-th non-key frame image, and f is the optical flow neural network.
In one embodiment, in the loss function building module, the temporal feature constraint loss function is:
L_temporal = (1/N) · Σ_i ||ŷ_i − y_i||²
wherein ŷ_i denotes the non-key frame mapping feature map, y_i denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.
In one embodiment, the prediction module comprises:
the semantic segmentation module is used for inputting each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panoramic segmentation model for semantic prediction to obtain a plurality of predicted uncountable target categories of the video stream;
an instance segmentation module, configured to input each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panoramic segmentation model for instance prediction, so as to obtain a plurality of predicted countable target detection box classes, a plurality of predicted countable target detection box positions, and a plurality of predicted countable target detection box binarization masks of the video stream;
and the fusion module is used for fusing the plurality of predicted uncountable target categories, the plurality of predicted countable target detection box classes, the plurality of predicted countable target detection box positions, and the plurality of predicted countable target detection box binarization masks to obtain the predicted panorama segmentation result of the video stream.
In one embodiment, the feature extraction module comprises:
an image extraction module, configured to extract every K frames in the video stream to obtain the multiple key frame images and the multiple non-key frame images;
a key frame feature map acquisition module, configured to input the multiple key frame images into the backbone network for feature extraction, so as to obtain multiple key frame feature maps;
wherein K is a positive integer in the range of 3 to 8.
The traffic scene analysis method and device based on the video stream can assign a category to each pixel in the video stream, segment the pixel region contained in each instance object, and distinguish all visible content in the view. Through the traffic scene analysis method based on the video stream, the plurality of key frame images and the plurality of non-key frame images in the video stream can be smoothly connected, the relevance among the frame images in the video stream is enhanced, and the inter-frame information in the video stream is fully fused, thereby achieving fine, accurate, smooth, and coherent scene analysis of the video stream. Therefore, the traffic scene analysis method based on the video stream can perform all-around traffic scene analysis of the area monitored by the entire high-mounted camera, which helps achieve finer and more accurate management of roadside parking and improves urban traffic management, driving safety, and other aspects.
Drawings
Fig. 1 is a schematic flowchart illustrating steps of a traffic scene parsing method based on video streaming according to the present application.
Fig. 2 is a schematic structural diagram of a traffic scene parsing apparatus based on video streams provided in the present application.
Detailed Description
The technical solution of the present application is further described in detail by the accompanying drawings and embodiments.
Referring to fig. 1, the present application provides a traffic scene parsing method based on video streams, including:
s10, acquiring a video stream, and carrying out panorama segmentation labeling on the video stream to obtain a real panorama segmentation result of the video stream;
s20, extracting key frames of the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panoramic segmentation model for key frame feature extraction to obtain a plurality of key frame feature maps;
s30, inputting the key frame image with the closest time distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panoramic segmentation model for feature mapping to obtain the non-key frame mapping feature map corresponding to each non-key frame image;
s40, inputting the key frame feature map corresponding to the key frame image with the closest time distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into the semantic segmentation branch network and the instance segmentation branch network of the panorama segmentation model for prediction to obtain the predicted panorama segmentation result of the video stream;
S50, constructing a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and constructing a temporal feature constraint loss function according to a non-key frame feature map and a non-key frame mapping feature map corresponding to each non-key frame image;
and S60, training and optimizing the panoramic segmentation model according to the spatial characteristic constraint loss function and the temporal characteristic constraint loss function to obtain a trained panoramic segmentation model, and analyzing the traffic scene of the video stream to be tested according to the trained panoramic segmentation model.
In S10, a traffic scene video stream is acquired based on high-position video data. The video stream is decomposed into frame-by-frame images, and panorama segmentation labeling is performed on the video stream to obtain the real panorama segmentation result corresponding to the video stream. The high-position video data may be captured by a high-mounted video camera, that is, a camera installed at the roadside for capturing traffic scene information. The real panorama segmentation result may include the category of each uncountable target, the category of each countable target, and the instance of each countable target, which may be distinguished by assigning different IDs. The category of an uncountable target may be a category corresponding to an uncountable target such as sky, road, or green plants. The category of a countable target may be a category corresponding to a countable target such as a vehicle, a pedestrian, or a traffic sign. In an embodiment, when performing panorama segmentation labeling on the video stream, the pixel region contained in each uncountable target may be obtained with a polygon labeling tool, and the instance ID of each uncountable target is set to 0. The pixel region contained in each countable target may likewise be obtained with a polygon labeling tool, the pixel regions of different instances of each countable target are distinguished, and each instance of each countable target is given a different instance ID. Instances are understood to be countable targets, such as vehicles, pedestrians, lane lines, green plants, non-motor vehicles, and traffic signs.
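As a toy illustration only (not part of the patent), the labeling convention above might be recorded as follows; the category names, coordinates, and dictionary layout are assumptions chosen to show that uncountable targets share instance ID 0 while each countable instance receives its own ID:

```python
# Hypothetical annotation records for one frame (illustrative only).
# Uncountable ("stuff") targets such as sky and road share instance_id 0;
# each countable ("thing") instance, e.g. each vehicle, gets a distinct ID.
frame_annotation = [
    {"category": "sky",     "instance_id": 0, "polygon": [(0, 0), (1920, 0), (1920, 300), (0, 300)]},
    {"category": "road",    "instance_id": 0, "polygon": [(0, 600), (1920, 600), (1920, 1080), (0, 1080)]},
    {"category": "vehicle", "instance_id": 1, "polygon": [(300, 650), (620, 650), (620, 860), (300, 860)]},
    {"category": "vehicle", "instance_id": 2, "polygon": [(900, 640), (1180, 640), (1180, 830), (900, 830)]},
]
```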
In S20, feature extraction is performed through the backbone network. The backbone network is a convolutional neural network structure, such as a VGG (Visual Geometry Group) network or ResNet (residual network), and performs feature extraction on the plurality of key frame images in the video stream to obtain the plurality of key frame feature maps.
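As a minimal sketch, assuming a torchvision ResNet-50 stands in for the backbone (the patent names VGG and ResNet only as examples), key frame feature maps could be extracted as follows; the tensor sizes are illustrative:

```python
import torch
import torchvision

# Assumed backbone: ResNet-50 with the classification head removed, so the
# output is a spatial feature map rather than class logits.
backbone = torchvision.models.resnet50(weights=None)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

key_frames = torch.randn(2, 3, 512, 1024)  # toy batch of key frame images
with torch.no_grad():
    key_frame_feature_maps = feature_extractor(key_frames)  # shape [2, 2048, 16, 32]
```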
In S30, the feature representation of each key frame feature map acquired in S20 is mapped into the other, non-key frame images. The feature mapping network may be implemented by an optical flow neural network (FlowNet). Feature mapping is performed between a non-key frame feature map and a key frame feature map through the feature mapping network. The time distance between a non-key frame image and a key frame image being the closest can be understood as the interval between the two images being the shortest. Feature mapping through the feature mapping network is performed between each non-key frame image and its temporally closest key frame image, so that the features of the key frame image are expressed in the feature map corresponding to the non-key frame image, forming the corresponding non-key frame mapping feature map. By constructing the correspondence between each non-key frame image and its temporally closest key frame image, the relevance between the frames of the video stream is enhanced, the subsequent analysis of the video stream is more coherent, and the problem of discarded feature information is avoided.
In S40, each non-key frame image corresponds to the temporally closest key frame image and to a non-key frame mapping feature map, with a one-to-one correspondence among the three. The non-key frame mapping feature map characterizes the correlation between a non-key frame image and its temporally closest key frame image, and thus establishes the temporal correlation among the frames of the video stream. Therefore, through the plurality of non-key frame mapping feature maps, the frames in the video stream can be connected and made more coherent. Each non-key frame mapping feature map, which carries this continuity, and the temporally closest key frame feature map are input into the semantic segmentation branch network for semantic segmentation, avoiding the problem of lost feature information; they are likewise input into the instance segmentation branch network for instance segmentation, again avoiding the problem of lost feature information. By performing semantic segmentation and instance segmentation on the non-key frame mapping feature maps and the temporally closest key frame feature maps, the predicted panorama segmentation result corresponding to the video stream can be obtained.
In S50, the spatial feature constraint loss function and the temporal feature constraint loss function form an overall loss function, that is, a loss function of the panorama segmentation model. In S60, the model is trained and optimized for the panorama segmentation model according to the loss function of the panorama segmentation model, so as to obtain optimized model parameters, and further obtain a stable panorama segmentation model. According to the trained panoramic segmentation model, the traffic scene of the video stream to be tested can be analyzed, and a panoramic segmentation result is obtained.
Through the traffic scene analysis method based on the video stream, each pixel in the video stream can be assigned a category, the pixel region contained in each instance object can be segmented, and all visible content in the view can be distinguished. The plurality of key frame images and the plurality of non-key frame images in the video stream can be smoothly connected, the relevance among the frame images in the video stream is enhanced, and the inter-frame information in the video stream is fully fused, thereby achieving fine, accurate, smooth, and coherent scene analysis of the video stream. Therefore, the traffic scene analysis method based on the video stream can perform all-around traffic scene analysis of the area monitored by the entire high-mounted camera, which helps achieve finer and more accurate management of roadside parking and improves urban traffic management, driving safety, and other aspects.
In one embodiment, S20, performing key frame extraction on the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panorama segmentation model to perform key frame feature extraction to obtain a plurality of key frame feature maps, includes:
s210, extracting frames at intervals of K frames in the video stream to obtain a plurality of key frame images and a plurality of non-key frame images;
s220, inputting the plurality of key frame images into the backbone network for feature extraction to obtain a plurality of key frame feature maps;
wherein the plurality of key frame images and the plurality of non-key frame images form a video stream, and K is a positive integer ranging from 3 to 8.
In this embodiment, a key frame image is a video frame extracted every K frames from the continuous video frames. After the plurality of key frame images are removed from the video stream, the remaining video frames are the non-key frame images; in other words, the plurality of key frame images together with the plurality of non-key frame images form the video stream. K is a positive integer in the range of 3 to 8 and can be 3, 4, 5, 6, 7, or 8. By setting the value of K appropriately, the video stream can be properly divided into key frame images and non-key frame images while avoiding an interval so large that inter-frame relevance is lost. By performing feature extraction on the video stream every K frames, the traffic scene analysis method achieves fine, accurate, smooth, and coherent scene analysis of the video stream.
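A minimal sketch of this frame-splitting step (the function names and the per-index modulo rule are assumptions for illustration; the patent only specifies extraction every K frames with K between 3 and 8):

```python
from typing import List, Sequence, Tuple

def split_key_frames(frames: Sequence, K: int = 5) -> Tuple[List[int], List[int]]:
    """Treat every K-th frame as a key frame; the rest are non-key frames."""
    assert 3 <= K <= 8, "K is a positive integer in the range 3 to 8"
    key_idx = [i for i in range(len(frames)) if i % K == 0]
    non_key_idx = [i for i in range(len(frames)) if i % K != 0]
    return key_idx, non_key_idx

def nearest_key_frame(i: int, key_idx: List[int]) -> int:
    """Index of the key frame temporally closest to non-key frame i."""
    return min(key_idx, key=lambda k: abs(k - i))
```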
In one embodiment, S30, inputting the key frame image with the closest temporal distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panorama segmentation model for feature mapping, and obtaining the non-key frame mapping feature map corresponding to each non-key frame image, includes:
s310, inputting the key frame image with the closest time distance to each non-key frame image and each non-key frame image into an optical flow neural network to obtain a feature optical flow map;
and S320, performing feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
In S310, the inputs to the optical flow neural network are each non-key frame image and the key frame image with the closest time distance to it. The computation of the optical flow neural network comprises a contraction part composed of convolutional layers, a correlation layer, and an expansion part composed of deconvolution layers. The contraction part composed of convolutional layers is used to extract the feature maps of the two frames of images. The correlation layer is used to compute the correlation between the two feature maps. The expansion part composed of deconvolution layers is used to perform optical flow prediction to obtain the feature optical flow map.
In S320, feature mapping is performed using the feature optical flow map and the temporally closest key frame feature map, so that the features of the temporally closest key frame are aligned and propagated to the feature map of the corresponding non-key frame image, thereby obtaining the output feature map corresponding to the non-key frame image, that is, the non-key frame mapping feature map.
The non-key frame mapping feature map embodies the feature fusion relationship between a non-key frame image and its temporally closest key frame image, so that the temporal factor is fully considered in the subsequent predictions of the semantic segmentation branch network and the instance segmentation branch network, the relevance between the frames of the video stream is enhanced, the subsequent analysis of the video stream is more coherent, and the problem of discarded feature information is avoided.
In one embodiment, in S320, feature mapping is performed on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image, wherein the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented by f_i(p) = G(q, p + δp) · f_k(q);
wherein f_i(p) denotes the feature at position p in the i-th non-key frame image, f_k(q) denotes the feature at position q in the k-th key frame image with the closest time distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_i→k(p) denotes the positional offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_i→k = f(I_k, I_i) denotes the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k denotes the k-th key frame image, I_i denotes the i-th non-key frame image, and f is the optical flow neural network.
In this embodiment, feature mapping is performed by bilinear interpolation. In the feature mapping process, according to the key frame feature map corresponding to the temporally closest key frame image and the known feature optical flow map corresponding to the non-key frame image, the position in the temporally closest key frame image of each pixel point in the non-key frame image can be obtained. Performing feature mapping between the feature optical flow map of each non-key frame image and the key frame feature map of its temporally closest key frame image by bilinear interpolation connects the non-key frame image and its temporally closest key frame image more tightly, makes the subsequent analysis of the video stream more coherent, and avoids the problem of discarded feature information.
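A sketch of this warping step using bilinear sampling; the grid_sample-based implementation, flow sign convention, and tensor shapes are assumptions, since the patent only specifies bilinear interpolation of the key frame features along the feature optical flow map:

```python
import torch
import torch.nn.functional as F

def warp_key_frame_features(f_k: torch.Tensor, flow_i_to_k: torch.Tensor) -> torch.Tensor:
    """f_k: [B, C, H, W] key frame feature map; flow_i_to_k: [B, 2, H, W] optical
    flow (in pixels) from non-key frame i to key frame k. Returns the non-key
    frame mapping feature map obtained by bilinear sampling at p + delta_p."""
    B, _, H, W = f_k.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=f_k.device),
                            torch.arange(W, device=f_k.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()          # [2, H, W], (x, y) order
    coords = base.unsqueeze(0) + flow_i_to_k             # sample positions p + delta_p
    norm_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0    # normalize to [-1, 1]
    norm_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)         # [B, H, W, 2]
    return F.grid_sample(f_k, grid, mode="bilinear", align_corners=True)
```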
In one embodiment, S40, inputting the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into the semantic segmentation branch network and the instance segmentation branch network of the panorama segmentation model for prediction, and obtaining a predicted panorama segmentation result of the video stream, includes:
s410, inputting each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panoramic segmentation model for semantic prediction to obtain a plurality of predicted uncountable target categories of the video stream;
s420, inputting each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panoramic segmentation model for instance prediction to obtain a plurality of predicted countable target detection box classes, a plurality of predicted countable target detection box positions, and a plurality of predicted countable target detection box binarization masks of the video stream;
and S430, performing fusion processing on the plurality of predicted uncountable target categories, the plurality of predicted countable target detection box classes, the plurality of predicted countable target detection box positions, and the plurality of predicted countable target detection box binarization masks to obtain the predicted panorama segmentation result of the video stream.
In S410, each key frame feature map and each non-key frame mapping feature map are input into an upsampling layer of the semantic segmentation branch network for enlargement to obtain a plurality of enlarged feature maps of the video stream. High-low-level feature fusion is performed on each enlarged feature map to obtain a plurality of fused feature maps of the video stream. The plurality of fused feature maps of the video stream are classified according to a normalized exponential function to obtain a plurality of predicted uncountable target categories of the video stream. The plurality of predicted uncountable target categories can also be understood as a plurality of predicted pixel categories, representing the categories corresponding to uncountable targets such as sky, roads, and green plants.
The semantic segmentation branch network mainly comprises several upsampling operations and feature fusion operations. An upsampling operation is performed on the features of each key frame feature map and each non-key frame mapping feature map to restore the feature size to the original image size. Finally, the classification probability of each pixel point is obtained through the Softmax normalized exponential function, and the plurality of predicted uncountable target categories in the video stream are thereby obtained.
In one embodiment, the semantic segmentation branch network includes, but is not limited to, commonly used semantic segmentation networks such as FCN, DeepLab, and U-Net. The upsampling operation includes, but is not limited to, restoring the feature size using deconvolution, bilinear interpolation, and the like. For high-low-level feature fusion, the low-resolution high-level features are interpolated, for example by bilinear interpolation, to the same resolution as the high-resolution low-level features, and feature fusion is then performed by addition or by concatenation along the channel dimension. Through high-low-level feature fusion, features from different layers of the network can be fused, thereby improving the segmentation accuracy of the semantic segmentation. High-low-level feature fusion includes, but is not limited to, feature fusion methods such as FPN (Feature Pyramid Network) and ASPP (Atrous Spatial Pyramid Pooling). The Softmax normalized exponential function is a multi-class computation that converts the prediction result of each point into a classification probability, and its formula is as follows:
Softmax(z_c) = exp(z_c) / Σ_{c'} exp(z_{c'})
wherein z_c denotes the output value for class c; converting the prediction result with the exponential function ensures the non-negativity of the probability; and the classification probability is obtained by dividing the exponentially transformed prediction result of each class by the sum of the exponentially transformed results over all classes, i.e., the proportion of each exponentially transformed result in the total.
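A minimal sketch of such a semantic branch; the channel counts, the single 1x1 classifier, and plain bilinear upsampling are assumptions standing in for the richer upsampling and high-low-level fusion described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticBranch(nn.Module):
    """Per-pixel classification over the uncountable (stuff) categories."""
    def __init__(self, in_channels: int = 2048, num_classes: int = 8):
        super().__init__()
        self.classifier = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feature_map: torch.Tensor, image_size) -> torch.Tensor:
        logits = self.classifier(feature_map)
        logits = F.interpolate(logits, size=image_size, mode="bilinear",
                               align_corners=False)   # restore original image size
        return logits.softmax(dim=1)                  # per-pixel class probabilities

probs = SemanticBranch()(torch.randn(1, 2048, 16, 32), (512, 1024))  # [1, 8, 512, 1024]
```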
In S420, the plurality of feature maps are input into a region proposal network layer to obtain a plurality of target detection boxes. The plurality of target detection boxes are input into a detection and mask prediction layer to obtain the class, position, and binarization mask of each target detection box. Prediction of instances is realized by constructing the instance segmentation branch network. In one embodiment, the instance segmentation branch network may adopt the Mask R-CNN instance segmentation algorithm as its architecture. Mask R-CNN is a two-stage instance segmentation method. First, each key frame feature map and each non-key frame mapping feature map are input into the region proposal network RPN (Region Proposal Network) to generate a plurality of candidate region boxes ROI (Region of Interest). Then, the obtained candidate region boxes are input into the detection and mask prediction layer to generate the class information of each ROI, the position information of its detection box, and the binarization mask of each ROI. Through the region proposal network RPN, candidate boxes for a plurality of targets can be generated. A target detection box may also be understood as a target candidate box. The generated candidate boxes of the plurality of targets are adjusted to the same size through a candidate box alignment layer. The detection and mask prediction layer comprises two branches: the detection branch predicts the class and position of each candidate region box, and the mask branch performs pixel-by-pixel semantic segmentation within the candidate region box to obtain the binarization mask of the ROI. The binarization mask means that, within the candidate region box, pixels predicted to belong to the background take the value 0 and pixels predicted to belong to the target instance take the value 1.
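Purely as an illustration of the two-stage pattern described above, torchvision's reference Mask R-CNN can serve as a stand-in for the instance segmentation branch; the class count, input size, and the use of a separate detector (rather than the panoramic model's shared backbone and warped features) are assumptions:

```python
import torch
import torchvision

# Stand-in instance segmentation network (RPN -> RoI heads -> class/box/mask);
# requires a torchvision version that accepts the `weights` keyword.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None, num_classes=5)
model.eval()

images = [torch.randn(3, 512, 1024)]   # toy input frame
with torch.no_grad():
    outputs = model(images)            # list of dicts: 'boxes', 'labels', 'scores', 'masks'
```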
In S430, feature fusion processing is performed on the output results of the two branches, namely the semantic segmentation branch network and the instance segmentation branch network, to obtain the panorama segmentation output. During the fusion processing, pixel points with conflicting assignments can be resolved by a heuristic algorithm, thereby solving the problem of conflicting pixel assignments.
In one embodiment, in S50, the total loss function is formed by the spatial feature constraint loss function and the temporal feature constraint loss function. Model optimization and parameter updating are carried out by constructing a total loss function of model training.
In this embodiment, the spatial feature constraint loss function calculates the difference between the predicted panorama segmentation result and the real panorama segmentation result, with the goal of reducing that difference. The temporal feature constraint loss function, which constrains the features between key frames and non-key frames along the time sequence, calculates the difference between the feature map of each non-key frame image and the corresponding predicted non-key frame mapping feature map, so as to obtain a panoramic video segmentation result that is as stable and consistent as possible.
The total loss function L may be expressed as L = α_1·L_spatial + α_2·L_temporal;
wherein α_1 and α_2 denote weight coefficients, which can each be set to 1, L_spatial denotes the spatial feature constraint loss function, and L_temporal denotes the temporal feature constraint loss function.
In one embodiment, in S50, the temporal feature constraint loss function is constructed according to the non-key frame feature map corresponding to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image, where the temporal feature constraint loss function is:
L_temporal = (1/N) · Σ_i ||ŷ_i − y_i||²
wherein ŷ_i denotes the non-key frame mapping feature map, y_i denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.
L_temporal denotes the temporal feature constraint loss function, which adopts an L2-norm loss. The L2-norm loss function, also known as the least-squares error loss function, minimizes the sum of the squares of the differences between the target values and the estimated values. By optimizing the temporal feature constraint loss function, the difference between the predicted non-key frame feature map (which can also be understood as the non-key frame mapping feature map corresponding to the non-key frame image) and the real non-key frame feature map can be reduced.
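As a minimal sketch, assuming a mean-squared reduction over the feature maps consistent with the L2-norm description above, the temporal constraint could be computed as:

```python
import torch

def temporal_feature_loss(mapped_feats: torch.Tensor, direct_feats: torch.Tensor) -> torch.Tensor:
    """mapped_feats: non-key frame mapping feature maps (warped from key frames);
    direct_feats: feature maps extracted directly from the non-key frames."""
    return torch.mean((mapped_feats - direct_feats) ** 2)
```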
In one embodiment, the real panorama segmentation result includes a plurality of real uncountable target categories, a plurality of real countable target detection box classes, a plurality of real countable target detection box positions, and a plurality of real countable target detection box binarization masks.
In S50, constructing a spatial feature constraint loss function according to the predicted panorama segmentation result and the actual panorama segmentation result, including:
constructing a loss function of the semantic segmentation branch network according to the plurality of real uncountable target categories and the plurality of predicted uncountable target categories;
constructing a countable target detection box classification loss function according to the plurality of predicted countable target detection box classes and the plurality of real countable target detection box classes, constructing a countable target detection box position loss function according to the plurality of predicted countable target detection box positions and the plurality of real countable target detection box positions, constructing a countable mask segmentation loss function according to the plurality of predicted countable target detection box binarization masks and the plurality of real countable target detection box binarization masks, and constructing a loss function of the instance segmentation branch network according to the countable target detection box classification loss function, the countable target detection box position loss function, and the countable mask segmentation loss function;
and constructing the spatial feature constraint loss function according to the loss function of the semantic segmentation branch network and the loss function of the instance segmentation branch network.
In this embodiment, L_spatial denotes the spatial feature constraint loss function, which comprises the semantic segmentation branch loss function and the instance segmentation branch loss function and can be expressed as L_spatial = β_1·L_semantic + β_2·L_instance;
wherein β_1 and β_2 denote weight coefficients, which can each be set to 1, L_semantic denotes the semantic segmentation branch loss function, and L_instance denotes the instance segmentation branch loss function.
L_semantic denotes the semantic segmentation branch loss function and adopts a cross-entropy loss function:
L_semantic = −Σ_{c=1}^{C} y_c · log(P_c)
wherein C denotes the number of semantic segmentation classes, P_c denotes the predicted probability that a sample belongs to class c, and y_c denotes the true label category distribution.
L_instance denotes the instance segmentation branch loss function and may be expressed as follows:
L_instance = λ_1·L_cls + λ_2·L_box + λ_3·L_mask
wherein λ_1, λ_2, and λ_3 denote weight coefficients, which can each be set to 1, L_cls denotes the countable target detection box classification loss function, L_box denotes the countable target detection box position loss function, and L_mask denotes the countable mask segmentation loss function.
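The weighted combination of the loss terms defined above can be sketched as follows (a minimal illustration; all weights default to 1, as stated above):

```python
def instance_branch_loss(l_cls, l_box, l_mask, lam1=1.0, lam2=1.0, lam3=1.0):
    return lam1 * l_cls + lam2 * l_box + lam3 * l_mask

def spatial_constraint_loss(l_semantic, l_instance, beta1=1.0, beta2=1.0):
    return beta1 * l_semantic + beta2 * l_instance

def total_training_loss(l_spatial, l_temporal, alpha1=1.0, alpha2=1.0):
    return alpha1 * l_spatial + alpha2 * l_temporal
```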
The countable target detection box classification loss function can be expressed as follows:
L_cls = −(1/N) · Σ_i [ŷ_i · log(P_i) + (1 − ŷ_i) · log(1 − P_i)]
wherein N denotes the total number of predicted countable target detection boxes, P_i denotes the probability that the i-th predicted countable target detection box is predicted to be a target, and ŷ_i denotes the real label corresponding to the i-th predicted countable target detection box.
In one embodiment, the countable target detection box position loss function L_box adopts an IoU loss. The IoU loss function is based on the intersection-over-union (IoU) between the predicted countable target detection box position and the real countable target detection box position.
In one embodiment, when performing panorama segmentation labeling on the video stream in S10, the specific outline of each instance is labeled by the polygon labeling method. An approximate bounding box is obtained by taking (x_min, y_min) and (x_max, y_max) from the polygon coordinate list as the coordinates of the upper-left corner and the lower-right corner, respectively, of the real countable target detection box, denoted G. The predicted countable target detection box is denoted P, and the corresponding IoU can be expressed as
IoU = area(G ∩ P) / area(G ∪ P)
that is, the ratio of the intersection and the union of the two detection boxes. The IoU loss function is L_IoU = 1 − IoU.
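A sketch of the IoU loss for a single pair of boxes given as (x_min, y_min, x_max, y_max) corners, following the labeling convention above; the function name is an assumption:

```python
def iou_loss(P, G) -> float:
    """P, G: predicted and real boxes as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(P[0], G[0]), max(P[1], G[1])
    ix2, iy2 = min(P[2], G[2]), min(P[3], G[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((P[2] - P[0]) * (P[3] - P[1])
             + (G[2] - G[0]) * (G[3] - G[1]) - inter)
    iou = inter / union if union > 0 else 0.0
    return 1.0 - iou
```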
The countable mask segmentation loss function L_mask adopts a binary cross-entropy loss function, calculated as:
L_mask = −[y · log(P) + (1 − y) · log(1 − P)]
wherein P denotes the predicted probability for the countable target detection box binarization mask, and y = 0 and y = 1 denote the cases where the true countable target detection box binarization mask is 0 or 1.
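A per-pixel sketch of this binary cross-entropy term; the epsilon clamp is an added numerical safeguard, not part of the formula above:

```python
import math

def binary_cross_entropy(p: float, y: int, eps: float = 1e-7) -> float:
    """p: predicted probability that the mask pixel is 1; y: true mask value (0 or 1)."""
    p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
```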
In one embodiment, S60, training and optimizing the panorama segmentation model according to the spatial feature constraint loss function and the temporal feature constraint loss function to obtain a trained panorama segmentation model, and performing traffic scene analysis on the video stream to be detected according to the trained panorama segmentation model, includes:
A video stream to be tested is acquired and input into the backbone network, the feature mapping network, the semantic segmentation branch network, and the instance segmentation branch network of the panoramic segmentation model for prediction, and the panorama segmentation result corresponding to the video stream to be tested is obtained. The panorama segmentation result includes the category of each uncountable target, the category of each countable target, and the instance of each countable target, and can be applied in a roadside parking management system to achieve finer and more accurate management and to improve urban traffic management, driving safety, and other aspects.
Referring to fig. 2, in an embodiment, the present application provides a traffic scene parsing apparatus 100 based on a video stream, which includes a data obtaining module 10, a feature extracting module 20, a feature mapping module 30, a predicting module 40, a loss function constructing module 50, and a model generating module 60.
The data obtaining module 10 is configured to obtain a video stream, perform panorama segmentation and labeling on the video stream, and obtain a real panorama segmentation result of the video stream. The feature extraction module 20 is configured to perform key frame extraction on the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and input the plurality of key frame images into a backbone network of the panorama segmentation model to perform key frame feature extraction to obtain a plurality of key frame feature maps. The feature mapping module 30 is configured to input the key frame image with the closest temporal distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panorama segmentation model for feature mapping, so as to obtain a non-key frame mapping feature map corresponding to each non-key frame image. The prediction module 40 is configured to input the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into the semantic segmentation branch network and the instance segmentation branch network of the panorama segmentation model for prediction, so as to obtain a predicted panorama segmentation result of the video stream. The loss function constructing module 50 is configured to construct a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and construct a temporal feature constraint loss function according to the non-key frame feature map and the non-key frame mapping feature map corresponding to each non-key frame image. The model generating module 60 is configured to train and optimize the panorama segmentation model according to the spatial feature constraint loss function and the temporal feature constraint loss function to obtain a trained panorama segmentation model, and perform traffic scene analysis on the video stream to be tested according to the trained panorama segmentation model.
In this embodiment, the related description of the data obtaining module 10 may refer to the related description of S10 in the above embodiment. The related description of the feature extraction module 20 may refer to the related description of S20 in the above embodiment. The related description of the feature mapping module 30 may refer to the related description of S30 in the above embodiment. The related description of the prediction module 40 may refer to the related description of S40 in the above embodiment. The related description of the loss function constructing module 50 may refer to the related description of S50 in the above embodiment. The related description of the model generating module 60 may refer to the related description of S60 in the above embodiment.
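Purely as an illustrative sketch of how the apparatus above could be organized in code, the composition below groups the six modules as callables; the interfaces are assumptions for illustration, not definitions given by this application.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrafficSceneParsingApparatus:
    """Hedged sketch of apparatus 100 as a composition of six callables."""
    data_obtaining: Callable       # module 10: obtain and annotate the video stream
    feature_extraction: Callable   # module 20: key frame backbone features
    feature_mapping: Callable      # module 30: optical-flow based feature mapping
    prediction: Callable           # module 40: semantic and instance branches plus fusion
    loss_construction: Callable    # module 50: spatial and temporal constraint losses
    model_generation: Callable     # module 60: training, optimization, deployment
```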
In one embodiment, the feature mapping module 30 includes a feature optical flow map acquisition module (not shown) and a non-key frame mapping feature map acquisition module (not shown).

The feature optical flow map acquisition module is used for inputting the key frame image closest in temporal distance to each non-key frame image, together with that non-key frame image, into an optical flow neural network to obtain a feature optical flow map. The non-key frame mapping feature map acquisition module is used for performing feature mapping on each feature optical flow map and the key frame feature map corresponding to the temporally closest key frame image, so as to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
In this embodiment, the related description of the feature optical flow map acquisition module may refer to the related description of S310 in the above embodiment. The related description of the non-key frame mapping feature map acquisition module may refer to the related description of S320 in the above embodiment.
In one embodiment, in the non-key frame mapping feature map acquisition module, the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented by f_i(p) = G(q, p + δp)·f_k(q).

Wherein f_i(p) represents the feature at position p in the i-th non-key frame image, f_k(q) represents the feature at position q in the k-th key frame image closest in temporal distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_{i→k}(p) represents the position offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_{i→k} = f(I_k, I_i) represents the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k represents the k-th key frame image, I_i represents the i-th non-key frame image, and f is the optical flow neural network.
The relevant description in this embodiment may refer to the relevant description in S320 in the above embodiment.
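As a concrete but non-authoritative reading of the formula above, the sketch below warps a key frame feature map with a feature optical flow map through bilinear interpolation using `torch.nn.functional.grid_sample`; the tensor layout and the pixel-offset flow convention are assumptions rather than details fixed by this application.

```python
import torch
import torch.nn.functional as F

def warp_key_features(key_feat, flow):
    """Hedged sketch of f_i(p) = G(q, p + δp)·f_k(q): sample the key frame
    feature map f_k at flow-shifted positions via bilinear interpolation.

    key_feat: (N, C, H, W) key frame feature map f_k
    flow:     (N, 2, H, W) feature optical flow map, pixel offsets (dx, dy)
    """
    n, _, h, w = key_feat.shape
    # Build the base grid of target positions p (x, y order).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float()        # (H, W, 2)
    grid = grid.unsqueeze(0).expand(n, -1, -1, -1)      # (N, H, W, 2)
    # Shift by the flow to get the sampling positions q = p + δp.
    grid = grid + flow.permute(0, 2, 3, 1)
    # Normalize to [-1, 1], the coordinate range expected by grid_sample.
    grid_x = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)
    return F.grid_sample(key_feat, grid, mode="bilinear", align_corners=True)
```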
In one embodiment, in the loss function constructing module 50, the temporal feature constraint loss function is given by the formula shown in the original filing (embedded there as an image and not reproduced in this text), wherein the mapped-feature term denotes the non-key frame mapping feature map, y denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.

In this embodiment, the related description of the temporal feature constraint loss function may refer to the related description in the above embodiments.
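Since the formula itself appears only as an image in the filing, the following is merely a hedged reconstruction of what such a temporal consistency term commonly looks like, written to match the symbols described above (ŷ for the non-key frame mapping feature map, y for the non-key frame feature map, N for the number of frames); it should not be read as the exact loss claimed by this application.

```latex
% Hedged reconstruction (assumption, not the filed formula):
% an L2 consistency penalty between mapped and directly extracted features,
% averaged over the N frames of the video stream.
L_{\mathrm{temporal}} = \frac{1}{N}\sum_{i=1}^{N}\left\lVert \hat{y}_i - y_i \right\rVert_2^{2}
```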
In one embodiment, prediction module 40 includes a semantic segmentation module (not labeled), an instance segmentation module (not labeled), and a fusion module (not labeled).
The semantic segmentation module is used for inputting each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panoramic segmentation model for semantic prediction to obtain a plurality of predicted uncountable target categories of the video stream. The instance segmentation module is used for inputting each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panoramic segmentation model for instance prediction to obtain a plurality of predicted countable target detection frame categories, a plurality of predicted countable target detection frame positions and a plurality of predicted countable target detection frame binarization masks of the video stream.

The fusion module is used for fusing the plurality of predicted uncountable target categories, the plurality of predicted countable target detection frame positions and the plurality of predicted countable target detection frame binarization masks to obtain the predicted panoramic segmentation result of the video stream.
In this embodiment, the related description of the semantic segmentation module may refer to the related description of S410 in the above embodiment. The related description of the instance segmentation module may refer to the related description of S420 in the above embodiment. The related description of the fusion module may refer to the related description of S430 in the above embodiment.
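A minimal sketch of the kind of fusion the fusion module performs is given below, assuming a simple "paste countable targets over uncountable targets" strategy; the dictionary layout, the score threshold and the paste order are assumptions and are not prescribed by this application.

```python
import numpy as np

def fuse_panoptic(semantic_map, instances, score_thresh=0.5):
    """Hedged sketch of panoptic fusion.

    semantic_map: (H, W) int array of predicted uncountable (stuff) category ids
    instances:    list of dicts with 'mask' (H, W) bool, 'category' int, 'score' float
    Returns a (H, W, 2) array holding (category id, instance id) per pixel.
    """
    h, w = semantic_map.shape
    panoptic = np.zeros((h, w, 2), dtype=np.int32)
    panoptic[..., 0] = semantic_map                    # start from stuff predictions
    # Paste countable targets from high to low confidence so stronger
    # detections are not overwritten by weaker overlapping ones.
    for inst_id, inst in enumerate(
            sorted(instances, key=lambda d: d["score"], reverse=True), start=1):
        if inst["score"] < score_thresh:
            continue
        free = inst["mask"] & (panoptic[..., 1] == 0)  # pixels not yet claimed
        panoptic[free, 0] = inst["category"]
        panoptic[free, 1] = inst_id
    return panoptic
```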
In one embodiment, the feature extraction module 20 includes an image extraction module (not shown) and a key frame feature map acquisition module (not shown).
The image extraction module is used for extracting every K frames from the video stream to obtain a plurality of key frame images and a plurality of non-key frame images. The key frame feature map acquisition module is used for inputting the plurality of key frame images into the backbone network for feature extraction to obtain a plurality of key frame feature maps.
Wherein K is a positive integer in the range of 3 to 8.
In this embodiment, reference may be made to the related description of S210 in the foregoing embodiment for the related description of the key frame feature map obtaining module.
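A minimal sketch of the interval sampling performed by the image extraction module, assuming frames are simply indexed in order and K is chosen within the 3-to-8 range described above:

```python
def split_key_and_non_key(frames, K=5):
    """Hedged sketch: treat every K-th frame as a key frame and the rest
    as non-key frames, keeping the original indices for later pairing."""
    if not 3 <= K <= 8:
        raise ValueError("K is expected to be a positive integer in [3, 8]")
    key_frames = [(i, f) for i, f in enumerate(frames) if i % K == 0]
    non_key_frames = [(i, f) for i, f in enumerate(frames) if i % K != 0]
    return key_frames, non_key_frames
```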
In the various embodiments described above, the particular order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
Those of skill in the art will also appreciate that the various illustrative logical blocks, modules, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, modules, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
The various illustrative logical blocks described in this disclosure may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in the embodiments herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
The objects, technical solutions and advantages of the present application have been described above in further detail. It should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present application and are not intended to limit the scope of the present application; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present application shall be included in the scope of the present application.

Claims (12)

1. A traffic scene analysis method based on video streaming is characterized by comprising the following steps:
acquiring a video stream, and carrying out panorama segmentation and labeling on the video stream to obtain a real panorama segmentation result of the video stream;
extracting key frames of the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, inputting the key frame images into a backbone network of the panoramic segmentation model to extract key frame features, and obtaining a plurality of key frame feature maps;
inputting the key frame image with the closest time distance to each non-key frame image, the key frame feature map corresponding to the key frame image and each non-key frame image into a feature mapping network of the panoramic segmentation model for feature mapping to obtain a non-key frame mapping feature map corresponding to each non-key frame image;
inputting the key frame feature map corresponding to the key frame image with the closest time distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panoramic segmentation model for prediction to obtain a predicted panoramic segmentation result of the video stream;
Constructing a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and constructing a temporal feature constraint loss function according to a non-key frame feature map corresponding to each non-key frame image and the non-key frame mapping feature map;
and training and optimizing the panoramic segmentation model according to the spatial feature constraint loss function and the temporal feature constraint loss function to obtain a trained panoramic segmentation model, and analyzing a traffic scene of a video stream to be tested according to the trained panoramic segmentation model.
2. The method according to claim 1, wherein the step of inputting the key frame image with the closest time distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panorama segmentation model for feature mapping to obtain a non-key frame mapping feature map corresponding to each non-key frame image comprises:
inputting the key frame image with the closest time distance to each non-key frame image and each non-key frame image into an optical flow neural network to obtain a feature optical flow map;
and performing feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
3. The traffic scene parsing method based on video stream as claimed in claim 2, wherein, in the feature mapping of each feature optical flow map with the key frame feature map corresponding to the key frame image with the closest temporal distance to obtain the non-key frame mapping feature map corresponding to each non-key frame image, the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented as f_i(p) = G(q, p + δp)·f_k(q);
wherein f_i(p) represents the feature at position p in the i-th non-key frame image, f_k(q) represents the feature at position q in the k-th key frame image closest in temporal distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_{i→k}(p) represents the position offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_{i→k} = f(I_k, I_i) represents the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k represents the k-th key frame image, I_i represents the i-th non-key frame image, and f is the optical flow neural network.
4. The traffic scene parsing method based on video stream as claimed in claim 3, wherein, in the construction of the temporal feature constraint loss function according to the non-key frame feature map and the non-key frame mapping feature map corresponding to each non-key frame image, the temporal feature constraint loss function is given by the formula shown in the original filing (embedded there as an image and not reproduced in this text), wherein the mapped-feature term denotes the non-key frame mapping feature map, y denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.
5. The method as claimed in claim 1, wherein the step of inputting the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panorama segmentation model for prediction to obtain the predicted panorama segmentation result of the video stream comprises:
inputting each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panorama segmentation model for semantic prediction to obtain a plurality of predicted uncountable target categories of the video stream;
inputting each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panorama segmentation model for instance prediction to obtain a plurality of predicted countable target detection frame categories, a plurality of predicted countable target detection frame positions and a plurality of predicted countable target detection frame binarization masks of the video stream;
and performing fusion processing on the plurality of predicted uncountable target categories, the plurality of predicted countable target detection frame positions and the plurality of predicted countable target detection frame binarization masks to obtain the predicted panorama segmentation result of the video stream.
6. The method of claim 1, wherein the extracting key frames from the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panorama segmentation model to perform key frame feature extraction to obtain a plurality of key frame feature maps comprises:
extracting every K frames in the video stream to obtain a plurality of key frame images and a plurality of non-key frame images;
Inputting the plurality of key frame images into the backbone network for feature extraction to obtain a plurality of key frame feature maps;
wherein K is a positive integer in the range of 3 to 8.
7. A traffic scene parsing device based on video stream, comprising:
the data acquisition module is used for acquiring a video stream, and performing panoramic segmentation and labeling on the video stream to obtain a real panoramic segmentation result of the video stream;
the feature extraction module is used for extracting key frames of the video stream to obtain a plurality of key frame images and a plurality of non-key frame images, and inputting the plurality of key frame images into a backbone network of the panoramic segmentation model to extract key frame features to obtain a plurality of key frame feature maps;
a feature mapping module, configured to input the key frame image with the closest temporal distance to each non-key frame image, the key frame feature map corresponding to the key frame image, and each non-key frame image into a feature mapping network of the panoramic segmentation model for feature mapping, so as to obtain a non-key frame mapping feature map corresponding to each non-key frame image;
a prediction module, configured to input the key frame feature map corresponding to the key frame image with the closest temporal distance to each non-key frame image and the non-key frame mapping feature map corresponding to each non-key frame image into a semantic segmentation branch network and an instance segmentation branch network of the panorama segmentation model for prediction, so as to obtain a predicted panorama segmentation result of the video stream;
A loss function construction module, configured to construct a spatial feature constraint loss function according to the predicted panorama segmentation result and the real panorama segmentation result, and construct a temporal feature constraint loss function according to a non-key frame feature map corresponding to each non-key frame image and the non-key frame mapping feature map;
and the model generation module is used for training and optimizing the panoramic segmentation model according to the spatial characteristic constraint loss function and the temporal characteristic constraint loss function to obtain a trained panoramic segmentation model, and analyzing a traffic scene of the video stream to be tested according to the trained panoramic segmentation model.
8. The apparatus for parsing traffic scene based on video stream of claim 7, wherein the feature mapping module comprises:
a feature optical flow map acquisition module, configured to input the key frame image with the closest time distance to each non-key frame image and each non-key frame image into an optical flow neural network to obtain a feature optical flow map;
and a non-key frame mapping feature map acquisition module, configured to perform feature mapping on each feature optical flow map and the key frame feature map corresponding to the key frame image with the closest time distance, so as to obtain the non-key frame mapping feature map corresponding to each non-key frame image.
9. The traffic scene parsing apparatus based on video stream as claimed in claim 8, wherein, in the non-key frame mapping feature map acquisition module, the non-key frame mapping feature map corresponding to the i-th non-key frame image is represented by f_i(p) = G(q, p + δp)·f_k(q);
wherein f_i(p) represents the feature at position p in the i-th non-key frame image, f_k(q) represents the feature at position q in the k-th key frame image closest in temporal distance to the i-th non-key frame image, G(q, p + δp) denotes bilinear interpolation, δp = F_{i→k}(p) represents the position offset when position q in the k-th key frame image is mapped to position p in the i-th non-key frame image, F_{i→k} = f(I_k, I_i) represents the feature optical flow map corresponding to the i-th non-key frame image and the k-th key frame image, I_k represents the k-th key frame image, I_i represents the i-th non-key frame image, and f is the optical flow neural network.
10. The traffic scene parsing apparatus based on video stream as claimed in claim 9, wherein, in the loss function construction module, the temporal feature constraint loss function is given by the formula shown in the original filing (embedded there as an image and not reproduced in this text), wherein the mapped-feature term denotes the non-key frame mapping feature map, y denotes the non-key frame feature map, and N denotes the number of frames of images in the video stream.
11. The traffic scene parsing apparatus based on video stream as claimed in claim 7, wherein the prediction module comprises:
a semantic segmentation module, configured to input each key frame feature map and each non-key frame mapping feature map into the semantic segmentation branch network of the panoramic segmentation model for semantic prediction, so as to obtain a plurality of predicted uncountable target categories of the video stream;
an instance segmentation module, configured to input each key frame feature map and each non-key frame mapping feature map into the instance segmentation branch network of the panoramic segmentation model for instance prediction, so as to obtain a plurality of predicted countable target detection frame categories, a plurality of predicted countable target detection frame positions and a plurality of predicted countable target detection frame binarization masks of the video stream;
and a fusion module, configured to fuse the plurality of predicted uncountable target categories, the plurality of predicted countable target detection frame positions and the plurality of predicted countable target detection frame binarization masks, so as to obtain the predicted panoramic segmentation result of the video stream.
12. The traffic scene parsing apparatus based on video stream as claimed in claim 7, wherein the feature extraction module comprises:
an image extraction module, configured to extract every K frames from the video stream to obtain the plurality of key frame images and the plurality of non-key frame images;
a key frame feature map acquisition module, configured to input the plurality of key frame images into the backbone network for feature extraction, so as to obtain the plurality of key frame feature maps;
wherein K is a positive integer in the range of 3 to 8.
CN202210291408.3A 2022-03-23 2022-03-23 Traffic scene analysis method and device based on video stream Pending CN114898243A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210291408.3A CN114898243A (en) 2022-03-23 2022-03-23 Traffic scene analysis method and device based on video stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210291408.3A CN114898243A (en) 2022-03-23 2022-03-23 Traffic scene analysis method and device based on video stream

Publications (1)

Publication Number Publication Date
CN114898243A true CN114898243A (en) 2022-08-12

Family

ID=82716276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210291408.3A Pending CN114898243A (en) 2022-03-23 2022-03-23 Traffic scene analysis method and device based on video stream

Country Status (1)

Country Link
CN (1) CN114898243A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115171029A (en) * 2022-09-09 2022-10-11 山东省凯麟环保设备股份有限公司 Unmanned-driving-based method and system for segmenting instances in urban scene
CN115171029B (en) * 2022-09-09 2022-12-30 山东省凯麟环保设备股份有限公司 Unmanned-driving-based method and system for segmenting instances in urban scene
CN115620199A (en) * 2022-10-24 2023-01-17 四川警察学院 Traffic safety risk diagnosis method and device
CN115620199B (en) * 2022-10-24 2023-06-13 四川警察学院 Traffic safety risk diagnosis method and device
CN117831075A (en) * 2024-01-03 2024-04-05 深圳力强数智科技有限公司 Human skeleton key point reasoning method and device for video stream analysis training

Similar Documents

Publication Publication Date Title
Song et al. Vision-based vehicle detection and counting system using deep learning in highway scenes
CN111368687B (en) Sidewalk vehicle illegal parking detection method based on target detection and semantic segmentation
CN114898243A (en) Traffic scene analysis method and device based on video stream
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
CN109902676B (en) Dynamic background-based violation detection algorithm
CN109325502B (en) Shared bicycle parking detection method and system based on video progressive region extraction
CN108550258B (en) Vehicle queuing length detection method and device, storage medium and electronic equipment
WO2012139228A1 (en) Video-based detection of multiple object types under varying poses
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN114170580A (en) Highway-oriented abnormal event detection method
CN116453121B (en) Training method and device for lane line recognition model
CN113361528B (en) Multi-scale target detection method and system
CN116071747A (en) 3D point cloud data and 2D image data fusion matching semantic segmentation method
Kejriwal et al. Vehicle detection and counting using deep learning basedYOLO and deep SORT algorithm for urban traffic management system
CN116597270A (en) Road damage target detection method based on attention mechanism integrated learning network
Ali et al. IRUVD: a new still-image based dataset for automatic vehicle detection
CN115909241A (en) Lane line detection method, system, electronic device and storage medium
Vishwakarma et al. Design and Augmentation of a Deep Learning Based Vehicle Detection Model for Low Light Intensity Conditions
CN113408550B (en) Intelligent weighing management system based on image processing
Alomari et al. Smart real-time vehicle detection and tracking system using road surveillance cameras
Bhattacharyya et al. JUVDsi v1: developing and benchmarking a new still image database in Indian scenario for automatic vehicle detection
CN114820931B (en) Virtual reality-based CIM (common information model) visual real-time imaging method for smart city
CN116091964A (en) High-order video scene analysis method and system
CN115909140A (en) Video target segmentation method and system based on high-order video monitoring
CN114882205A (en) Target detection method based on attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination