Disclosure of Invention
The invention aims to provide a depth map estimation method based on a traffic camera, which can complete depth calculation for target distance measurement or positioning by using data acquired by a fixed camera.
In order to solve the technical problems, the invention adopts the technical scheme that: the depth map estimation method based on the traffic camera specifically comprises the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
s3: performing background replacement on the continuous frame images of the vehicle tracking video in the step S2 to obtain an original frame;
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
s5: calculating a pixel-based depth value of each frame through a depth neural network for the continuous frame images of the single vehicle tracking video processed in the step S3;
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
s7: constructing an objective function according to the synthesized frame of step S6 and the original frame of step S3, and training the depth neural network model of step S5 under the constraint of the objective function, so as to obtain the computed depth map of the traffic camera video.
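For illustration only, the outline below sketches how steps S1 to S7 might be chained together in code. Every helper name (load_video, segment_and_track_vehicles, replace_background, estimate_virtual_poses, adjacent_pairs, synthesize_frame, photometric_objective, update_network) is a hypothetical placeholder for the operation described in the corresponding step, not part of the claimed method; each step is detailed in the preferred embodiments below.

```python
# Outline-level sketch of steps S1-S7; all helper names are hypothetical
# placeholders for the operations described in the steps above.
def estimate_depth_model(video_path, K, depth_net):
    frames = load_video(video_path)                              # S1: fixed, fixed-focus camera
    tracks = segment_and_track_vehicles(frames)                  # S2: segmentation + tracking
    for track in tracks:                                         # one tracked vehicle at a time
        originals = replace_background(track)                    # S3: original frames, uniform background
        poses = estimate_virtual_poses(originals, K)             # S4: virtual camera poses
        for target, source, T in adjacent_pairs(originals, poses):
            depth = depth_net(target)                            # S5: per-pixel depth of the target frame
            synthesized = synthesize_frame(source, depth, T, K)  # S6: synthesized frame
            loss = photometric_objective(synthesized, target)    # S7: objective function
            update_network(depth_net, loss)                      # train under the objective
    return depth_net
```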
As a preferred technical solution of the present invention, the step S2 includes the following steps:
s21: segmenting all vehicles in each frame of the traffic video data by using an image semantic segmentation model, and merging all categories other than vehicles into one class;
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree of the image area covered by the template and the template according to a normalized correlation coefficient matching method, and selecting the vehicle with the maximum matching degree with the template as a tracking target to realize the rapid tracking of the vehicle;
as a preferred technical solution of the present invention, in step S224, the matching degree between the image area covered by the template and the template is calculated according to the normalized correlation coefficient matching method, and the correlation coefficient is computed as

$$R(x,y)=\frac{\sum_{x',y'} T'(x',y')\, I'(x+x',y+y')}{\sqrt{\sum_{x',y'} T'(x',y')^{2} \cdot \sum_{x',y'} I'(x+x',y+y')^{2}}}$$

where $T(x',y')$ denotes the template image, $T'(x',y')=T(x',y')-\bar{T}$ is the difference between the template and the template mean, and $I'(x+x',y+y')=I(x+x',y+y')-\bar{I}$ is the difference between the original image and the original image mean. Only the vehicle contour is segmented in this step.
As a preferred technical solution of the present invention, the step S3 includes the following steps: firstly, extracting the continuous frames of a single vehicle and setting all pixels other than the target vehicle in each continuous frame image to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates the subsequent extraction and matching of feature points, and the original frames are thus obtained. The background of each frame image is replaced in order to prevent the static background and other dynamic objects in the image from affecting the virtual camera pose estimation, thereby facilitating the subsequent extraction and matching of feature points.
As a preferred embodiment of the present invention, in step S4, the virtual pose is determined by converting the motion of the vehicle relative to the camera into a virtual motion of the camera;
the method specifically comprises the following steps of:
s41: firstly, extracting the features of the target vehicle in the continuous frames and matching the features of the previous and subsequent frames; removing outliers with the random sample consensus (RANSAC) algorithm to solve the fundamental matrix F, and then solving the essential matrix E by combining the camera intrinsic parameters K, where the relation among the camera intrinsic parameters K, the fundamental matrix F and the essential matrix E is

$$E = K^{T} F K$$
s42: then decomposing the rotation matrix R and the translation vector t of the camera motion through singular value decomposition (SVD); let the SVD of E be

$$E = U \Sigma V^{T}$$

where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then

$$\hat{t}_{1} = U R_{Z}\!\left(\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{1} = U R_{Z}^{T}\!\left(\tfrac{\pi}{2}\right) V^{T}$$

$$\hat{t}_{2} = U R_{Z}\!\left(-\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{2} = U R_{Z}^{T}\!\left(-\tfrac{\pi}{2}\right) V^{T}$$

where $R_{Z}\!\left(\pm\tfrac{\pi}{2}\right)$ denotes the rotation matrix for a rotation of 90 degrees around the Z axis and $\hat{t}$ is the skew-symmetric matrix of t; substituting any matched point into the four candidate solutions, the solution for which the depth value of the point is positive is the correct solution;
s43: calculating the three-dimensional coordinate P of each matched feature point according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points; they satisfy

$$s_{1} x_{1} = s_{2} R x_{2} + t$$

where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least squares solution of the above formula gives the depths of the point in the two frame images, from which the three-dimensional spatial position, i.e. the three-dimensional coordinate, of the matched feature point is obtained;
s44: after the pose is initialized, the pose between adjacent frames is solved from the three-dimensional coordinates of the feature points in the previous frame and the corresponding two-dimensional pixel coordinates of those feature points in the subsequent frame.
As a preferred technical solution of the present invention, the depth calculation network in step S5 is a DispNet network, and the depth value of the image pixel is calculated in an end-to-end manner; the input is a target frame image, and the output is a depth map formed by the depth values of all pixel points of the target frame.
As a preferred technical solution of the present invention, in step S6 the image of the synthesized frame is calculated from the pose, relative to the target frame, of the frame following the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: first the target frame image is mapped into the three-dimensional coordinate system through the depth values of the target frame, and then the three-dimensional points are re-projected onto the image under the new viewing angle, the calculation formula being

$$p_{s} \sim K \, \hat{T}_{t \to s} \, \hat{D}_{t}(p_{t}) \, K^{-1} p_{t}$$

where $\hat{D}_{t}(p_{t})$ is the estimated depth of the point $p_{t}$ in the target frame, K is the camera intrinsic parameter matrix, $\hat{D}_{t}(p_{t}) K^{-1} p_{t}$ is the three-dimensional coordinate of the mapped point $p_{t}$, and $\hat{D}_{t}$ and the true depth $D_{t}$ are approximately equal; $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and $\hat{T}_{t \to s}$ and the true relative pose $T_{t \to s}$ are approximately equal.
As a preferred embodiment of the present invention, in step S7 an objective function is constructed from the differences between the series of synthesized images of step S6 and the original target images of step S3; for a single target vehicle having n training frame images $I_{1}, \dots, I_{n}$, the objective function is

$$L = \sum_{i=1}^{n} \sum_{p} \left| I_{i}(p) - \hat{I}_{i}(p) \right|$$

where p ranges over all points in an image, $I_{i}$ is the target frame, and $\hat{I}_{i}$ is the synthesized target frame image.
Compared with the prior art, the invention has the following beneficial effects: by converting the depth calculation problem in the fixed-camera scene of a traffic camera into the calculation of the pose of a virtual moving camera plus a depth calculation, the difficulty of obtaining image depth from a fixed monocular camera, whose viewing angle never changes, is overcome; the segmented moving-vehicle images provide the basis for monocular camera pose initialization and subsequent pose calculation in the virtual pose computation; the technical scheme can complete the depth calculation using only video data acquired by a fixed camera with known camera intrinsic parameters, which facilitates use in actual traffic monitoring scenes, since each camera can learn its own depth calculation model and the poor generalization of depth models caused by differences in camera mounting height, scene changes and the like need not be considered; the depth map obtained by the depth calculation model can provide a basis for tasks such as distance measurement and positioning of vehicles by the traffic camera.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments of the present invention.
Example: as shown in figs. 1 to 4, the depth map estimation method based on the traffic camera specifically includes the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
the specific steps of step S2 are:
s21: segmenting all vehicles in each frame of the video by using the image semantic segmentation model DeepLabv3+ trained on the urban road scene dataset Cityscapes, and merging all remaining categories into one "other" class, as shown in fig. 2;
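By way of illustration, a minimal sketch of step S21 is given below. It uses torchvision's pretrained DeepLabV3 (torchvision >= 0.13, trained on a COCO/VOC label set) as a stand-in, because a Cityscapes-trained DeepLabv3+ checkpoint is not bundled with torchvision; the class index 7 ("car") and the weights argument are assumptions tied to that substitute model, not part of the embodiment.

```python
# Sketch of step S21 with a pretrained stand-in model (NOT the Cityscapes
# DeepLabv3+ of the embodiment): every pixel not predicted as "car" is
# merged into a single "other" class.
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vehicle_mask(frame_rgb):
    """Return a boolean (H, W) mask of pixels predicted as 'car'."""
    x = preprocess(frame_rgb).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)["out"][0]         # (num_classes, H, W)
    return (logits.argmax(0) == 7).numpy()  # class 7 = "car" in this label set
```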
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree between the image area covered by the template and the template according to the normalized correlation coefficient matching method, and selecting the vehicle with the highest matching degree with the template as the tracking target, so as to realize fast tracking of the vehicle; only the vehicle contour is segmented in this step; in step S224 the correlation coefficient is computed as

$$R(x,y)=\frac{\sum_{x',y'} T'(x',y')\, I'(x+x',y+y')}{\sqrt{\sum_{x',y'} T'(x',y')^{2} \cdot \sum_{x',y'} I'(x+x',y+y')^{2}}}$$

where $T(x',y')$ denotes the template image, $T'(x',y')=T(x',y')-\bar{T}$ is the difference between the template and the template mean, and $I'(x+x',y+y')=I(x+x',y+y')-\bar{I}$ is the difference between the original image and the original image mean; on the condition that the matching coefficient is greater than 0.7, the vehicle with the highest matching degree with the template is selected as the tracking target;
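A minimal sketch of steps S221 to S224 using OpenCV's normalized correlation coefficient matching (cv2.TM_CCOEFF_NORMED, which implements the formula above) follows; the data layout of the contour and of the envelope boxes is an assumption made for illustration.

```python
# Sketch of steps S221-S224: envelope box from a segmentation contour and
# template matching by normalized correlation coefficient; a candidate is
# accepted only if the matching coefficient exceeds 0.7.
import cv2

def envelope_box(contour):
    """Axis-aligned envelope from the min/max of the contour coordinates (S221)."""
    xs, ys = contour[:, 0], contour[:, 1]
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def track_by_template(prev_frame, prev_box, next_frame, next_boxes, min_score=0.7):
    """Return the index of the next-frame vehicle best matching the template, or None."""
    x0, y0, x1, y1 = prev_box
    template = prev_frame[y0:y1, x0:x1]              # S222: previous-frame envelope as template
    th, tw = template.shape[:2]
    best_idx, best_score = None, min_score
    for i, (bx0, by0, bx1, by1) in enumerate(next_boxes):
        cx, cy = (bx0 + bx1) // 2, (by0 + by1) // 2  # S223: align template centre with candidate centre
        rx0, ry0 = max(cx - tw // 2, 0), max(cy - th // 2, 0)
        region = next_frame[ry0:ry0 + th, rx0:rx0 + tw]
        if region.shape[:2] != (th, tw):
            continue                                 # candidate too close to the image border
        score = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)[0, 0]
        if score > best_score:                       # S224: highest correlation above 0.7
            best_idx, best_score = i, score
    return best_idx
```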
s3: performing background replacement on the continuous frame images of the vehicle tracking video of step S2 to obtain the original frames; the specific steps of step S3 are: firstly, extracting the continuous frames of a single vehicle and setting all pixels other than the target vehicle in each continuous frame image to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates the subsequent extraction and matching of feature points, and the original frames are thus obtained; the background of each frame image is replaced in order to prevent the static background and other dynamic objects in the image from affecting the virtual camera pose estimation, thereby facilitating the subsequent extraction and matching of feature points; the replacement method is as follows: first count the number N of vehicles in the image and generate N background images of uniform background color, then, using the vehicle segmentation contours as position references, extract the RGB values of the vehicle pixels and write them into the same positions of the corresponding target background, thereby generating N images each containing only one target vehicle, where the background pixel RGB value is set to (192, 192, 192) to distinguish it from the vehicles, as shown in fig. 3;
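The background-replacement step S3 could be sketched as follows; the mask representation (one boolean mask per segmented vehicle) is an assumption made for illustration.

```python
# Sketch of step S3: for each of the N segmented vehicles, produce an image in
# which every pixel outside that vehicle is set to the uniform value (192, 192, 192).
import numpy as np

BACKGROUND = np.array([192, 192, 192], dtype=np.uint8)

def replace_background(frame, vehicle_masks):
    """frame: (H, W, 3) RGB image; vehicle_masks: list of boolean (H, W) masks."""
    outputs = []
    for mask in vehicle_masks:
        img = np.empty_like(frame)
        img[...] = BACKGROUND          # uniform background colour
        img[mask] = frame[mask]        # keep the target vehicle's RGB values
        outputs.append(img)
    return outputs
```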
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
said step S4 determines the virtual pose by converting the motion of the vehicle relative to the camera into a virtual motion of the camera; as shown in fig. 4, the three-dimensional spatial position of the camera optical center and the initial position of the moving target point are marked, and the motion trajectory of the target point corresponds to a virtual motion trajectory of the camera optical center relative to the target point;
The method specifically comprises the following steps of:
s41: firstly, extracting ORB features from the target vehicle in the continuous frames and matching the features of the previous and subsequent frames; removing outliers with the random sample consensus (RANSAC) algorithm to solve the fundamental matrix F, and then solving the essential matrix E by combining the camera intrinsic parameters K, where the relation among the camera intrinsic parameters K, the fundamental matrix F and the essential matrix E is

$$E = K^{T} F K$$
s411, extracting ORB features: firstly, corner points are extracted with the FAST algorithm; a pixel p with brightness $I_{p}$ is selected in the image and a threshold $T$ is set; taking p as the center, the 16 pixels on a circle of radius 3 are examined, and if 12 consecutive points on the circle have brightness greater than $I_{p}+T$ or less than $I_{p}-T$, p is taken as a feature point; the above operation is performed for every pixel in the image until all feature points are found; when fewer than 100 feature points are extracted from a frame, the frame is discarded; the descriptor of each feature point is then computed with the BRIEF algorithm, which represents the brightness relations of 128 random pixel points near the feature point with 0s and 1s, forming a vector;
s412, feature matching: the features are matched quickly using the fast approximate nearest neighbor (FLANN) algorithm, which is described in Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340), and is implemented by the open-source library FLANN (https://github.com/mariusmuja/flann);
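For illustration, steps S411 and S412 could be implemented with OpenCV's ORB detector (which combines the FAST corner test and BRIEF-style descriptors described above) and a FLANN matcher; the LSH index parameters and the Lowe ratio test are common practice added here, not part of the description.

```python
# Sketch of steps S411-S412: ORB feature extraction and FLANN-based matching.
import cv2

orb = cv2.ORB_create(nfeatures=1000)
flann = cv2.FlannBasedMatcher(
    dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),  # LSH index for binary descriptors
    dict(checks=50))

def match_orb(img_prev, img_next, min_features=100, ratio=0.7):
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_next, None)
    if des1 is None or des2 is None or len(kp1) < min_features or len(kp2) < min_features:
        return None                     # discard frames with too few feature points (S411)
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]  # ratio test (added practice)
    pts_prev = [kp1[m.queryIdx].pt for m in good]
    pts_next = [kp2[m.trainIdx].pt for m in good]
    return pts_prev, pts_next
```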
s413, outlier removal with RANSAC: the algorithm randomly extracts 4 non-collinear samples from the matched data set and computes a parameter matrix from them, then tests all the data with this model and computes the projection error of the data consistent with the model; the model whose projection error is minimum is the optimal model;
the random sample consensus (RANSAC) algorithm is described in Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395;
s42: then decomposing the rotation matrix R and the translation vector t of the camera motion through singular value decomposition (SVD); let the SVD of E be

$$E = U \Sigma V^{T}$$

where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then

$$\hat{t}_{1} = U R_{Z}\!\left(\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{1} = U R_{Z}^{T}\!\left(\tfrac{\pi}{2}\right) V^{T}$$

$$\hat{t}_{2} = U R_{Z}\!\left(-\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{2} = U R_{Z}^{T}\!\left(-\tfrac{\pi}{2}\right) V^{T}$$

where $R_{Z}\!\left(\pm\tfrac{\pi}{2}\right)$ denotes the rotation matrix for a rotation of 90 degrees around the Z axis and $\hat{t}$ is the skew-symmetric matrix of t; substituting any matched point into the four candidate solutions, the solution for which the depth value of the point is positive is the correct solution;
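Steps S41 and S42 could be sketched with OpenCV as follows; cv2.recoverPose internally performs the SVD decomposition of E and the positive-depth (cheirality) check that selects the correct one of the four candidate solutions, so it is used here in place of an explicit decomposition. The RANSAC threshold and confidence values are illustrative assumptions.

```python
# Sketch of steps S41-S42: RANSAC estimation of F, E = K^T F K, and
# decomposition of E into R, t with the positive-depth check.
import cv2
import numpy as np

def relative_pose(pts_prev, pts_next, K):
    p1 = np.asarray(pts_prev, dtype=np.float64)
    p2 = np.asarray(pts_next, dtype=np.float64)
    F, inlier_mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.999)
    E = K.T @ F @ K                                    # essential matrix from F and K
    inliers = inlier_mask.ravel().astype(bool)
    _, R, t, _ = cv2.recoverPose(E, p1[inliers], p2[inliers], K)
    return R, t
```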
s43: then calculating the three-dimensional coordinate P of each matched feature point according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points; they satisfy

$$s_{1} x_{1} = s_{2} R x_{2} + t$$

where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least squares solution of the above formula gives the depths of the point in the two frame images, from which the three-dimensional spatial position, i.e. the three-dimensional coordinate, of the matched feature point is obtained;
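Step S43 reduces to a small least-squares problem per matched point; a sketch under the notation above (normalized coordinates x1, x2 and the motion R, t from step S42) follows.

```python
# Sketch of step S43: solve s1*x1 = s2*R*x2 + t for the two depths by least
# squares and recover the 3D coordinate of the matched feature point.
import numpy as np

def triangulate_point(x1, x2, R, t):
    """x1, x2: normalized homogeneous coordinates (3,) of the matched feature
    in the previous/subsequent frame; R, t: relative motion from step S42."""
    A = np.stack([x1, -R @ x2], axis=1)        # 3x2 system: [x1, -R x2] @ [s1, s2]^T = t
    s, *_ = np.linalg.lstsq(A, np.ravel(t), rcond=None)
    s1, s2 = s
    return s1 * x1, s1, s2                     # 3D point in the previous frame's coordinates
```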
s44: after the pose is initialized, the pose between adjacent frames is solved with the EPnP algorithm from the three-dimensional coordinates of the feature points of the previous frame and the corresponding two-dimensional pixel coordinates of those feature points in the subsequent frame, where the camera intrinsic parameters K comprise the camera focal length and the optical center, whose coordinates in the image plane are known; the EPnP algorithm is described in Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2), 155-166;
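A sketch of step S44 with OpenCV's EPnP solver is given below; the input layout (Nx3 object points, Nx2 image points) is an assumption made for illustration.

```python
# Sketch of step S44: pose between adjacent frames from 3D-2D correspondences
# via the EPnP solver.
import cv2
import numpy as np

def pose_from_3d_2d(points_3d, points_2d, K):
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_3d, dtype=np.float64),   # 3D feature points of the previous frame
        np.asarray(points_2d, dtype=np.float64),   # matched 2D pixels in the subsequent frame
        K, None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                     # rotation vector -> rotation matrix
    return R, tvec
```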
s5: calculating the pixel-wise depth value of each frame through the depth calculation network for the continuous frame images of the single-vehicle tracking video processed in step S3; the depth calculation network in step S5 is a DispNet network, and the depth values of the image pixels are calculated in an end-to-end manner; the input is the target frame image and the output is a depth map formed by the depth values of all pixel points of the target frame; the specific structure of the DispNet network is given in Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4040-4048);
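The sketch below shows only the end-to-end input/output behaviour of the depth network of step S5 (target frame in, dense depth map out) with a deliberately tiny encoder-decoder; it is NOT the DispNet architecture of Mayer et al., and the bounded inverse-depth output parameterization is a common choice rather than part of the description.

```python
# Tiny encoder-decoder stand-in for the depth network of step S5
# (illustrative only; see the cited DispNet paper for the real architecture).
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, image):                  # image: (B, 3, H, W), H and W divisible by 8
        disp = torch.sigmoid(self.decoder(self.encoder(image)))
        return 1.0 / (10.0 * disp + 0.01)      # bounded inverse-depth -> depth map (B, 1, H, W)
```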
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
in step S6, the image of the synthesized frame is calculated from the pose, relative to the target frame, of the frame following the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: first the target frame image is mapped into the three-dimensional coordinate system through the depth values of the target frame, and then the three-dimensional points are re-projected onto the image under the new viewing angle, the calculation formula being

$$p_{s} \sim K \, \hat{T}_{t \to s} \, \hat{D}_{t}(p_{t}) \, K^{-1} p_{t}$$

where $\hat{D}_{t}(p_{t})$ is the estimated depth of the point $p_{t}$ in the target frame, K is the camera intrinsic parameter matrix, $\hat{D}_{t}(p_{t}) K^{-1} p_{t}$ is the three-dimensional coordinate of the mapped point $p_{t}$, and $\hat{D}_{t}$ and the true depth $D_{t}$ are approximately equal; $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and $\hat{T}_{t \to s}$ and the true relative pose $T_{t \to s}$ are approximately equal;
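A sketch of the two-step synthesis of step S6 (back-project each target pixel with its predicted depth, transform with the virtual pose, re-project and sample the adjacent frame) is given below; the tensor shapes and the use of bilinear sampling via grid_sample are assumptions made for illustration.

```python
# Sketch of step S6: per-pixel back-projection, pose transformation,
# re-projection and inverse warping, following p_s ~ K T D(p) K^-1 p.
import torch
import torch.nn.functional as F

def synthesize_frame(source, depth, T, K):
    """source: (B,3,H,W) adjacent frame; depth: (B,1,H,W) target-frame depth;
    T: (B,4,4) target-to-source pose; K: (B,3,3) intrinsics."""
    B, _, H, W = source.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # homogeneous pixels (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, 3, H * W)
    cam = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)             # back-project: D(p) K^-1 p
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]                                  # re-project: K [R|t] X
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)                    # perspective division
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0                             # normalize to [-1, 1]
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(source, grid, padding_mode="zeros", align_corners=True)
```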
s7: constructing an objective function according to the synthesized frame of step S6 and the original frame of step S3, and training the depth model corresponding to the depth calculation network of step S5 under the constraint of the objective function, so as to obtain the computed depth map of the traffic camera video;
in step S7, an objective function is constructed from the differences between the series of synthesized images of step S6 and the original target images of step S3; for a single target vehicle having n training frame images $I_{1}, \dots, I_{n}$, the objective function is

$$L = \sum_{i=1}^{n} \sum_{p} \left| I_{i}(p) - \hat{I}_{i}(p) \right|$$

where p ranges over all points in an image, $I_{i}$ is the target frame, and $\hat{I}_{i}$ is the synthesized target frame image;
the depth estimation model is trained under the constraint of the objective function; the learning rate is set to 0.0002 and the number of iterations to 400,000.
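A minimal training-loop sketch for step S7 with the stated hyper-parameters is shown below; next_training_sample is a hypothetical placeholder for the data pipeline, and synthesize_frame refers to the sketch after step S6.

```python
# Sketch of step S7: train the depth network under the L1 photometric objective
# between synthesized and original frames (learning rate 2e-4, 400,000 iterations).
import torch

def train(depth_net, iterations=400_000, lr=2e-4):
    optimizer = torch.optim.Adam(depth_net.parameters(), lr=lr)
    for step in range(iterations):
        target, source, T, K = next_training_sample()         # hypothetical data pipeline
        depth = depth_net(target)                              # S5: per-pixel depth of the target frame
        synthesized = synthesize_frame(source, depth, T, K)    # S6: view synthesis (sketch above)
        loss = (synthesized - target).abs().mean()             # S7: photometric L1 objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```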
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.