Disclosure of Invention
The invention aims to provide a depth map estimation method based on a traffic camera, which can complete depth calculation for target distance measurement or positioning by using data acquired by a fixed camera.
In order to solve the technical problems, the invention adopts the technical scheme that: the depth map estimation method based on the traffic camera specifically comprises the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
s3: performing background replacement on the continuous frame images of the vehicle tracking video in the step S2 to obtain an original frame;
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
s5: calculating a pixel-based depth value of each frame through a depth neural network for the continuous frame images of the single vehicle tracking video processed in the step S3;
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
s7: constructing an objective function according to the synthesized frame of step S6 and the original frame of step S3, and training the depth neural network model of step S5 under the constraint of the objective function, so as to obtain the computed depth map of the traffic camera video.
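For illustration only, the outline below sketches how steps S1 to S7 might be chained together in code. Every helper name (load_video, segment_and_track_vehicles, replace_background, estimate_virtual_poses, adjacent_pairs, synthesize_frame, photometric_objective, update_network) is a hypothetical placeholder for the operation described in the corresponding step, not part of the claimed method; each step is detailed in the preferred embodiments below.

```python
# Outline-level sketch of steps S1-S7; all helper names are hypothetical
# placeholders for the operations described in the steps above.
def estimate_depth_model(video_path, K, depth_net):
    frames = load_video(video_path)                              # S1: fixed, fixed-focus camera
    tracks = segment_and_track_vehicles(frames)                  # S2: segmentation + tracking
    for track in tracks:                                         # one tracked vehicle at a time
        originals = replace_background(track)                    # S3: original frames, uniform background
        poses = estimate_virtual_poses(originals, K)             # S4: virtual camera poses
        for target, source, T in adjacent_pairs(originals, poses):
            depth = depth_net(target)                            # S5: per-pixel depth of the target frame
            synthesized = synthesize_frame(source, depth, T, K)  # S6: synthesized frame
            loss = photometric_objective(synthesized, target)    # S7: objective function
            update_network(depth_net, loss)                      # train under the objective
    return depth_net
```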
As a preferred technical solution of the present invention, the step S2 includes the following steps:
s21: segmenting all vehicles in each frame of the traffic video data by using an image semantic segmentation model, and merging all categories other than vehicles into one class;
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree of the image area covered by the template and the template according to a normalized correlation coefficient matching method, and selecting the vehicle with the maximum matching degree with the template as a tracking target to realize the rapid tracking of the vehicle;
as a preferred technical solution of the present invention, in step S224, the matching degree between the image area covered by the template and the template is calculated according to the normalized correlation coefficient matching method, and the correlation coefficient is computed as

$$R(x,y)=\frac{\sum_{x',y'} T'(x',y')\, I'(x+x',y+y')}{\sqrt{\sum_{x',y'} T'(x',y')^{2} \cdot \sum_{x',y'} I'(x+x',y+y')^{2}}}$$

where $T(x',y')$ denotes the template image, $T'(x',y')=T(x',y')-\bar{T}$ is the difference between the template and the template mean, and $I'(x+x',y+y')=I(x+x',y+y')-\bar{I}$ is the difference between the original image and the original image mean. Only the vehicle contour is segmented in this step.
As a preferred technical solution of the present invention, the step S3 includes the following steps: firstly, extracting the continuous frames of a single vehicle and setting all pixels other than the target vehicle in each continuous frame image to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates the subsequent extraction and matching of feature points, and the original frames are thus obtained. The background of each frame image is replaced in order to prevent the static background and other dynamic objects in the image from affecting the virtual camera pose estimation, thereby facilitating the subsequent extraction and matching of feature points.
As a preferred embodiment of the present invention, in step S4, the virtual pose is determined by converting the motion of the vehicle relative to the camera into a virtual motion of the camera;
the method specifically comprises the following steps of:
s41: firstly, extracting the features of the target vehicle in the continuous frames and matching the features of the previous and subsequent frames; removing outliers with the random sample consensus (RANSAC) algorithm to solve the fundamental matrix F, and then solving the essential matrix E by combining the camera intrinsic parameters K, where the relation among the camera intrinsic parameters K, the fundamental matrix F and the essential matrix E is

$$E = K^{T} F K$$
s42: then decomposing the rotation matrix R and the translation vector t of the camera motion through singular value decomposition (SVD); let the SVD of E be

$$E = U \Sigma V^{T}$$

where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then

$$\hat{t}_{1} = U R_{Z}\!\left(\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{1} = U R_{Z}^{T}\!\left(\tfrac{\pi}{2}\right) V^{T}$$

$$\hat{t}_{2} = U R_{Z}\!\left(-\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{2} = U R_{Z}^{T}\!\left(-\tfrac{\pi}{2}\right) V^{T}$$

where $R_{Z}\!\left(\pm\tfrac{\pi}{2}\right)$ denotes the rotation matrix for a rotation of 90 degrees around the Z axis and $\hat{t}$ is the skew-symmetric matrix of t; substituting any matched point into the four candidate solutions, the solution for which the depth value of the point is positive is the correct solution;
s43: calculating the three-dimensional coordinate P of each matched feature point according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points; they satisfy

$$s_{1} x_{1} = s_{2} R x_{2} + t$$

where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least squares solution of the above formula gives the depths of the point in the two frame images, from which the three-dimensional spatial position, i.e. the three-dimensional coordinate, of the matched feature point is obtained;
s44: after the pose is initialized, the pose between adjacent frames is solved from the three-dimensional coordinates of the feature points in the previous frame and the corresponding two-dimensional pixel coordinates of those feature points in the subsequent frame.
As a preferred technical solution of the present invention, the depth calculation network in step S5 is a DispNet network, and the depth value of the image pixel is calculated in an end-to-end manner; the input is a target frame image, and the output is a depth map formed by the depth values of all pixel points of the target frame.
As a preferred technical solution of the present invention, in step S6 the image of the synthesized frame is calculated from the pose, relative to the target frame, of the frame following the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: first the target frame image is mapped into the three-dimensional coordinate system through the depth values of the target frame, and then the three-dimensional points are re-projected onto the image under the new viewing angle, the calculation formula being

$$p_{s} \sim K \, \hat{T}_{t \to s} \, \hat{D}_{t}(p_{t}) \, K^{-1} p_{t}$$

where $\hat{D}_{t}(p_{t})$ is the estimated depth of the point $p_{t}$ in the target frame, K is the camera intrinsic parameter matrix, $\hat{D}_{t}(p_{t}) K^{-1} p_{t}$ is the three-dimensional coordinate of the mapped point $p_{t}$, and $\hat{D}_{t}$ and the true depth $D_{t}$ are approximately equal; $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and $\hat{T}_{t \to s}$ and the true relative pose $T_{t \to s}$ are approximately equal.
As a preferred embodiment of the present invention, in step S7 an objective function is constructed from the differences between the series of synthesized images of step S6 and the original target images of step S3; for a single target vehicle having n training frame images $I_{1}, \dots, I_{n}$, the objective function is

$$L = \sum_{i=1}^{n} \sum_{p} \left| I_{i}(p) - \hat{I}_{i}(p) \right|$$

where p ranges over all points in an image, $I_{i}$ is the target frame, and $\hat{I}_{i}$ is the synthesized target frame image.
Compared with the prior art, the invention has the following beneficial effects: by converting the depth calculation problem in the fixed-camera scene of a traffic camera into the calculation of the pose of a virtual moving camera plus a depth calculation, the difficulty of obtaining image depth from a fixed monocular camera, whose viewing angle never changes, is overcome; the segmented moving-vehicle images provide the basis for monocular camera pose initialization and subsequent pose calculation in the virtual pose computation; the technical scheme can complete the depth calculation using only video data acquired by a fixed camera with known camera intrinsic parameters, which facilitates use in actual traffic monitoring scenes, since each camera can learn its own depth calculation model and the poor generalization of depth models caused by differences in camera mounting height, scene changes and the like need not be considered; the depth map obtained by the depth calculation model can provide a basis for tasks such as distance measurement and positioning of vehicles by the traffic camera.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments of the present invention.
Example: as shown in figs. 1 to 4, the depth map estimation method based on the traffic camera specifically includes the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
the specific steps of step S2 are:
s21: segmenting all vehicles in each frame of the video by using the image semantic segmentation model DeepLabv3+ trained on the urban road scene dataset Cityscapes, and merging all remaining categories into one "other" class, as shown in fig. 2;
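By way of illustration, a minimal sketch of step S21 is given below. It uses torchvision's pretrained DeepLabV3 (torchvision >= 0.13, trained on a COCO/VOC label set) as a stand-in, because a Cityscapes-trained DeepLabv3+ checkpoint is not bundled with torchvision; the class index 7 ("car") and the weights argument are assumptions tied to that substitute model, not part of the embodiment.

```python
# Sketch of step S21 with a pretrained stand-in model (NOT the Cityscapes
# DeepLabv3+ of the embodiment): every pixel not predicted as "car" is
# merged into a single "other" class.
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def vehicle_mask(frame_rgb):
    """Return a boolean (H, W) mask of pixels predicted as 'car'."""
    x = preprocess(frame_rgb).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)["out"][0]         # (num_classes, H, W)
    return (logits.argmax(0) == 7).numpy()  # class 7 = "car" in this label set
```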
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree between the image area covered by the template and the template according to the normalized correlation coefficient matching method, and selecting the vehicle with the highest matching degree with the template as the tracking target, so as to realize fast tracking of the vehicle; only the vehicle contour is segmented in this step; in step S224 the correlation coefficient is computed as

$$R(x,y)=\frac{\sum_{x',y'} T'(x',y')\, I'(x+x',y+y')}{\sqrt{\sum_{x',y'} T'(x',y')^{2} \cdot \sum_{x',y'} I'(x+x',y+y')^{2}}}$$

where $T(x',y')$ denotes the template image, $T'(x',y')=T(x',y')-\bar{T}$ is the difference between the template and the template mean, and $I'(x+x',y+y')=I(x+x',y+y')-\bar{I}$ is the difference between the original image and the original image mean; on the condition that the matching coefficient is greater than 0.7, the vehicle with the highest matching degree with the template is selected as the tracking target;
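A minimal sketch of steps S221 to S224 using OpenCV's normalized correlation coefficient matching (cv2.TM_CCOEFF_NORMED, which implements the formula above) follows; the data layout of the contour and of the envelope boxes is an assumption made for illustration.

```python
# Sketch of steps S221-S224: envelope box from a segmentation contour and
# template matching by normalized correlation coefficient; a candidate is
# accepted only if the matching coefficient exceeds 0.7.
import cv2

def envelope_box(contour):
    """Axis-aligned envelope from the min/max of the contour coordinates (S221)."""
    xs, ys = contour[:, 0], contour[:, 1]
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def track_by_template(prev_frame, prev_box, next_frame, next_boxes, min_score=0.7):
    """Return the index of the next-frame vehicle best matching the template, or None."""
    x0, y0, x1, y1 = prev_box
    template = prev_frame[y0:y1, x0:x1]              # S222: previous-frame envelope as template
    th, tw = template.shape[:2]
    best_idx, best_score = None, min_score
    for i, (bx0, by0, bx1, by1) in enumerate(next_boxes):
        cx, cy = (bx0 + bx1) // 2, (by0 + by1) // 2  # S223: align template centre with candidate centre
        rx0, ry0 = max(cx - tw // 2, 0), max(cy - th // 2, 0)
        region = next_frame[ry0:ry0 + th, rx0:rx0 + tw]
        if region.shape[:2] != (th, tw):
            continue                                 # candidate too close to the image border
        score = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)[0, 0]
        if score > best_score:                       # S224: highest correlation above 0.7
            best_idx, best_score = i, score
    return best_idx
```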
s3: performing background replacement on the continuous frame images of the vehicle tracking video of step S2 to obtain the original frames; the specific steps of step S3 are: firstly, extracting the continuous frames of a single vehicle and setting all pixels other than the target vehicle in each continuous frame image to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates the subsequent extraction and matching of feature points, and the original frames are thus obtained; the background of each frame image is replaced in order to prevent the static background and other dynamic objects in the image from affecting the virtual camera pose estimation, thereby facilitating the subsequent extraction and matching of feature points; the replacement method is as follows: first count the number N of vehicles in the image and generate N background images of uniform background color, then, using the vehicle segmentation contours as position references, extract the RGB values of the vehicle pixels and write them into the same positions of the corresponding target background, thereby generating N images each containing only one target vehicle, where the background pixel RGB value is set to (192, 192, 192) to distinguish it from the vehicles, as shown in fig. 3;
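The background-replacement step S3 could be sketched as follows; the mask representation (one boolean mask per segmented vehicle) is an assumption made for illustration.

```python
# Sketch of step S3: for each of the N segmented vehicles, produce an image in
# which every pixel outside that vehicle is set to the uniform value (192, 192, 192).
import numpy as np

BACKGROUND = np.array([192, 192, 192], dtype=np.uint8)

def replace_background(frame, vehicle_masks):
    """frame: (H, W, 3) RGB image; vehicle_masks: list of boolean (H, W) masks."""
    outputs = []
    for mask in vehicle_masks:
        img = np.empty_like(frame)
        img[...] = BACKGROUND          # uniform background colour
        img[mask] = frame[mask]        # keep the target vehicle's RGB values
        outputs.append(img)
    return outputs
```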
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
said step S4 determines the virtual pose by converting the motion of the vehicle relative to the camera into a virtual motion of the camera; as shown in fig. 4, the three-dimensional spatial position of the camera optical center and the initial position of the moving target point are marked, and the motion trajectory of the target point corresponds to a virtual motion trajectory of the camera optical center relative to the target point;
The method specifically comprises the following steps of:
s41: firstly, extracting ORB features from the target vehicle in the continuous frames and matching the features of the previous and subsequent frames; removing outliers with the random sample consensus (RANSAC) algorithm to solve the fundamental matrix F, and then solving the essential matrix E by combining the camera intrinsic parameters K, where the relation among the camera intrinsic parameters K, the fundamental matrix F and the essential matrix E is

$$E = K^{T} F K$$
s411, extracting ORB features: firstly, corner points are extracted with the FAST algorithm; a pixel p with brightness $I_{p}$ is selected in the image and a threshold $T$ is set; taking p as the center, the 16 pixels on a circle of radius 3 are examined, and if 12 consecutive points on the circle have brightness greater than $I_{p}+T$ or less than $I_{p}-T$, p is taken as a feature point; the above operation is performed for every pixel in the image until all feature points are found; when fewer than 100 feature points are extracted from a frame, the frame is discarded; the descriptor of each feature point is then computed with the BRIEF algorithm, which represents the brightness relations of 128 random pixel points near the feature point with 0s and 1s, forming a vector;
s412, feature matching: the features are matched quickly using the fast approximate nearest neighbor (FLANN) algorithm, which is described in Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2(331-340), and is implemented by the open-source library FLANN (https://github.com/mariusmuja/flann);
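For illustration, steps S411 and S412 could be implemented with OpenCV's ORB detector (which combines the FAST corner test and BRIEF-style descriptors described above) and a FLANN matcher; the LSH index parameters and the Lowe ratio test are common practice added here, not part of the description.

```python
# Sketch of steps S411-S412: ORB feature extraction and FLANN-based matching.
import cv2

orb = cv2.ORB_create(nfeatures=1000)
flann = cv2.FlannBasedMatcher(
    dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),  # LSH index for binary descriptors
    dict(checks=50))

def match_orb(img_prev, img_next, min_features=100, ratio=0.7):
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_next, None)
    if des1 is None or des2 is None or len(kp1) < min_features or len(kp2) < min_features:
        return None                     # discard frames with too few feature points (S411)
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches
            if len(m) == 2 and m[0].distance < ratio * m[1].distance]  # ratio test (added practice)
    pts_prev = [kp1[m.queryIdx].pt for m in good]
    pts_next = [kp2[m.trainIdx].pt for m in good]
    return pts_prev, pts_next
```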
s413, outlier removal with RANSAC: the algorithm randomly extracts 4 non-collinear samples from the matched data set and computes a parameter matrix from them, then tests all the data with this model and computes the projection error of the data consistent with the model; the model whose projection error is minimum is the optimal model;
the random sample consensus (RANSAC) algorithm is described in Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395;
s42: then decomposing the rotation matrix R and the translation vector t of the camera motion through singular value decomposition (SVD); let the SVD of E be

$$E = U \Sigma V^{T}$$

where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then

$$\hat{t}_{1} = U R_{Z}\!\left(\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{1} = U R_{Z}^{T}\!\left(\tfrac{\pi}{2}\right) V^{T}$$

$$\hat{t}_{2} = U R_{Z}\!\left(-\tfrac{\pi}{2}\right) \Sigma U^{T}, \qquad R_{2} = U R_{Z}^{T}\!\left(-\tfrac{\pi}{2}\right) V^{T}$$

where $R_{Z}\!\left(\pm\tfrac{\pi}{2}\right)$ denotes the rotation matrix for a rotation of 90 degrees around the Z axis and $\hat{t}$ is the skew-symmetric matrix of t; substituting any matched point into the four candidate solutions, the solution for which the depth value of the point is positive is the correct solution;
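Steps S41 and S42 could be sketched with OpenCV as follows; cv2.recoverPose internally performs the SVD decomposition of E and the positive-depth (cheirality) check that selects the correct one of the four candidate solutions, so it is used here in place of an explicit decomposition. The RANSAC threshold and confidence values are illustrative assumptions.

```python
# Sketch of steps S41-S42: RANSAC estimation of F, E = K^T F K, and
# decomposition of E into R, t with the positive-depth check.
import cv2
import numpy as np

def relative_pose(pts_prev, pts_next, K):
    p1 = np.asarray(pts_prev, dtype=np.float64)
    p2 = np.asarray(pts_next, dtype=np.float64)
    F, inlier_mask = cv2.findFundamentalMat(p1, p2, cv2.FM_RANSAC, 1.0, 0.999)
    E = K.T @ F @ K                                    # essential matrix from F and K
    inliers = inlier_mask.ravel().astype(bool)
    _, R, t, _ = cv2.recoverPose(E, p1[inliers], p2[inliers], K)
    return R, t
```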
s43: then calculating the three-dimensional coordinate P of each matched feature point according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points; they satisfy

$$s_{1} x_{1} = s_{2} R x_{2} + t$$

where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least squares solution of the above formula gives the depths of the point in the two frame images, from which the three-dimensional spatial position, i.e. the three-dimensional coordinate, of the matched feature point is obtained;
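Step S43 reduces to a small least-squares problem per matched point; a sketch under the notation above (normalized coordinates x1, x2 and the motion R, t from step S42) follows.

```python
# Sketch of step S43: solve s1*x1 = s2*R*x2 + t for the two depths by least
# squares and recover the 3D coordinate of the matched feature point.
import numpy as np

def triangulate_point(x1, x2, R, t):
    """x1, x2: normalized homogeneous coordinates (3,) of the matched feature
    in the previous/subsequent frame; R, t: relative motion from step S42."""
    A = np.stack([x1, -R @ x2], axis=1)        # 3x2 system: [x1, -R x2] @ [s1, s2]^T = t
    s, *_ = np.linalg.lstsq(A, np.ravel(t), rcond=None)
    s1, s2 = s
    return s1 * x1, s1, s2                     # 3D point in the previous frame's coordinates
```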
s44: after the pose is initialized, the pose between adjacent frames is solved with the EPnP algorithm from the three-dimensional coordinates of the feature points of the previous frame and the corresponding two-dimensional pixel coordinates of those feature points in the subsequent frame, where the camera intrinsic parameters K comprise the camera focal length and the optical center, whose coordinates in the image plane are known; the EPnP algorithm is described in Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2), 155-166;
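A sketch of step S44 with OpenCV's EPnP solver is given below; the input layout (Nx3 object points, Nx2 image points) is an assumption made for illustration.

```python
# Sketch of step S44: pose between adjacent frames from 3D-2D correspondences
# via the EPnP solver.
import cv2
import numpy as np

def pose_from_3d_2d(points_3d, points_2d, K):
    ok, rvec, tvec = cv2.solvePnP(
        np.asarray(points_3d, dtype=np.float64),   # 3D feature points of the previous frame
        np.asarray(points_2d, dtype=np.float64),   # matched 2D pixels in the subsequent frame
        K, None, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                     # rotation vector -> rotation matrix
    return R, tvec
```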
s5: calculating the pixel-wise depth value of each frame through the depth calculation network for the continuous frame images of the single-vehicle tracking video processed in step S3; the depth calculation network in step S5 is a DispNet network, and the depth values of the image pixels are calculated in an end-to-end manner; the input is the target frame image and the output is a depth map formed by the depth values of all pixel points of the target frame; the specific structure of the DispNet network is given in Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4040-4048);
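The sketch below shows only the end-to-end input/output behaviour of the depth network of step S5 (target frame in, dense depth map out) with a deliberately tiny encoder-decoder; it is NOT the DispNet architecture of Mayer et al., and the bounded inverse-depth output parameterization is a common choice rather than part of the description.

```python
# Tiny encoder-decoder stand-in for the depth network of step S5
# (illustrative only; see the cited DispNet paper for the real architecture).
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1))

    def forward(self, image):                  # image: (B, 3, H, W), H and W divisible by 8
        disp = torch.sigmoid(self.decoder(self.encoder(image)))
        return 1.0 / (10.0 * disp + 0.01)      # bounded inverse-depth -> depth map (B, 1, H, W)
```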
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
in step S6, the image of the synthesized frame is calculated from the pose, relative to the target frame, of the frame following the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: first the target frame image is mapped into the three-dimensional coordinate system through the depth values of the target frame, and then the three-dimensional points are re-projected onto the image under the new viewing angle, the calculation formula being

$$p_{s} \sim K \, \hat{T}_{t \to s} \, \hat{D}_{t}(p_{t}) \, K^{-1} p_{t}$$

where $\hat{D}_{t}(p_{t})$ is the estimated depth of the point $p_{t}$ in the target frame, K is the camera intrinsic parameter matrix, $\hat{D}_{t}(p_{t}) K^{-1} p_{t}$ is the three-dimensional coordinate of the mapped point $p_{t}$, and $\hat{D}_{t}$ and the true depth $D_{t}$ are approximately equal; $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and $\hat{T}_{t \to s}$ and the true relative pose $T_{t \to s}$ are approximately equal;
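A sketch of the two-step synthesis of step S6 (back-project each target pixel with its predicted depth, transform with the virtual pose, re-project and sample the adjacent frame) is given below; the tensor shapes and the use of bilinear sampling via grid_sample are assumptions made for illustration.

```python
# Sketch of step S6: per-pixel back-projection, pose transformation,
# re-projection and inverse warping, following p_s ~ K T D(p) K^-1 p.
import torch
import torch.nn.functional as F

def synthesize_frame(source, depth, T, K):
    """source: (B,3,H,W) adjacent frame; depth: (B,1,H,W) target-frame depth;
    T: (B,4,4) target-to-source pose; K: (B,3,3) intrinsics."""
    B, _, H, W = source.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()   # homogeneous pixels (3, H, W)
    pix = pix.view(1, 3, -1).expand(B, 3, H * W)
    cam = (torch.inverse(K) @ pix) * depth.view(B, 1, -1)             # back-project: D(p) K^-1 p
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W)], dim=1)
    proj = K @ (T @ cam_h)[:, :3, :]                                  # re-project: K [R|t] X
    uv = proj[:, :2, :] / (proj[:, 2:3, :] + 1e-7)                    # perspective division
    u = 2.0 * uv[:, 0, :] / (W - 1) - 1.0                             # normalize to [-1, 1]
    v = 2.0 * uv[:, 1, :] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(source, grid, padding_mode="zeros", align_corners=True)
```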
s7: constructing an objective function according to the synthesized frame of step S6 and the original frame of step S3, and training the depth model corresponding to the depth calculation network of step S5 under the constraint of the objective function, so as to obtain the computed depth map of the traffic camera video;
in step S7, an objective function is constructed from the differences between the series of synthesized images of step S6 and the original target images of step S3; for a single target vehicle having n training frame images $I_{1}, \dots, I_{n}$, the objective function is

$$L = \sum_{i=1}^{n} \sum_{p} \left| I_{i}(p) - \hat{I}_{i}(p) \right|$$

where p ranges over all points in an image, $I_{i}$ is the target frame, and $\hat{I}_{i}$ is the synthesized target frame image;
the depth estimation model is trained under the constraint of the objective function; the learning rate is set to 0.0002 and the number of iterations to 400,000.
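A minimal training-loop sketch for step S7 with the stated hyper-parameters is shown below; next_training_sample is a hypothetical placeholder for the data pipeline, and synthesize_frame refers to the sketch after step S6.

```python
# Sketch of step S7: train the depth network under the L1 photometric objective
# between synthesized and original frames (learning rate 2e-4, 400,000 iterations).
import torch

def train(depth_net, iterations=400_000, lr=2e-4):
    optimizer = torch.optim.Adam(depth_net.parameters(), lr=lr)
    for step in range(iterations):
        target, source, T, K = next_training_sample()         # hypothetical data pipeline
        depth = depth_net(target)                              # S5: per-pixel depth of the target frame
        synthesized = synthesize_frame(source, depth, T, K)    # S6: view synthesis (sketch above)
        loss = (synthesized - target).abs().mean()             # S7: photometric L1 objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```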
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.