CN112801074B - Depth map estimation method based on traffic camera - Google Patents

Depth map estimation method based on traffic camera

Info

Publication number
CN112801074B
Authority
CN
China
Prior art keywords
frame
depth
image
vehicle
camera
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110403339.6A
Other languages
Chinese (zh)
Other versions
CN112801074A (en)
Inventor
李俊
宛蓉
吉玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tianshu Intelligent Co ltd
Original Assignee
Speed Space Time Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Speed Space Time Information Technology Co Ltd filed Critical Speed Space Time Information Technology Co Ltd
Priority to CN202110403339.6A priority Critical patent/CN112801074B/en
Publication of CN112801074A publication Critical patent/CN112801074A/en
Application granted granted Critical
Publication of CN112801074B publication Critical patent/CN112801074B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/52 Scenes; scene-specific elements: context or environment of the image; surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N3/04 Computing arrangements based on biological models: neural networks; architecture, e.g. interconnection topology
    • G06N3/08 Computing arrangements based on biological models: neural networks; learning methods
    • G06V10/267 Image preprocessing: segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/44 Extraction of image or video features: local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/56 Extraction of image or video features relating to colour
    • G06V2201/08 Indexing scheme relating to image or video recognition or understanding: detecting or categorising vehicles


Abstract

The invention discloses a depth map estimation method based on a traffic camera, which comprises the following steps: S1, collecting traffic video data with a traffic camera that has a fixed position and a fixed focal length; S2, performing vehicle segmentation and tracking on the traffic video data with a semantic segmentation method; S3, replacing the background of the continuous frame images of the vehicle tracking video of step S2 to obtain original frames; S4, calculating the virtual pose of the camera from the continuous frame images of the single-vehicle tracking video processed in step S3; S5, calculating a per-pixel depth value for each frame of the continuous frame images processed in step S3 with the depth calculation network; S6, synthesizing the target frame image according to the virtual pose of step S4 and the pixel depth values of step S5 to obtain a synthesized frame; S7, constructing an objective function from the synthesized frame of step S6 and the original frame of step S3 and training the depth model corresponding to the depth calculation network, thereby obtaining a computed depth map of the traffic camera video.

Description

Depth map estimation method based on traffic camera
Technical Field
The invention relates to the technical field of video monitoring, in particular to a depth map estimation method based on a traffic camera.
Background
Measuring the depth of targets in video from traffic surveillance cameras supports intelligent monitoring applications such as vehicle positioning and speed measurement. Because traffic surveillance mostly uses fixed monocular cameras, three-dimensional depth information cannot be obtained directly, and ranging and positioning of a target point in the video usually require external three-dimensional information as a reference.
Determining the extrinsic parameters of the camera is a common approach. For example, patent CN103578109A proposes placing a calibration tool several times to obtain calibration points, computing the camera parameters from the relationship between the image coordinate system and the world coordinate system, and obtaining the distance of a ranging point by projection through the image coordinate system. Patent CN108805936A proposes, without placing markers, selecting more than four correspondences between the video image and physical space by rendering a grid to compute the extrinsic parameters, and obtaining satisfactory camera extrinsics by calibrating more physical points and correcting. Methods that constrain the target-point depth by geometric relationships, such as measurements of a reference object of known size, a ground-plane assumption, or the camera height, likewise require manual assistance and pre-computation.
Other methods use additional devices to assist ranging. The ranging method and device of patent CN104079868B obtain the distance of the target point with a laser-assisted camera; patent CN105550670B uses two cameras and the principle of binocular recognition and positioning to generate a three-dimensional point cloud of the viewing area and thus obtain the target position; in patent CN102168954B, two images of the same stationary target are collected by rotating a pan-tilt head and the target's depth information is measured, giving the depth of the image point corresponding to the three-dimensional point. Such depth calculation methods based on binocular or multi-view imaging and stereo matching are costly.
Therefore, it is necessary to develop a depth map estimation method based on a traffic camera, which can complete depth calculation for target ranging or positioning using data collected by a fixed camera.
Disclosure of Invention
The invention aims to provide a depth map estimation method based on a traffic camera, which can complete depth calculation for target distance measurement or positioning by using data acquired by a fixed camera.
In order to solve the technical problems, the invention adopts the technical scheme that: the depth map estimation method based on the traffic camera specifically comprises the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
s3: performing background replacement on the continuous frame images of the vehicle tracking video in the step S2 to obtain an original frame;
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
s5: calculating a pixel-based depth value of each frame through a depth neural network for the continuous frame images of the single vehicle tracking video processed in the step S3;
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
s7: constructing an objective function according to the synthesized frame in the step S6 and the original frame in the step S3, and training the depth neural network model in the step S5 according to the constraint of the objective function, so as to obtain a depth computation graph of the traffic camera video.
As a preferred technical solution of the present invention, the step S2 includes the following steps:
s21: segmenting all vehicles of each frame in the traffic video data by using an image semantic segmentation model, and combining the categories except the vehicles;
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree of the image area covered by the template and the template according to a normalized correlation coefficient matching method, and selecting the vehicle with the maximum matching degree with the template as a tracking target to realize the rapid tracking of the vehicle;
as a preferred technical solution of the present invention, in step S224, the matching degree between the image area covered by the template and the template is calculated according to a normalized correlation coefficient matching method, and a calculation formula of the correlation coefficient is:
$$R(x,y)=\frac{\sum_{x',y'}\bigl(T'(x',y')\cdot I'(x+x',y+y')\bigr)}{\sqrt{\sum_{x',y'}T'(x',y')^{2}\cdot\sum_{x',y'}I'(x+x',y+y')^{2}}}$$
wherein $T$ represents the template image, $T'$ represents the difference between the template and the template mean, and $I'$ represents the difference between the original image and the mean of the original image over the covered region. Only the vehicle contour is segmented in this step.
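As an illustration only, the normalized-correlation template matching of steps S221–S224 can be realized with OpenCV's matchTemplate; the helper below is a minimal sketch under assumed inputs (envelope boxes as (x, y, w, h) tuples, BGR images) and is not the patented implementation, and the 0.7 acceptance threshold mirrors the value quoted later in the embodiment.

```python
import cv2
import numpy as np

def match_vehicle(template_bgr, next_frame_bgr, candidate_boxes, min_score=0.7):
    """Pick the candidate envelope box in the next frame whose centre-aligned patch
    best matches the previous-frame template under the normalized correlation coefficient."""
    th, tw = template_bgr.shape[:2]
    best_box, best_score = None, -1.0
    for (x, y, w, h) in candidate_boxes:            # envelope boxes of vehicles in the next frame
        cx, cy = x + w // 2, y + h // 2             # align the template centre with the box centre
        x0, y0 = cx - tw // 2, cy - th // 2
        patch = next_frame_bgr[max(y0, 0):y0 + th, max(x0, 0):x0 + tw]
        if patch.shape[:2] != (th, tw):             # template would fall outside the image
            continue
        score = cv2.matchTemplate(patch, template_bgr, cv2.TM_CCOEFF_NORMED)[0, 0]
        if score > best_score:
            best_score, best_box = score, (x, y, w, h)
    return best_box if best_score >= min_score else None
```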
As a preferred technical solution of the present invention, the step S3 includes the following steps: first, the continuous frames of a single vehicle are extracted, and all pixels except the target vehicle in each frame are set to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates the subsequent feature point extraction and matching, and the resulting images are the original frames. The essential point is that background replacement is performed on every frame so that the static background and other dynamic objects in the image do not disturb the virtual-camera pose estimation, again easing subsequent feature point extraction and matching.
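A minimal sketch of the background replacement of step S3, assuming the per-vehicle segmentation masks from step S2 are available as boolean arrays; the function names and mask format are illustrative, and the grey value (192,192,192) is the one quoted in the embodiment below.

```python
import numpy as np

def isolate_vehicle(frame_bgr, vehicle_mask, background_bgr=(192, 192, 192)):
    """Keep only the target vehicle's pixels and paint everything else in a uniform colour."""
    out = np.empty_like(frame_bgr)
    out[:] = background_bgr                      # uniform background, contrasting with the vehicle
    out[vehicle_mask] = frame_bgr[vehicle_mask]  # copy the vehicle's RGB values at their positions
    return out

def isolate_all_vehicles(frame_bgr, vehicle_masks):
    """N vehicle masks -> N 'original frames', each containing exactly one target vehicle."""
    return [isolate_vehicle(frame_bgr, m) for m in vehicle_masks]
```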
As a preferred embodiment of the present invention, in step S4 the virtual pose is determined by converting the motion of the vehicle relative to the camera into a virtual motion of the camera;
the calculation of the virtual pose specifically comprises the following steps:
S41: first, features of the target vehicle are extracted in the continuous frames and matched between the previous and the subsequent frame; noise points are removed by the random sample consensus (RANSAC) algorithm to solve the fundamental matrix F, and the essential matrix E is then obtained by combining the camera intrinsic parameters K, the relation between the camera intrinsics K, the fundamental matrix F and the essential matrix E being
$$E = K^{T} F K$$
S42: the rotation matrix R and the translation vector t of the camera motion are then obtained by singular value decomposition (SVD); let the SVD of E be
$$E = U \Sigma V^{T}$$
where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then
$$\hat{t} = U\, R_{Z}\!\left(\pm\tfrac{\pi}{2}\right) \Sigma\, U^{T}, \qquad R = U\, R_{Z}^{T}\!\left(\pm\tfrac{\pi}{2}\right) V^{T}$$
where $R_{Z}(\pm\tfrac{\pi}{2})$ denotes the rotation matrix obtained by rotating 90 degrees around the Z axis; any matched point is substituted into the four candidate solutions above, and the solution for which the depth of the point is positive is the correct one;
S43: the three-dimensional coordinate P of the matched feature points is calculated according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points, which satisfy
$$s_{1} x_{1} = s_{2} R x_{2} + t$$
where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least-squares solution of this equation gives the depths of the point in the two frames and hence the three-dimensional spatial position, i.e. the three-dimensional coordinates, of the matched feature points;
s44: after the pose is initialized, the pose between the adjacent frames is solved according to the three-dimensional coordinates of the feature points of the previous frame and the two-dimensional pixel coordinates of the feature points of the next frame corresponding to the three-dimensional coordinates.
As a preferred technical solution of the present invention, the depth calculation network in step S5 is a DispNet network, and the depth value of the image pixel is calculated in an end-to-end manner; the input is a target frame image, and the output is a depth map formed by the depth values of all pixel points of the target frame.
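The pose chain of steps S41–S44 above (fundamental matrix with RANSAC, essential matrix from the intrinsics, pose recovery, triangulation, then EPnP for later frames) corresponds to standard multi-view geometry routines; the snippet below is a hedged OpenCV sketch that assumes matched pixel coordinates are already available, and is not the exact computation claimed by the invention.

```python
import cv2
import numpy as np

def init_pose_and_points(pts_prev, pts_next, K):
    """Two-view initialisation from matched pixel coordinates (Nx2 float arrays):
    F via RANSAC, E = K^T F K, pose recovery, and triangulation of the inliers."""
    F, mask = cv2.findFundamentalMat(pts_prev, pts_next, cv2.FM_RANSAC, 1.0, 0.999)
    inl_prev = pts_prev[mask.ravel() == 1]
    inl_next = pts_next[mask.ravel() == 1]
    E = K.T @ F @ K                                        # essential matrix from F and the intrinsics
    _, R, t, _ = cv2.recoverPose(E, inl_prev, inl_next, K) # keeps the solution with positive depths
    P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])      # previous frame as the reference view
    P1 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P0, P1, inl_prev.T.astype(np.float64),
                                  inl_next.T.astype(np.float64))
    pts3d = (pts4d[:3] / pts4d[3]).T                       # 3-D coordinates of the matched points
    return R, t, pts3d, inl_next

def next_pose(pts3d, pts2d_next, K):
    """Pose of a later frame from 3-D points of the previous frame and their 2-D matches (EPnP)."""
    _, rvec, tvec = cv2.solvePnP(pts3d.astype(np.float64), pts2d_next.astype(np.float64),
                                 K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```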
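For the per-pixel depth network of step S5 the patent names DispNet; purely as a stand-in to show the input/output contract (a target frame in, a dense positive depth map of the same resolution out), a toy encoder-decoder is sketched below. It is not DispNet and reproduces none of its architectural details.

```python
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Illustrative encoder-decoder with the same input/output contract as step S5:
    an image batch (B, 3, H, W) in, a positive depth map (B, 1, H, W) out (H, W divisible by 8)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        disp = torch.sigmoid(self.decoder(self.encoder(x)))   # bounded disparity-like output
        return 1.0 / (disp * 10.0 + 0.01)                     # converted to a positive depth

# depth = TinyDepthNet()(torch.rand(1, 3, 128, 416))          # -> tensor of shape (1, 1, 128, 416)
```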
As a preferred technical solution of the present invention, in step S6 the image of the synthesized frame is calculated from the pose of the frame following the target frame relative to the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: the target frame image is first mapped into the three-dimensional coordinate system using the target frame depth values, and the three-dimensional points are then re-projected into the image under the new viewing angle, according to
$$P = \hat{D}_{t}(p)\, K^{-1} p$$
$$p_{s} \sim K\, \hat{T}_{t \to s}\, \hat{D}_{t}(p)\, K^{-1} p$$
where $\hat{D}_{t}(p)$ is the estimated depth of point p in the target frame, K is the camera intrinsic matrix, P is the three-dimensional coordinate of the mapped point p, $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and the symbol $\sim$ indicates equality up to a scale factor in homogeneous coordinates.
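A sketch of the two-step synthesis of step S6, lifting target-frame pixels to 3-D with the estimated depth and re-projecting them with the virtual pose; it assumes a pinhole model with intrinsics K and a 4x4 relative pose, and uses nearest-neighbour sampling only to keep the illustration short (a differentiable implementation would use bilinear sampling).

```python
import numpy as np

def synthesize_target(source_img, depth_t, K, T_t2s):
    """Warp a colour source frame (H, W, 3) into the target view from the target depth map
    `depth_t` (H, W) and the 4x4 relative pose `T_t2s` (target -> source): p_s ~ K T D(p) K^-1 p."""
    h, w = depth_t.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(np.float64)
    cam_pts = (np.linalg.inv(K) @ pix) * depth_t.reshape(1, -1)      # P = D(p) K^-1 p
    cam_pts_h = np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))])
    proj = K @ (T_t2s @ cam_pts_h)[:3]                               # re-projection into the source view
    zs = np.where(proj[2] > 0, proj[2], np.inf)
    us = (proj[0] / zs).round().astype(int)
    vs = (proj[1] / zs).round().astype(int)
    valid = (us >= 0) & (us < w) & (vs >= 0) & (vs < h) & (proj[2] > 0)
    out = np.zeros_like(source_img)
    flat = out.reshape(-1, source_img.shape[-1])                     # row-major view over target pixels
    flat[valid] = source_img[vs[valid], us[valid]]                   # sample the source frame
    return out
```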
As a preferred embodiment of the present invention, in step S7 an objective function is constructed from the synthesized frames of step S6 and the original frames of step S3, based on the differences between the series of synthesized images and the target image; for a single target vehicle with n training frame images $\langle I_{1}, \dots, I_{n} \rangle$, the objective function is
$$L = \sum_{s} \sum_{p} \left| I_{t}(p) - \hat{I}_{s}(p) \right|$$
where p ranges over all points of an image, $I_{t}$ is the target frame, and $\hat{I}_{s}$ is the synthesized target frame image obtained from source frame s.
Compared with the prior art, the invention has the following beneficial effects. By converting the depth calculation problem in the fixed-camera scene of a traffic camera into the calculation of the pose of a virtual moving camera plus a depth calculation, it overcomes the difficulty of obtaining image depth from a fixed monocular camera whose viewing angle never changes. The segmented images of moving vehicles provide the basis for the monocular pose initialization and the subsequent pose calculations used as virtual poses. With this technical scheme, depth calculation is completed using only video data collected by a fixed camera with known intrinsic parameters, which suits practical traffic surveillance: each camera can learn its own depth calculation model, and the poor generalization of a depth model across camera mounting heights and scene changes no longer needs to be considered. The depth map computed by the depth calculation model can support tasks such as ranging and positioning of vehicles seen by the traffic camera.
Drawings
FIG. 1 is a training process and a system setup process of the traffic camera-based depth map estimation method of the present invention;
FIG. 2 is a vehicle segmentation result of the depth map estimation method based on the traffic camera according to the present invention;
FIG. 3 is a background replacement result of the depth map estimation method based on a traffic camera according to the present invention;
fig. 4 is a schematic view of a virtual pose of the depth map estimation method based on the traffic camera according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be described clearly and completely below with reference to the drawings of the embodiments of the present invention.
Example (b): as shown in fig. 1 to 4, the depth map estimation method based on the traffic camera specifically includes the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
the specific steps of step S2 are:
s21: segmenting all vehicles of each frame in the video by using an image semantic segmentation model depeplabv 3plus trained on an urban road scene data set cityscaps, and merging the rest classes into other classes, as shown in fig. 2;
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree of the image area covered by the template and the template according to a normalized correlation coefficient matching method, and selecting the vehicle with the maximum matching degree with the template as a tracking target to realize the rapid tracking of the vehicle; only the vehicle contour is segmented in the step; in step S224, the matching degree between the image area covered by the template and the template is calculated according to a normalized correlation coefficient matching method, and the calculation formula of the correlation coefficient is:
$$R(x,y)=\frac{\sum_{x',y'}\bigl(T'(x',y')\cdot I'(x+x',y+y')\bigr)}{\sqrt{\sum_{x',y'}T'(x',y')^{2}\cdot\sum_{x',y'}I'(x+x',y+y')^{2}}}$$
wherein $T$ represents the template image, $T'$ represents the difference between the template and the template mean, and $I'$ represents the difference between the original image and the mean of the original image over the covered region; among the candidates whose matching coefficient is larger than 0.7, the vehicle with the highest matching degree to the template is selected as the tracking target;
S3: background replacement is performed on the continuous frame images of the vehicle tracking video of step S2 to obtain the original frames. The specific steps of step S3 are: first, the continuous frames of a single vehicle are extracted, and all pixels except the target vehicle in each frame are set to a uniform value; the chosen background color contrasts with the color of the target vehicle, which facilitates subsequent feature point extraction and matching, and the resulting images are the original frames. The purpose of replacing the background of every frame is to prevent the static background and other dynamic objects in the image from disturbing the virtual-camera pose estimation and thereby to ease subsequent feature point extraction and matching. The replacement proceeds by first counting the number N of vehicles in the image, generating N background images of a uniform background color, extracting the RGB values of the vehicle pixels with the vehicle segmentation contours as position references, and substituting them at the same positions in the corresponding background image, thereby generating N images that each contain only one target vehicle. The background pixel RGB values are set to (192,192,192) so as to be distinguishable from the vehicles, as shown in fig. 3;
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
said step S4 determining a virtual pose by converting the motion of the vehicle relative to the camera into a virtual motion of the camera; as shown in figure 4 of the drawings,
which marks the three-dimensional spatial position of the camera optical center and the initial position of the moving target point; the moving trajectory of the target point corresponds to a virtual moving trajectory of the camera optical center relative to that initial target position.
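The conversion of the vehicle's motion in front of the fixed camera into a virtual motion of the camera can be pictured as a pose inversion; the snippet below is only a conceptual illustration with homogeneous 4x4 transforms, not a step prescribed by the patent.

```python
import numpy as np

def virtual_camera_pose(T_obj):
    """If the vehicle moves by the rigid transform T_obj (4x4) in the frame of the fixed camera,
    the same images are explained by a fixed vehicle and a camera that moves by the inverse."""
    R, t = T_obj[:3, :3], T_obj[:3, 3:]
    T_cam = np.eye(4)
    T_cam[:3, :3] = R.T          # inverse rotation
    T_cam[:3, 3:] = -R.T @ t     # inverse translation
    return T_cam
```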
The method specifically comprises the following steps of:
s41: firstly, extracting ORB characteristics from target vehicles in continuous frames and matching the characteristics of the front frame and the rear frame; removing noise points by a random sampling consensus RANSAC algorithm to solve a basic matrix F, and then solving an essential matrix E by combining camera intrinsic parameters K, wherein the relation between the camera intrinsic parameters K and the essential matrix E is as follows:
$$E = K^{T} F K$$
S411, ORB feature extraction: corner points are first extracted with the FAST algorithm. For a pixel p in the image with brightness $I_{p}$, a threshold $T$ is set, and the 16 pixels on a circle of radius 3 centred on p are examined; if 12 consecutive pixels on that circle are all brighter than $I_{p}+T$ or all darker than $I_{p}-T$, p is taken as a feature point. This test is applied to every pixel of the image until all feature points are found; if fewer than 100 feature points are extracted from a frame, the frame is discarded. A descriptor is then computed for each feature point with the BRIEF algorithm, which expresses the brightness relations of 128 random pixel points near the feature point with 0 and 1, forming a vector;
S412, feature matching: features are matched quickly with the fast approximate nearest neighbor (FLANN) algorithm, described in Muja, M., & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 331-340, and available through the open-source library FLANN (https://github.com/mariusmuja/flann);
S413: the RANSAC algorithm randomly draws 4 non-collinear samples from the matched data set and computes a model (parameter matrix) from them, then tests all data against this model and computes the projection error of the data that fit it; the model with the minimum projection error is taken as the optimal model;
the random sample consensus (RANSAC) algorithm is described in Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381-395;
s42: then decomposing a rotation matrix R and a translation matrix t of the camera motion through Singular Value Decomposition (SVD); let the SVD of E be:
$$E = U \Sigma V^{T}$$
where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then
$$\hat{t} = U\, R_{Z}\!\left(\pm\tfrac{\pi}{2}\right) \Sigma\, U^{T}, \qquad R = U\, R_{Z}^{T}\!\left(\pm\tfrac{\pi}{2}\right) V^{T}$$
where $R_{Z}(\pm\tfrac{\pi}{2})$ denotes the rotation matrix obtained by rotating 90 degrees around the Z axis; any matched point is substituted into the four candidate solutions above, and the solution for which the depth of the point is positive is the correct one;
S43: the three-dimensional coordinate P of the matched feature points is then calculated according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points, which satisfy
$$s_{1} x_{1} = s_{2} R x_{2} + t$$
where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least-squares solution of this equation gives the depths of the point in the two frames and hence the three-dimensional spatial position, i.e. the three-dimensional coordinates, of the matched feature points;
S44: after the pose has been initialized, the pose between adjacent frames is solved with the EPnP algorithm from the three-dimensional coordinates of the feature points of the previous frame and the two-dimensional pixel coordinates of the corresponding feature points of the subsequent frame. The camera intrinsics K comprise the camera focal length and the optical center, whose coordinates in the image plane are known; the EPnP algorithm is described in Lepetit, V., Moreno-Noguer, F., & Fua, P. (2009). EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2), 155;
S5: the per-pixel depth value of each frame of the continuous frame images of the single-vehicle tracking video processed in step S3 is calculated with the depth calculation network. The depth calculation network of step S5 is a DispNet network, which computes the depth values of the image pixels end to end; its input is the target frame image and its output is a depth map formed by the depth values of all pixels of the target frame. The structure of the DispNet network is given in Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., & Brox, T. (2016). A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4040-4048);
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
in step S6, the image of the synthesized frame is calculated according to the pose of the image of the next frame after the target frame relative to the target frame and the estimated depth value of the target frame; the synthetic frame image is calculated in two steps, firstly the target frame image is mapped to a three-dimensional coordinate system through the depth value of the target frame, and then the three-dimensional point is remapped to the image under a new visual angle, and the calculation formula is as follows:
Figure 295573DEST_PATH_IMAGE015
Figure 188443DEST_PATH_IMAGE016
wherein the content of the first and second substances,
Figure 900047DEST_PATH_IMAGE054
is the estimated depth of the p points in the target frame, K is the camera intrinsic parameter,
Figure 499655DEST_PATH_IMAGE018
for the three-dimensional coordinates of the mapped p-points,
Figure 435250DEST_PATH_IMAGE020
and
Figure 753099DEST_PATH_IMAGE022
is approximately equal to the relationship;
Figure 2815DEST_PATH_IMAGE024
is the relative pose between the target frame and the previous frame, i.e., the virtual pose in said step S4,
Figure 535558DEST_PATH_IMAGE026
and
Figure 579738DEST_PATH_IMAGE028
is approximately equal to the relationship;
s7: constructing an objective function according to the synthesized frame in the step S6 and the original frame in the step S3, and training a depth model corresponding to the depth calculation network in the step S5 according to the constraint of the objective function, so as to obtain a depth calculation map of the traffic camera video;
the step S7 constructing an objective function based on differences between a series of synthesized images and an objective image from the synthesized frame of the step S6 and the original frame of the step S3; having n training frame images for a single target vehicle
Figure 447200DEST_PATH_IMAGE029
The objective function is:
Figure DEST_PATH_IMAGE055
where p represents all points in an image,
Figure 235027DEST_PATH_IMAGE031
in order to be a target frame, the frame is,
Figure 871545DEST_PATH_IMAGE032
is the synthesized target frame image;
training a depth estimation model according to the constraint of the target function; the learning rate was set to 0.0002 and the number of iterations was 40 ten thousand.
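Wiring the objective of S7 to the quoted hyper-parameters (learning rate 0.0002, 400,000 iterations) could look as follows; the optimizer choice (Adam), the data-loading interface and the helper names `depth_net` and `warp` are illustrative assumptions standing in for the network of S5 and the synthesis of S6, since the patent does not specify them.

```python
import torch

def photometric_loss(target, synthesized_frames):
    """L = sum over source views s and pixels p of |I_t(p) - I_hat_s(p)| (mean over pixels here)."""
    return torch.stack([(target - s).abs().mean() for s in synthesized_frames]).sum()

def train(depth_net, warp, loader, steps=400_000, lr=2e-4, device="cpu"):
    """depth_net: network of S5; warp(source, depth, K, pose): synthesis of S6;
    loader yields (target, sources, K, poses) tuples for one tracked vehicle."""
    opt = torch.optim.Adam(depth_net.parameters(), lr=lr)   # Adam is an assumed optimizer choice
    it = iter(loader)
    for _ in range(steps):
        try:
            target, sources, K, poses = next(it)
        except StopIteration:
            it = iter(loader)
            target, sources, K, poses = next(it)
        target = target.to(device)
        depth = depth_net(target)                                                        # S5
        synthesized = [warp(s.to(device), depth, K, T) for s, T in zip(sources, poses)]  # S6
        loss = photometric_loss(target, synthesized)                                     # S7
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth_net
```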
The above description is only exemplary of the present invention and should not be taken as limiting the invention, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. A depth map estimation method based on a traffic camera is characterized by specifically comprising the following steps:
s1: collecting traffic video data through a fixed position and fixed-focus traffic camera;
s2: performing vehicle segmentation and tracking on the traffic video data acquired in the step S1 by adopting a semantic segmentation method to obtain a vehicle tracking video;
s3: performing background replacement on the continuous frame images of the vehicle tracking video in the step S2 to obtain an original frame;
s4: calculating the virtual pose of the camera on the continuous frame images of the single vehicle tracking video processed in the step S3;
s5: the continuous frame images of the single vehicle tracking video processed in the step S3 estimate a depth value of each frame based on pixels through a depth neural network;
s6: synthesizing a target frame image according to the virtual pose of the step S4 and the depth value of the pixel of the step S5 to obtain a synthesized frame;
s7: constructing an objective function according to the synthesized frame in the step S6 and the original frame in the step S3, and training the depth neural network in the step S5 according to the constraint of the objective function, so as to obtain a depth computation graph of the traffic camera video;
the specific steps of step S2 are:
s21: segmenting all vehicles of each frame in the traffic video data by using an image semantic segmentation model, and combining the categories except the vehicles;
s22: tracking each vehicle by adopting a template matching method;
s221: determining a rectangular envelope frame of the vehicle according to the minimum value and the maximum value of the coordinates of the vehicle segmentation contour;
s222: taking an envelope frame of a target vehicle in a previous frame of traffic video data as a template matched with a vehicle in a next frame;
s223: extracting the position of the center point of the template, and aligning the center point of the template with the center points of the envelope frames of all vehicles in the next frame one by one;
s224: then calculating the matching degree of the image area covered by the template and the template according to a normalized correlation coefficient matching method, and selecting the vehicle with the maximum matching degree with the template as a tracking target to realize the rapid tracking of the vehicle;
in step S224, the matching degree between the image area covered by the template and the template is calculated according to a normalized correlation coefficient matching method, and the calculation formula of the correlation coefficient is:
$$R(x,y)=\frac{\sum_{x',y'}\bigl(T'(x',y')\cdot I'(x+x',y+y')\bigr)}{\sqrt{\sum_{x',y'}T'(x',y')^{2}\cdot\sum_{x',y'}I'(x+x',y+y')^{2}}}$$
wherein $T$ represents the template image, $T'$ represents the difference between the template and the template mean, and $I'$ represents the difference between the original image and the mean of the original image over the covered region.
2. The method for estimating the depth map based on the traffic camera as claimed in claim 1, wherein the step S3 specifically comprises the steps of: firstly, extracting continuous frames of a single vehicle, and setting all pixels except a target vehicle in an image of each continuous frame as a uniform value; and the set background color and the target vehicle color form a contrast, so that subsequent feature points can be conveniently extracted and matched, and an original frame can be obtained.
3. The traffic camera-based depth map estimation method according to claim 1, wherein the step S4 determines the virtual pose by converting the motion of the vehicle relative to the camera into a virtual motion of the camera;
the method specifically comprises the following steps of:
s41: firstly, extracting the characteristics of a target vehicle in continuous frames and matching the characteristics of the front frame and the rear frame; removing noise points by a random sampling consensus RANSAC algorithm to solve a basic matrix F, and then solving an essential matrix E by combining camera intrinsic parameters K, wherein the relation between the camera intrinsic parameters K and the essential matrix E is as follows:
$$E = K^{T} F K$$
S42: the rotation matrix R and the translation vector t of the camera motion are then obtained by singular value decomposition (SVD); let the SVD of E be
$$E = U \Sigma V^{T}$$
where U and V are orthogonal matrices and $\Sigma$ is the singular value matrix; then
$$\hat{t} = U\, R_{Z}\!\left(\pm\tfrac{\pi}{2}\right) \Sigma\, U^{T}, \qquad R = U\, R_{Z}^{T}\!\left(\pm\tfrac{\pi}{2}\right) V^{T}$$
where $R_{Z}(\pm\tfrac{\pi}{2})$ denotes the rotation matrix obtained by rotating 90 degrees around the Z axis; any matched point is substituted into the four candidate solutions above, and the solution for which the depth of the point is positive is the correct one;
S43: the three-dimensional coordinate P of the matched feature points is calculated according to the triangulation principle; specifically, let $x_{1}$ and $x_{2}$ be the normalized coordinates of the two matched feature points, which satisfy
$$s_{1} x_{1} = s_{2} R x_{2} + t$$
where $s_{1}$ is the depth value of the feature point in the previous frame and $s_{2}$ is the depth value of the corresponding feature point in the subsequent frame; the least-squares solution of this equation gives the depths of the point in the two frames and hence the three-dimensional spatial position, i.e. the three-dimensional coordinates, of the matched feature points;
s44: after the pose is initialized, the pose between the adjacent frames is solved according to the three-dimensional coordinates of the feature points of the previous frame and the two-dimensional pixel coordinates of the feature points of the next frame corresponding to the three-dimensional coordinates.
4. The method for estimating a depth map based on a traffic camera according to claim 1, wherein the depth calculation network in step S5 is a DispNet network, and the depth values of the image pixels are calculated in an end-to-end manner; the input is a target frame image, and the output is a depth map formed by the depth values of all pixel points of the target frame.
5. The method for estimating a depth map based on a traffic camera as claimed in claim 3, wherein in step S6 the image of the synthesized frame is calculated from the pose of the frame following the target frame relative to the target frame and from the estimated depth values of the target frame; the synthesized frame image is calculated in two steps: the target frame image is first mapped into the three-dimensional coordinate system using the target frame depth values, and the three-dimensional points are then re-projected into the image under the new viewing angle, according to
$$P = \hat{D}_{t}(p)\, K^{-1} p$$
$$p_{s} \sim K\, \hat{T}_{t \to s}\, \hat{D}_{t}(p)\, K^{-1} p$$
where $\hat{D}_{t}(p)$ is the estimated depth of point p in the target frame, K is the camera intrinsic matrix, P is the three-dimensional coordinate of the mapped point p, $\hat{T}_{t \to s}$ is the relative pose between the target frame and the adjacent frame, i.e. the virtual pose of said step S4, and the symbol $\sim$ indicates equality up to a scale factor in homogeneous coordinates.
6. The method for estimating a depth map based on a traffic camera according to claim 5, wherein said step S7 constructs an objective function from the synthesized frame of said step S6 and the original frame of said step S3, based on the differences between a series of synthesized images and the target image; for a single target vehicle with n training frame images $\langle I_{1}, \dots, I_{n} \rangle$, the objective function is
$$L = \sum_{s} \sum_{p} \left| I_{t}(p) - \hat{I}_{s}(p) \right|$$
where p ranges over all points of an image, $I_{t}$ is the target frame, and $\hat{I}_{s}$ is the synthesized target frame image.
CN202110403339.6A 2021-04-15 2021-04-15 Depth map estimation method based on traffic camera Active CN112801074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110403339.6A CN112801074B (en) 2021-04-15 2021-04-15 Depth map estimation method based on traffic camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110403339.6A CN112801074B (en) 2021-04-15 2021-04-15 Depth map estimation method based on traffic camera

Publications (2)

Publication Number Publication Date
CN112801074A CN112801074A (en) 2021-05-14
CN112801074B true CN112801074B (en) 2021-07-16

Family

ID=75811436

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110403339.6A Active CN112801074B (en) 2021-04-15 2021-04-15 Depth map estimation method based on traffic camera

Country Status (1)

Country Link
CN (1) CN112801074B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269118B (en) * 2021-06-07 2022-10-11 重庆大学 Monocular vision forward vehicle distance detection method based on depth estimation
CN113255564B (en) * 2021-06-11 2022-05-06 上海交通大学 Real-time video identification accelerator based on key object splicing
CN113538350B (en) * 2021-06-29 2022-10-04 河北深保投资发展有限公司 Method for identifying depth of foundation pit based on multiple cameras
CN114119896B (en) * 2022-01-26 2022-04-15 南京信息工程大学 Driving path planning method
CN115294375B (en) * 2022-10-10 2022-12-13 南昌虚拟现实研究院股份有限公司 Speckle depth estimation method and system, electronic device and storage medium
CN117115786B (en) * 2023-10-23 2024-01-26 青岛哈尔滨工程大学创新发展中心 Depth estimation model training method for joint segmentation tracking and application method
CN117593528B (en) * 2024-01-18 2024-04-16 中数智科(杭州)科技有限公司 Rail vehicle bolt loosening detection method based on machine vision

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416840A (en) * 2018-03-14 2018-08-17 大连理工大学 A kind of dense method for reconstructing of three-dimensional scenic based on monocular camera
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013187923A2 (en) * 2012-06-14 2013-12-19 Aerohive Networks, Inc. Multicast to unicast conversion technique
WO2014165244A1 (en) * 2013-03-13 2014-10-09 Pelican Imaging Corporation Systems and methods for synthesizing images from image data captured by an array camera using restricted depth of field depth maps in which depth estimation precision varies
CN110414674B (en) * 2019-07-31 2021-09-10 浙江科技学院 Monocular depth estimation method based on residual error network and local refinement
CN111340864B (en) * 2020-02-26 2023-12-12 浙江大华技术股份有限公司 Three-dimensional scene fusion method and device based on monocular estimation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416840A (en) * 2018-03-14 2018-08-17 大连理工大学 A kind of dense method for reconstructing of three-dimensional scenic based on monocular camera
CN110120049A (en) * 2019-04-15 2019-08-13 天津大学 By single image Combined estimator scene depth and semantic method
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
CN110782490A (en) * 2019-09-24 2020-02-11 武汉大学 Video depth map estimation method and device with space-time consistency

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Depth Estimation in Still Images and Videos Using a Motionless Monocular Camera;Sotirios Diamantas et al.;《2016 IEEE International Conference on Imaging Systems and Techniques (IST)》;20161110;1-6 *
Research on single-object tracking algorithms in complex scenes based on deep learning; Guo Yali; China Master's Theses Full-text Database (Information Science and Technology); 20200115; 9-40 *
Research on unsupervised convolutional neural network based monocular scene depth estimation; Zhang Gengen; China Master's Theses Full-text Database (Information Science and Technology); 20200615; 16-29 *
A monocular visual SLAM method for dense reconstruction; Ji Xiang; China Master's Theses Full-text Database (Information Science and Technology); 20200215; 15-16, 32 *

Also Published As

Publication number Publication date
CN112801074A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112801074B (en) Depth map estimation method based on traffic camera
CN110569704B (en) Multi-strategy self-adaptive lane line detection method based on stereoscopic vision
CN111462135B (en) Semantic mapping method based on visual SLAM and two-dimensional semantic segmentation
CN110349250B (en) RGBD camera-based three-dimensional reconstruction method for indoor dynamic scene
CN110264416B (en) Sparse point cloud segmentation method and device
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN105225482B (en) Vehicle detecting system and method based on binocular stereo vision
EP3182371B1 (en) Threshold determination in for example a type ransac algorithm
CN111539273A (en) Traffic video background modeling method and system
CN108597009B (en) Method for detecting three-dimensional target based on direction angle information
CN112037159B (en) Cross-camera road space fusion and vehicle target detection tracking method and system
CN111553252A (en) Road pedestrian automatic identification and positioning method based on deep learning and U-V parallax algorithm
CN111998862B (en) BNN-based dense binocular SLAM method
CN110941996A (en) Target and track augmented reality method and system based on generation of countermeasure network
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism
EP3185212B1 (en) Dynamic particle filter parameterization
CN111709982B (en) Three-dimensional reconstruction method for dynamic environment
Burlacu et al. Obstacle detection in stereo sequences using multiple representations of the disparity map
CN110443228B (en) Pedestrian matching method and device, electronic equipment and storage medium
CN106709432B (en) Human head detection counting method based on binocular stereo vision
CN108090920B (en) Light field image depth stream estimation method
CN107944350B (en) Monocular vision road identification method based on appearance and geometric information fusion
CN103646397A (en) Real-time synthetic aperture perspective imaging method based on multi-source data fusion
CN111696147B (en) Depth estimation method based on improved YOLOv3 model
CN116804553A (en) Odometer system and method based on event camera/IMU/natural road sign

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230612

Address after: Room 403C, Building 2 and 3, Building M-10, Maqueling Industrial Zone, Maling Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057

Patentee after: Speed spatiotemporal big data research (Shenzhen) Co.,Ltd.

Address before: 210000 8 -22, 699 Xuanwu Road, Xuanwu District, Nanjing, Jiangsu.

Patentee before: SPEED TIME AND SPACE INFORMATION TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right
CP01 Change in the name or title of a patent holder

Address after: Room 403C, Building 2 and 3, Building M-10, Maqueling Industrial Zone, Maling Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057

Patentee after: Shenzhen Tianshu Intelligent Co.,Ltd.

Address before: Room 403C, Building 2 and 3, Building M-10, Maqueling Industrial Zone, Maling Community, Yuehai Street, Nanshan District, Shenzhen City, Guangdong Province, 518057

Patentee before: Speed spatiotemporal big data research (Shenzhen) Co.,Ltd.

CP01 Change in the name or title of a patent holder