CN116309773A - Target depth estimation method and device based on monocular vision and vehicle - Google Patents
- Publication number
- CN116309773A (application number CN202310193407.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- pixel coordinates
- target image
- vehicle
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/579—Depth or shape recovery from multiple images from motion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/62—Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The disclosure provides a monocular-vision-based target depth estimation method, device, and vehicle, relating to automatic driving technology. The method includes: acquiring raw video frame data of a target object captured by a camera; extracting a plurality of target images from the raw video frame data to form target video frame data; identifying the target object in each target image to obtain its actual pixel coordinates; sequentially determining, in time order, the predicted pixel coordinates of the target object in every target image other than the target image with the latest time; and updating the target depth value according to the predicted and actual pixel coordinates. The actual pixel coordinates of the target object in one target image can be used to determine its predicted pixel coordinates in the adjacent next target image; the target depth value is then updated according to the distance between the actual and predicted pixel coordinates, so real-time requirements can be met.
Description
Technical Field
The disclosure relates to automatic driving technology, in particular to a target depth estimation method and device based on monocular vision and a vehicle.
Background
With increasing demands for automobile safety and comfort, intelligent driving technology has received extensive attention and research. Environmental perception is an important basis for realizing automobile intelligence. Target depth estimation, i.e., estimating the longitudinal distance between a target object and a camera mounted on the vehicle (which can also be approximated as the longitudinal distance between the target object and the vehicle), is an important link in environmental perception.
In the prior art, a spatial measurement approach is mainly used, in which the calculation relies on camera pose data to estimate target depth. However, because camera pose data is required, the amount of calculation is large and the calculation speed is low, making real-time requirements difficult to meet.
Disclosure of Invention
The disclosure provides a monocular-vision-based target depth estimation method, device, and vehicle, which solve the prior-art problem that estimating target depth by spatial measurement makes real-time requirements difficult to meet.
According to a first aspect of the present disclosure, there is provided a target depth estimation method based on monocular vision, including:
acquiring raw video frame data of a target object captured by a camera mounted on a vehicle while the vehicle is running; extracting a plurality of target images from the raw video frame data in time order, so as to form target video frame data from the plurality of target images;
identifying the target object in each target image respectively, so as to obtain the actual pixel coordinates of the target object;
sequentially determining, in time order, the predicted pixel coordinates of the target object in every target image of the target video frame data other than the target image with the latest time; wherein the predicted pixel coordinates of the target object in each such target image are determined based on the actual pixel coordinates of the target object in the adjacent preceding target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and a preset motion model;
updating the target depth value of the target object in the target image with the latest time according to the predicted and actual pixel coordinates of the target object in the other target images; wherein the depth value characterizes an estimate of the longitudinal distance between the target object and the camera when the target image with the latest time was captured.
According to a second aspect of the present disclosure, there is provided a target depth estimation apparatus based on monocular vision, comprising:
an acquisition unit, configured to acquire raw video frame data of a target object captured by a camera mounted on the vehicle while the vehicle is running, and to extract a plurality of target images from the raw video frame data in time order, so as to form target video frame data from the plurality of target images;
an identification unit, configured to identify the target object in each target image respectively, so as to obtain the actual pixel coordinates of the target object;
a prediction unit, configured to sequentially determine, in time order, the predicted pixel coordinates of the target object in every target image of the target video frame data other than the target image with the latest time; wherein the predicted pixel coordinates of the target object in each such target image are determined based on the actual pixel coordinates of the target object in the adjacent preceding target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and a preset motion model;
an updating unit, configured to update the target depth value of the target object in the target image with the latest time according to the predicted and actual pixel coordinates of the target object in the other target images; wherein the depth value characterizes an estimate of the longitudinal distance between the target object and the camera when the target image with the latest time was captured.
According to a third aspect of the present disclosure, there is provided a vehicle control apparatus including a memory and a processor; wherein:
The memory is used for storing a computer program;
the processor is configured to read the computer program stored in the memory, and execute the target depth estimation method based on monocular vision according to the first aspect according to the computer program in the memory.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the monocular vision-based target depth estimation method according to the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the monocular vision-based target depth estimation method according to the first aspect.
The target depth estimation method, device, and vehicle based on monocular vision provided by the disclosure include: acquiring raw video frame data of a target object captured by a camera mounted on a vehicle while the vehicle is running; extracting a plurality of target images from the raw video frame data in time order, so as to form target video frame data from the plurality of target images; identifying the target object in each target image respectively to obtain its actual pixel coordinates; sequentially determining, in time order, the predicted pixel coordinates of the target object in every target image other than the target image with the latest time, each determined based on the actual pixel coordinates of the target object in the adjacent preceding target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and a preset motion model; and updating the target depth value of the target object in the target image with the latest capture time according to the predicted and actual pixel coordinates of the target object in the other target images, where the depth value characterizes an estimate of the longitudinal distance between the target object and the camera when that image was captured.
In the monocular-vision-based target depth estimation method, device, and vehicle, the actual pixel coordinates of the target object in one target image and the preset motion model can be used to determine the predicted pixel coordinates of the target object in the adjacent next target image; the target depth value can then be updated according to the distance between the actual and predicted pixel coordinates of the target object in each target image. Compared with the spatial measurement approach, this method does not depend on pose data and optimizes only a single parameter, the depth value, so the amount of calculation is small, the speed is high, and real-time requirements can be met.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a monocular vision-based target depth estimation method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flow chart of a monocular vision-based target depth estimation method according to another exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the target image ranked 3rd in order of time from late to early, according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the target image ranked 2nd in order of time from late to early, according to an exemplary embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the target image ranked 1st in order of time from late to early, according to an exemplary embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a motion model shown in an exemplary embodiment of the present disclosure;
FIG. 7 is a schematic view of a camera mounting angle shown in an exemplary embodiment of the present disclosure;
FIG. 8 is a diagram illustrating a monocular vision-based target depth estimation process according to an exemplary embodiment of the present disclosure;
FIG. 9 is a block diagram of a monocular vision-based target depth estimation device according to an exemplary embodiment of the present disclosure;
fig. 10 is a block diagram of a vehicle control apparatus shown in an exemplary embodiment of the present disclosure.
Detailed Description
With increasing demands for automobile safety and comfort, intelligent driving technology has received extensive attention and research. Environmental perception is an important basis for realizing automobile intelligence. Target depth estimation, i.e., estimating the longitudinal distance between a target object and a camera mounted on the vehicle (which can also be approximated as the longitudinal distance between the target object and the vehicle), is an important link in environmental perception.
In the prior art, target depth estimation is generally based either on deep learning or on spatial measurement. The former requires a large amount of image data for training, with true depth values as labels; the latter relies on camera pose data for its computation.
However, the spatial measurement approach depends on camera pose data, so the amount of calculation is large, the calculation speed is low, and real-time requirements are difficult to meet. The deep learning approach places high demands on computing resources and data and is complex to implement.
To solve the above technical problems, in the solution provided by the present disclosure, the actual pixel coordinates of the target object in one target image and a preset motion model can be used to determine the predicted pixel coordinates of the target object in the adjacent next target image; the target depth value can then be updated according to the distance between the actual and predicted pixel coordinates of the target object in each target image. Compared with the spatial measurement approach, this method does not depend on pose data and optimizes only a single parameter, the depth value, so the amount of calculation is small, the speed is high, and real-time requirements can be met. Compared with the deep learning approach, the required amount of data is greatly reduced and no manual labeling is needed, so there are no high demands on computing resources or data and the method is easy to implement.
It should be noted that the user information (including but not limited to user equipment information and user personal information) and data (including but not limited to data for analysis, stored data, and displayed data) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties. The collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions, and corresponding operation entries are provided for the user to choose authorization or refusal.
The following describes the technical solutions of the present disclosure and how the technical solutions of the present disclosure solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present disclosure will be described below with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating a monocular vision-based target depth estimation method according to an exemplary embodiment of the present disclosure.
As shown in fig. 1, the target depth estimation method based on monocular vision provided in this embodiment includes:
The execution subject of the method provided by the present disclosure may be a vehicle control device.
The vehicle control equipment can acquire original video frame data of a target object acquired by a camera arranged on the vehicle in the running process of the vehicle.
Wherein, monocular vision-based in this scheme refers to vision perception based on a single camera provided on the vehicle.
The target object may be a stationary object at the roadside during the running of the vehicle, for example the two types of map elements, poles (shafts) and outlines, or other stationary objects in the road environment such as traffic signs and traffic lights.
Specifically, a plurality of target images may be extracted from the original video frame data in a temporal order (for example, in a temporal order from late to early, or in a temporal order from early to late) to constitute target video frame data from the plurality of target images.
Optionally, the capture times of the target images may be close to the current time, and each pair of adjacent target images may be separated by the same time interval.
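As one possible reading of this sampling strategy, the following sketch extracts M target images at equal frame intervals, newest first. The function name, parameters, and stride value are illustrative assumptions, not specifics from the patent.

```python
def extract_target_images(frames, m=3, stride=5):
    """Extract m target images at equal frame intervals, newest first.

    `frames` is a list of video frames ordered from oldest to newest.
    Adjacent target images are separated by `stride` frames, so they
    also have the same time interval between them.
    """
    if len(frames) < (m - 1) * stride + 1:
        raise ValueError("not enough frames to extract m target images")
    # Walk backwards from the newest frame with a fixed stride.
    return [frames[len(frames) - 1 - i * stride] for i in range(m)]
```

For a 20-frame buffer with a stride of 5, this yields the newest frame plus the frames 5 and 10 positions before it.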
Specifically, the target object in each target image can be identified by using a preset mode, so that the actual pixel coordinates of the target object in each target image can be identified.
Wherein the initial depth value characterizes an estimate of the longitudinal distance between the object and the camera when acquiring the object image with the latest time.
Specifically, the initial depth value is given empirically, typically in the range of 10 to 60 meters, since at such distances the target object is usually already within the camera's visible range.
The parameter information of the camera may include intrinsic information and extrinsic information of the camera. The acquired parameter information of the camera is calibrated information.
The traveling information of the vehicle may include speed information of the vehicle.
The preset motion model is preset according to actual conditions. The motion model is obtained by modeling the motion trail of the vehicle and the motion trail of the target object in the video.
Specifically, the predicted pixel coordinates of the target object in the target images other than the target image with the latest time in the target video frame data may be sequentially determined in time order (for example, in order from late to early, or in order from early to late).
Specifically, the predicted pixel coordinates of the target object in each other target image are determined based on the actual pixel coordinates of the target object in the adjacent preceding target image, the obtained initial depth value, the parameter information of the camera, the driving information of the vehicle, and the preset motion model.
Specifically, in theory depth estimation can be performed with only two target images, but the more frames involved in the calculation, the more accurate the obtained depth, at the cost of lower calculation efficiency. For example, the target video frame data may include four target images, namely target image 1, target image 2, target image 3, and target image 4, arranged in order of capture time from late to early. The predicted pixel coordinates of the target object in target image 2 may be predicted from its actual pixel coordinates in target image 1, those in target image 3 from its actual pixel coordinates in target image 2, and those in target image 4 from its actual pixel coordinates in target image 3.
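The chained prediction above can be sketched under a deliberately simplified motion model: a pinhole camera translating straight toward a static target, ignoring the yaw and pitch corrections that the patent's full formula includes. All names are illustrative assumptions.

```python
def predict_offset(offset, depth, dz):
    """Predict the pixel offset (relative to the principal point) in the
    adjacent earlier image, where the target was `dz` metres farther away.
    Under pure forward translation the lateral position of a static point
    is constant, so its pixel offset scales by depth / (depth + dz)."""
    scale = depth / (depth + dz)
    return (offset[0] * scale, offset[1] * scale)

def chain_predictions(actual_offsets, z0, dzs):
    """actual_offsets[k] is the measured offset in the image of rank k+1
    (rank 1 = latest capture); dzs[k] is the travel distance between the
    captures of ranks k+1 and k+2. Returns predicted offsets for ranks
    2..M, each computed from the previous rank's ACTUAL offset, matching
    the patent's frame-by-frame scheme."""
    preds, depth = [], z0
    for actual, dz in zip(actual_offsets[:-1], dzs):
        preds.append(predict_offset(actual, depth, dz))
        depth += dz  # target was deeper at the earlier capture time
    return preds
```

With an initial depth of 20 m and 2 m of travel per frame gap, an offset of (10, 5) pixels in the latest image predicts roughly (9.09, 4.55) in the previous one.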
The target depth value is used for representing an estimated value of the longitudinal distance between the target object and the camera when the target image with the latest acquisition time is acquired.
Specifically, a cost function may be constructed from the distances between the actual and predicted pixel coordinates of the target object in the target images. The target depth is then estimated optimally by minimizing this cost function with an optimization method: the depth value is adjusted continuously until the number of iterations reaches an upper limit or the change in the depth value falls below a threshold, at which point the current depth value is taken as the target depth value. This completes the depth estimation process, which has been verified to achieve high accuracy.
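The single-parameter optimization described above can be sketched as minimizing a sum-of-squared-distances cost over the depth value alone, here with a golden-section search and the same simplified forward-translation model (yaw/pitch terms omitted; all names are illustrative assumptions, not the patent's actual formula).

```python
import math

def depth_cost(z, actual_offsets, dzs):
    """Sum of squared pixel distances between predicted and actual offsets,
    chaining predictions frame by frame from the latest image backwards."""
    total, depth = 0.0, z
    for k, dz in enumerate(dzs):
        scale = depth / (depth + dz)          # simplified motion model
        px, py = actual_offsets[k][0] * scale, actual_offsets[k][1] * scale
        ax, ay = actual_offsets[k + 1]
        total += (px - ax) ** 2 + (py - ay) ** 2
        depth += dz
    return total

def estimate_depth(actual_offsets, dzs, lo=5.0, hi=80.0, max_iter=100, tol=1e-5):
    """Golden-section search over the single depth parameter: shrink the
    bracket until the change in the estimate is below `tol` or the
    iteration cap is reached, mirroring the stopping rule in the text."""
    gr = (math.sqrt(5.0) - 1.0) / 2.0
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        a, b = hi - gr * (hi - lo), lo + gr * (hi - lo)
        if depth_cost(a, actual_offsets, dzs) < depth_cost(b, actual_offsets, dzs):
            hi = b
        else:
            lo = a
    return (lo + hi) / 2.0
```

With noiseless synthetic offsets generated at a true depth of 30 m (two 2 m frame gaps), the search recovers the depth to within the tolerance, illustrating why optimizing one scalar is cheap.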
The target depth estimation method based on monocular vision provided by the disclosure includes: acquiring raw video frame data of a target object captured by a camera mounted on a vehicle while the vehicle is running; extracting a plurality of target images from the raw video frame data in time order, so as to form target video frame data from the plurality of target images; identifying the target object in each target image respectively to obtain its actual pixel coordinates; sequentially determining, in time order, the predicted pixel coordinates of the target object in every target image other than the target image with the latest time, each determined based on the actual pixel coordinates of the target object in the adjacent preceding target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and a preset motion model; and updating the target depth value of the target object in the target image with the latest capture time according to the predicted and actual pixel coordinates of the target object in the other target images, where the depth value characterizes an estimate of the longitudinal distance between the target object and the camera when that image was captured.
In the method adopted by the disclosure, the actual pixel coordinates of the target object in one target image and a preset motion model can be used to determine the predicted pixel coordinates of the target object in the adjacent next target image; the target depth value can then be updated according to the distance between the actual and predicted pixel coordinates of the target object in each target image. Compared with the spatial measurement approach, this method does not depend on pose data and optimizes only a single parameter, the depth value, so the amount of calculation is small, the speed is high, and real-time requirements can be met.
Fig. 2 is a flow chart illustrating a monocular vision-based target depth estimation method according to another exemplary embodiment of the present disclosure.
As shown in fig. 2, the target depth estimation method based on monocular vision provided in this embodiment includes:
Specifically, the principle and implementation of step 201 are similar to those of step 101, and will not be described again.
Specifically, after step 202, step 203 or step 204 may be performed.
Specifically, the target object in each target image can be identified in a preset manner, and its position in the target image framed with a bounding box, thereby identifying the position of the box. The diagonals of the box are then connected, and the pixel coordinates of the intersection of the two diagonals are taken as the actual pixel coordinates of the target object.
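A minimal sketch of taking the diagonal intersection as the actual pixel coordinates; the intersection of an axis-aligned box's diagonals is simply its center. Names are illustrative.

```python
def box_center(x_min, y_min, x_max, y_max):
    """The two diagonals of an axis-aligned bounding box intersect at its
    center, which serves as the target's actual pixel coordinates."""
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)
```

For example, a detection box spanning (100, 50) to (180, 150) yields actual pixel coordinates (140, 100).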
Specifically, after step 203, step 205 may be performed.
Specifically, when the number of target images is M, the predicted pixel coordinates in the (N+1)-th target image are determined from the actual pixel coordinates of the target object in the N-th target image, the parameter information of the camera, the driving information of the vehicle, and the preset motion model; N is then incremented by 1, and this step is repeated in order of time from late to early until the predicted pixel coordinates of the target object in the M-th target image of the target video frame data have been determined. N and M are positive integers, N is initially 1, and M is greater than 1.
In one implementation, M = 3. In order of time from late to early, the predicted pixel coordinates of the target object in the 2nd-ranked target image are determined from the actual pixel coordinates of the target object in the 1st-ranked target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and the preset motion model.
Optionally, the parameter information of the camera comprises pixel size, focal length, yaw angle, pitch angle and principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the target object in the 2nd-ranked target image are determined by the following formula, in which:
z′ denotes the initial depth value; (x₀, y₀) denotes the difference between the actual pixel coordinates of the target object in the 1st-ranked target image and the principal point coordinates; (x₁′, y₁′) denotes the difference between the predicted pixel coordinates of the target object in the 2nd-ranked target image and the principal point coordinates; Δz₁ denotes the distance travelled by the vehicle from the capture time of the 2nd-ranked target image to the capture time of the 1st-ranked target image, determined from the vehicle speed information and the time difference between those two capture times; (f_x, f_y) denotes the equivalent focal length, obtained by dividing the focal length by the pixel size; θ denotes the yaw angle; and β denotes the pitch angle.
Specifically, the camera parameters may include pixel size, focal length, yaw angle, pitch angle, and principal point coordinates. The focal length is the calibrated focal length of the camera. The yaw angle and pitch angle are extrinsic parameters of the camera and are likewise calibrated. The principal point coordinates are an intrinsic parameter of the camera; the calibrated principal point coordinates are used, the principal point being the center point of the camera coordinate system.
Specifically, the running information of the vehicle may include vehicle speed information.
Specifically, Δz₁ represents the distance travelled by the vehicle from the capture time of the 2nd-ranked target image to the capture time of the 1st-ranked target image. Δz₁ is determined from the vehicle speed information (such as the average vehicle speed) over that interval and the time difference between the two capture times, which may in turn be determined from the number of frames between the 1st-ranked and 2nd-ranked target images and the time interval between every two adjacent frames.
Specifically, the equivalent focal length is determined by dividing the focal length by the pixel size.
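The two derived quantities in this passage, the equivalent focal length and the travel distance Δz, reduce to simple arithmetic; the sketch below assumes millimetre units for the focal length and pixel size, and metres per second for the speed (units and names are illustrative assumptions).

```python
def equivalent_focal_length(focal_length_mm, pixel_size_mm):
    """Equivalent focal length in pixels: focal length divided by pixel size."""
    return focal_length_mm / pixel_size_mm

def travel_distance(avg_speed_mps, n_frames_apart, frame_interval_s):
    """Delta-z between two target images: average vehicle speed times the
    time difference (number of frames apart times the per-frame interval)."""
    return avg_speed_mps * n_frames_apart * frame_interval_s
```

For example, a 6 mm lens with 3 µm pixels gives an equivalent focal length of 2000 pixels, and a vehicle averaging 20 m/s covers 4 m across five 40 ms frame gaps.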
Then, the predicted pixel coordinates of the target object in the 3rd-ranked target image are determined from the actual pixel coordinates of the target object in the 2nd-ranked target image, the acquired initial depth value, the parameter information of the camera, the driving information of the vehicle, and the preset motion model.
For example, taking a pole in a road environment as the target object, if the target object is the pole below a traffic sign, the 3rd-ranked target image is shown in FIG. 3, the 2nd-ranked target image in FIG. 4, and the 1st-ranked target image in FIG. 5.
Optionally, the parameter information of the camera comprises pixel size, focal length, yaw angle, pitch angle and principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the target object in the ordered 3rd target image are determined by the following formulas:

x_2' = (x_1·(z' + Δz_1·cos θ) - f_x·Δz_2·sin θ) / (z' + Δz_1·cos θ + Δz_2·cos θ)

y_2' = (y_1·(z' + Δz_1·cos β) - f_y·Δz_2·sin β) / (z' + Δz_1·cos β + Δz_2·cos β)

wherein z' represents the initial depth value; (x_1, y_1) represents the difference between the actual pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates; (x_2', y_2') represents the difference between the predicted pixel coordinates of the target object in the ordered 3rd target image and the principal point coordinates; Δz_1 represents the distance travelled by the vehicle from the time the ordered 2nd target image was acquired to the time the ordered 1st target image was acquired; Δz_2 represents the distance travelled by the vehicle from the time the ordered 3rd target image was acquired to the time the ordered 2nd target image was acquired, and is determined according to the vehicle speed information and the time difference between those two acquisition times; (f_x, f_y) represents the equivalent focal length, determined by dividing the focal length by the pixel size; θ represents the yaw angle; β represents the pitch angle.
Specifically, Δz_2 represents the distance travelled by the vehicle from the time the ordered 3rd target image was acquired to the time the ordered 2nd target image was acquired. Δz_2 is determined according to the vehicle speed information over that interval (such as the average vehicle speed) and the time difference between the two acquisition times; the time difference may be determined from the number of frames between the ordered 2nd and 3rd target images and the time interval between every two adjacent frames.
Optionally, the depth estimation requires predicting the pixel coordinates of the target object in an adjacent frame; for this purpose a motion model is built that takes the pixel coordinates of the target object in the current frame as input. Since pixel coordinates take the upper-left corner of the image as their origin, the principal point coordinates are first subtracted from the pixel coordinates, converting them into camera-centered coordinates that can participate in the calculation.
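A minimal sketch of this coordinate shift (the function name and sample values are illustrative, not from the patent):

```python
def pixel_to_camera_plane(u, v, cx, cy):
    """Shift pixel coordinates (origin: upper-left image corner) so that the
    principal point (cx, cy) becomes the origin, as the motion model expects."""
    return u - cx, v - cy

# pixel (960, 400) in an image whose principal point is (950, 540)
x, y = pixel_to_camera_plane(960.0, 400.0, 950.0, 540.0)
```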
As shown in fig. 6, viewed from above, assume the vehicle travels from point O_1 toward point O_2. The lines on which O_1A and O_2B lie represent the optical axis of the camera at the two positions, and the lines on which AD and BE lie are in the image planes of the camera at O_1 and at O_2, respectively. Point P is the observation point (i.e., the position of the target object), and D, E are the projections of P on the image planes at O_1 and O_2. Let AD = x (the X-direction component of the difference between the pixel coordinates of the target in the current frame and the principal point coordinates), O_1A = O_2B = f_x (the X-direction component of the equivalent focal length), and O_1O_2 = Δz (the distance travelled by the vehicle from the current frame time to the next frame time); BE = x' (the X-direction component of the difference between the predicted pixel coordinates of the target in the next frame and the principal point coordinates) is the quantity to be obtained. Draw an auxiliary line through P perpendicular to the optical-axis direction, intersecting the two optical axes at N and M; drop a perpendicular from O_2 onto the optical axis through O_1, meeting it at C; and let O_1N = z. The camera mounting angle must also be considered: as shown in fig. 7, the camera coordinate system differs from the vehicle carrier coordinate system by angles in the three directions pitch, azimuth and roll, where the dark axes denote the camera coordinate system and the light axes the carrier coordinate system. When solving for x', only the azimuth angle needs to be considered, namely the angle between the optical axis and the XOZ plane of the carrier coordinate system, ∠CO_1O_2; denote its value by θ, the yaw angle of the camera.
Then there are the following relationships:

O_2C = MN = Δz·sin θ (1)

O_1C = Δz·cos θ (2)

NP / O_1N = x / f_x (3)

O_2M = O_1N - O_1C (4)

MP = MN + NP (5)

MP / O_2M = x' / f_x (6)

From equation (3):

NP = z·x / f_x (7)

From equations (2) and (4):

O_2M = z - Δz·cos θ (8)

From equations (1), (5) and (7):

MP = Δz·sin θ + z·x / f_x (9)

From equations (6), (8) and (9):

x' / f_x = (Δz·sin θ + z·x / f_x) / (z - Δz·cos θ) (10)

i.e.,

x' = (x·z + f_x·Δz·sin θ) / (z - Δz·cos θ) (11)

In the same way,

y' = (y·z + f_y·Δz·sin β) / (z - Δz·cos β) (12)

wherein f_y represents the Y-direction component of the equivalent focal length; y represents the Y-direction component of the difference between the pixel coordinates of the target in the current frame and the principal point coordinates; y' represents the Y-direction component of the difference between the predicted pixel coordinates of the target in the next frame and the principal point coordinates; β is the pitch angle, i.e., the angle between the camera optical axis and the XOY plane of the carrier coordinate system in fig. 7.
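Equations (11) and (12) transcribe directly into code; the sketch below is illustrative (function and variable names, and the sample values, are not from the patent):

```python
import math

def predict_closer_frame_pixel(x, y, z, dz, fx, fy, yaw, pitch):
    """Forward motion model of equations (11)/(12): given the principal-point-
    centered pixel coordinates (x, y) of a static target in the farther frame,
    its depth z along the optical axis, and the travel distance dz between the
    two frames, predict the coordinates (x', y') in the closer frame."""
    x_pred = (x * z + fx * dz * math.sin(yaw)) / (z - dz * math.cos(yaw))
    y_pred = (y * z + fy * dz * math.sin(pitch)) / (z - dz * math.cos(pitch))
    return x_pred, y_pred
```

For a camera looking straight along the travel direction (yaw = pitch = 0) this reduces to x' = x·z/(z - Δz): the target's image moves away from the principal point as the vehicle approaches it.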
The formulas derived above describe predicting the target in a closer frame from the target in a farther frame as the vehicle travels. In practice, however, the depth of the nearest frame is to be estimated, so the pixel coordinates of the target in the farther frame are predicted from the closer frame; converting formula (11) gives

x = (x'·(z - Δz·cos θ) - f_x·Δz·sin θ) / z (13)

At this point O_1N becomes the unknown quantity and O_2M the known quantity. Let O_2M be z' (i.e., the initial depth value); then from equation (8)

z = z' + Δz·cos θ (14)

Substituting equation (14) into equation (13) yields

x = (x'·z' - f_x·Δz·sin θ) / (z' + Δz·cos θ) (15)

In the same way,

y = (y'·z' - f_y·Δz·sin β) / (z' + Δz·cos β) (16)
At this point, x in formula (15) represents the X-direction component of the difference between the predicted pixel coordinates of the target object in the current frame and the principal point coordinates, and x' represents the X-direction component of the difference between the pixel coordinates of the target object in the next frame after the current frame and the principal point coordinates.
Likewise, y in formula (16) represents the Y-direction component of the difference between the predicted pixel coordinates of the target object in the current frame and the principal point coordinates, and y' represents the Y-direction component of the difference between the pixel coordinates of the target object in the next frame after the current frame and the principal point coordinates.
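Equations (15) and (16) likewise transcribe directly into a sketch (names and sample values are illustrative):

```python
import math

def predict_farther_frame_pixel(x_next, y_next, z_prime, dz, fx, fy, yaw, pitch):
    """Inverse motion model of equations (15)/(16): from the principal-point-
    centered pixel coordinates of the target in the closer frame, whose depth
    along the optical axis is z_prime, predict its coordinates in the farther
    frame, dz of vehicle travel earlier."""
    x_prev = (x_next * z_prime - fx * dz * math.sin(yaw)) / (z_prime + dz * math.cos(yaw))
    y_prev = (y_next * z_prime - fy * dz * math.sin(pitch)) / (z_prime + dz * math.cos(pitch))
    return x_prev, y_prev
```

By construction, applying formula (11) to go from the farther frame to the closer frame and then formula (15) with z' = z - Δz·cos θ returns the original x exactly.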
Specifically, when the number of target images is M, the predicted pixel coordinates in the (N+1)-th target image are determined according to the actual pixel coordinates of the target object in the N-th target image, the parameter information of the camera, the running information of the vehicle and the preset motion model; N is then incremented by 1, and this step is repeated in order of time from early to late until the predicted pixel coordinates of the target object in the (M-1)-th target image are determined; wherein N and M are positive integers, N is initially 1, and M is greater than 1.
The target depth value is used for representing an estimated value of the longitudinal distance between the target object and the camera when the target image with the latest acquisition time is acquired.
Specifically, as shown in fig. 8, if the initial depth value is close to the true value, the predicted pixel coordinates will also be close to the actual position of the pole in the image. Since the initial depth value is only an empirical value, however, there is usually a gap between the two, so the distance between the predicted pixel coordinates and the actual pixel coordinates is taken as the residual to be optimized. Depth estimation is thereby converted into a nonlinear least-squares problem whose cost function is

F(z') = (x_1 - x_1')² + (y_1 - y_1')² + (x_2 - x_2')² + (y_2 - y_2')²

wherein F(z') represents the cost function; z' represents the initial depth value, on which the predicted values depend through the motion model; (x_0, y_0) represents the difference between the actual pixel coordinates of the target object in the ordered 1st target image (ordering times from late to early) and the principal point coordinates (the actual values x_0, y_0 in fig. 8); (x_1, y_1) represents the difference between the actual pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates (the actual values x_1, y_1 in fig. 8); (x_1', y_1') represents the difference between the predicted pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates (the predicted values x_1', y_1' in fig. 8); (x_2, y_2) represents the difference between the actual pixel coordinates of the target object in the ordered 3rd target image and the principal point coordinates (the actual values x_2, y_2 in fig. 8); (x_2', y_2') represents the difference between the predicted pixel coordinates of the target object in the ordered 3rd target image and the principal point coordinates (the predicted values x_2', y_2' in fig. 8).
The target depth is then estimated by optimization: the cost function is minimized by an optimization method, continually updating the depth value until the number of iterations reaches an upper limit or the change in the depth value falls below a threshold, at which point the current depth value is determined as the target depth value. This completes the depth estimation process; experiments verify that the method achieves high accuracy.
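The patent does not fix a particular optimization method; the sketch below uses a golden-section search over the scalar z' as a stand-in, chaining the inverse motion model of equations (14)-(16) to build the cost F(z'). All names and sample values are illustrative:

```python
import math

def estimate_target_depth(obs, dzs, fx, fy, yaw, pitch, z_lo=1.0, z_hi=200.0):
    """Estimate the target depth z' (depth in the latest frame) by minimising
    the reprojection cost F(z') with a golden-section search.

    obs: actual principal-point-centered pixel coordinates of the target,
         ordered from the latest frame to the earliest: [(x0, y0), (x1, y1), ...]
    dzs: travel distances between consecutive frames: [dz1, dz2, ...]
    """
    def cost(zp):
        f, z = 0.0, zp
        for i, dz in enumerate(dzs):
            x_src, y_src = obs[i]  # actual coordinates in the closer frame
            # predict the next-farther frame via equations (15)/(16)
            px = (x_src * z - fx * dz * math.sin(yaw)) / (z + dz * math.cos(yaw))
            py = (y_src * z - fy * dz * math.sin(pitch)) / (z + dz * math.cos(pitch))
            ax, ay = obs[i + 1]    # actual coordinates in that farther frame
            f += (ax - px) ** 2 + (ay - py) ** 2
            z += dz * math.cos(yaw)  # depth of the next source frame (eq. (14))
        return f

    # Golden-section search; assumes the cost is unimodal on [z_lo, z_hi].
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = z_lo, z_hi
    while b - a > 1e-6:
        c, d = b - g * (b - a), a + g * (b - a)
        if cost(c) < cost(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)
```

On noise-free synthetic observations generated with the same motion model, the search recovers the true depth; with real detections the residuals are nonzero and the minimizer is the least-squares estimate.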
Fig. 9 is a block diagram of a monocular vision-based target depth estimation apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 9, the target depth estimation apparatus 900 based on monocular vision provided in the present disclosure includes:
an acquiring unit 910, configured to acquire original video frame data of an object acquired by a camera set on a vehicle during a driving process of the vehicle; extracting a plurality of target images from the original video frame data according to the time sequence so as to form target video frame data by the plurality of target images;
an identifying unit 920, configured to identify the target object in each target image, so as to obtain an actual pixel coordinate of the target object;
a prediction unit 930, configured to sequentially determine, according to a time sequence, predicted pixel coordinates of a target object in other target images except for a target image with a latest time in the target video frame data; the predicted pixel coordinates of the targets in each other target image are determined based on the actual pixel coordinates of the targets in the last target image adjacent to the predicted pixel coordinates, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model;
An updating unit 940, configured to update the target depth value of the target object in the target image with the latest acquisition time sequentially according to the predicted pixel coordinates and the actual pixel coordinates of the target object in the other target images in time sequence; wherein the depth value is used to characterize an estimate of the longitudinal distance between the object in the object image at the latest time and the camera.
The prediction unit 930 is specifically configured to, when the number of target images is M, determine the predicted pixel coordinates in the (N+1)-th target image from the target video frame data according to the actual pixel coordinates of the target object in the N-th target image, the obtained initial depth value, the parameter information of the camera, the running information of the vehicle and the preset motion model; add 1 to N; and repeat this step in order of time from late to early until the predicted pixel coordinates of the target object in the M-th target image are determined; wherein N and M are positive integers, N is initially 1, and M is greater than 1.
A prediction unit 930, specifically configured to determine, in order of time from late to early, the predicted pixel coordinates of the target object in the ordered 2nd target image according to the actual pixel coordinates of the target object in the ordered 1st target image, the obtained initial depth value, the parameter information of the camera, the running information of the vehicle and the preset motion model;
And determining the predicted pixel coordinates of the targets in the target images of the sequence 3 according to the actual pixel coordinates of the targets in the target images of the sequence 2, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and the preset motion model.
The prediction unit 930 is specifically configured to use parameter information of the camera including pixel size, focal length, yaw angle, pitch angle, and principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the target object in the ordered 2nd target image are determined by the following formulas:

x_1' = (x_0·z' - f_x·Δz_1·sin θ) / (z' + Δz_1·cos θ)

y_1' = (y_0·z' - f_y·Δz_1·sin β) / (z' + Δz_1·cos β)

wherein z' represents the initial depth value; (x_0, y_0) represents the difference between the actual pixel coordinates of the target object in the ordered 1st target image and the principal point coordinates; (x_1', y_1') represents the difference between the predicted pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates; Δz_1 represents the distance travelled by the vehicle from the time the ordered 2nd target image was acquired to the time the ordered 1st target image was acquired, determined according to the vehicle speed information and the time difference between those two acquisition times; (f_x, f_y) represents the equivalent focal length, determined by dividing the focal length by the pixel size; θ represents the yaw angle; β represents the pitch angle.
The prediction unit 930 is specifically configured to use parameter information of the camera including pixel size, focal length, yaw angle, pitch angle, and principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the target object in the ordered 3rd target image are determined by the following formulas:

x_2' = (x_1·(z' + Δz_1·cos θ) - f_x·Δz_2·sin θ) / (z' + Δz_1·cos θ + Δz_2·cos θ)

y_2' = (y_1·(z' + Δz_1·cos β) - f_y·Δz_2·sin β) / (z' + Δz_1·cos β + Δz_2·cos β)

wherein z' represents the initial depth value; (x_1, y_1) represents the difference between the actual pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates; (x_2', y_2') represents the difference between the predicted pixel coordinates of the target object in the ordered 3rd target image and the principal point coordinates; Δz_1 represents the distance travelled by the vehicle from the time the ordered 2nd target image was acquired to the time the ordered 1st target image was acquired; Δz_2 represents the distance travelled by the vehicle from the time the ordered 3rd target image was acquired to the time the ordered 2nd target image was acquired, determined according to the vehicle speed information and the time difference between those two acquisition times; (f_x, f_y) represents the equivalent focal length, determined by dividing the focal length by the pixel size; θ represents the yaw angle; β represents the pitch angle.
The identifying unit 920 is specifically configured to identify the target object in each target image and a box where the target object is located;
and determine the pixel coordinates of the intersection point of the two diagonals of the box as the actual pixel coordinates of the target object.
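A minimal sketch of this step (the function name is illustrative): the intersection of a rectangle's two diagonals is simply its center.

```python
def box_center(x_min, y_min, x_max, y_max):
    """Actual pixel coordinates of the target: the intersection of the two
    diagonals of its bounding box, i.e. the box center."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
```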
The prediction unit 930 is further configured to, when the number of target images is M, determine the predicted pixel coordinates in the (N+1)-th target image from the target video frame data according to the actual pixel coordinates of the target object in the N-th target image, the obtained initial depth value, the parameter information of the camera, the running information of the vehicle and the preset motion model, in order of time from early to late; add 1 to N; and repeat this step in order of time from early to late until the predicted pixel coordinates of the target object in the (M-1)-th target image are determined; wherein N and M are positive integers, N is initially 1, and M is greater than 1.
Fig. 10 is a block diagram of a vehicle control apparatus shown in an exemplary embodiment of the present disclosure.
As shown in fig. 10, the vehicle control apparatus provided in this embodiment includes:
a memory 1001;
a processor 1002; and
a computer program;
wherein a computer program is stored in the memory 1001 and configured to be executed by the processor 1002 to implement any of the monocular vision based target depth estimation methods as described above.
The present embodiment also provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement any of the monocular vision-based target depth estimation methods described above.
The present embodiment also provides a computer program product comprising a computer program which, when executed by a processor, implements any of the above-described monocular vision-based target depth estimation methods.
This embodiment further provides a vehicle comprising the vehicle control device described above; target depth estimation is implemented through the vehicle control device so as to control the running of the vehicle.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by program instructions executed on related hardware. The foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.
Claims (12)
1. A monocular vision-based target depth estimation method, comprising:
acquiring original video frame data of a target object acquired by a camera arranged on a vehicle in the running process of the vehicle; extracting a plurality of target images from the original video frame data according to the time sequence so as to form target video frame data by the plurality of target images;
respectively identifying a target object in each target image to obtain actual pixel coordinates of the target object;
sequentially determining predicted pixel coordinates of targets in other target images except the target image with the latest time in the target video frame data according to the time sequence; the predicted pixel coordinates of the target object in each other target image are determined based on the actual pixel coordinates of the target object in the last target image adjacent to the predicted pixel coordinates, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model;
updating and acquiring a target depth value of a target object in the target image with the latest time according to the predicted pixel coordinates and the actual pixel coordinates of the target object in other target images; the depth value is used for representing an estimated value of the longitudinal distance between the target object and the camera when the target image with the latest time is acquired.
2. The method according to claim 1, wherein when the target images are M, sequentially determining, in time order, predicted pixel coordinates of a target object in other target images except for the target image with the latest time in the target video frame data, includes:
determining the predicted pixel coordinates in the (N+1)-th ordered target image from the target video frame data according to the actual pixel coordinates of the target object in the N-th ordered target image, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model, adding 1 to N, and repeating this step in order of time from late to early until the predicted pixel coordinates of the target object in the M-th ordered target image are determined; wherein N and M are positive integers, N is initially 1, and M is greater than 1.
3. The method according to claim 2, wherein M=3, and wherein the determining the predicted pixel coordinates in the (N+1)-th ordered target image from the target video frame data according to the actual pixel coordinates of the target object in the N-th ordered target image, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and the preset motion model, adding 1 to N, and repeating in order of time from late to early until the predicted pixel coordinates of the target object in the M-th ordered target image are determined, comprises:
According to the sequence from late to early, determining the predicted pixel coordinates of the target objects in the target images of the sequence 2 according to the actual pixel coordinates of the target objects in the target images of the sequence 1, the acquired initial depth values, the parameter information of the camera, the running information of the vehicle and a preset motion model;
and determining the predicted pixel coordinates of the targets in the target image of the sequence 3 according to the actual pixel coordinates of the targets in the target image of the sequence 2, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model.
4. A method according to claim 3, wherein the parameter information of the camera comprises pixel size, focal length, yaw angle, pitch angle, principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the object in the ordered 2 nd object image are determined by the following formula:
wherein z' represents the initial depth value; (x_0, y_0) represents the difference between the actual pixel coordinates of the target object in the ordered 1st target image and the principal point coordinates; (x_1', y_1') represents the difference between the predicted pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates; Δz_1 represents the distance travelled by the vehicle from the time the ordered 2nd target image was acquired to the time the ordered 1st target image was acquired, determined from the vehicle speed information and the time difference between the time of acquisition of the ordered 2nd target image and the time of acquisition of the ordered 1st target image; (f_x, f_y) represents an equivalent focal length, determined by dividing said focal length by said pixel size; θ represents the yaw angle; β represents the pitch angle.
5. A method according to claim 3, wherein the parameter information of the camera comprises pixel size, focal length, yaw angle, pitch angle, principal point coordinates; the running information of the vehicle includes vehicle speed information;
the predicted pixel coordinates of the object in the ordered 3 rd object image are determined by the following formula:
wherein z' represents the initial depth value; (x_1, y_1) represents the difference between the actual pixel coordinates of the target object in the ordered 2nd target image and the principal point coordinates; (x_2', y_2') represents the difference between the predicted pixel coordinates of the target object in the ordered 3rd target image and the principal point coordinates; Δz_2 represents the distance travelled by the vehicle from the time the ordered 3rd target image was acquired to the time the ordered 2nd target image was acquired, determined from the vehicle speed information and the time difference between the time of acquisition of the ordered 3rd target image and the time of acquisition of the ordered 2nd target image; (f_x, f_y) represents an equivalent focal length, determined by dividing said focal length by said pixel size; θ represents the yaw angle; β represents the pitch angle.
6. The method of any one of claims 1-5, wherein the identifying the object in each object image to obtain actual pixel coordinates of the object comprises:
respectively identifying a target object in each target image and a box where the target object is located;
and determining the pixel coordinates of the intersection point of the two diagonals of the box as the actual pixel coordinates of the target object.
7. The method according to claim 1, wherein when the target images are M, determining, in order of time, predicted pixel coordinates of a target object in other target images than the target image with the latest time in the target video frame data sequentially, further comprises:
determining, in order of time from early to late, the predicted pixel coordinates in the (N+1)-th ordered target image from the target video frame data according to the actual pixel coordinates of the target object in the N-th ordered target image, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model, adding 1 to N, and repeating this step in order of time from early to late until the predicted pixel coordinates of the target object in the (M-1)-th ordered target image are determined; wherein N and M are positive integers, N is initially 1, and M is greater than 1.
8. A monocular vision-based target depth estimation apparatus, comprising:
the acquisition unit is used for acquiring original video frame data of an acquired target object acquired by a camera arranged on the vehicle in the running process of the vehicle; extracting a plurality of target images from the original video frame data according to the time sequence so as to form target video frame data by the plurality of target images;
the identification unit is used for respectively identifying the target object in each target image so as to acquire the actual pixel coordinates of the target object;
a prediction unit, configured to sequentially determine predicted pixel coordinates of a target object in other target images except for a target image with the latest time in the target video frame data according to a time sequence; the predicted pixel coordinates of the target object in each other target image are determined based on the actual pixel coordinates of the target object in the last target image adjacent to the predicted pixel coordinates, the acquired initial depth value, the parameter information of the camera, the running information of the vehicle and a preset motion model;
the updating unit is used for updating and acquiring the target depth value of the target object in the target image with the latest time according to the predicted pixel coordinates and the actual pixel coordinates of the target object in the other target images; the depth value is used for representing an estimated value of the longitudinal distance between the target object and the camera when the target image with the latest time is acquired.
9. A vehicle control apparatus comprising a memory and a processor; wherein,,
the memory is used for storing a computer program;
the processor being configured to read a computer program stored in the memory and to perform the method according to any of the preceding claims 1-7 according to the computer program in the memory.
10. A vehicle characterized by comprising a vehicle control device;
the target depth estimation is realized through the vehicle control equipment, so that the running of the vehicle is controlled; wherein the vehicle control device is the vehicle control device described in claim 9.
11. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of the preceding claims 1-7.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310193407.XA CN116309773A (en) | 2023-02-28 | 2023-02-28 | Target depth estimation method and device based on monocular vision and vehicle |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116309773A true CN116309773A (en) | 2023-06-23 |
Family
ID=86791846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310193407.XA Pending CN116309773A (en) | 2023-02-28 | 2023-02-28 | Target depth estimation method and device based on monocular vision and vehicle |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116309773A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11035958B2 (en) | Systems and methods for correcting a high-definition map based on detection of obstructing objects | |
JP6893564B2 (en) | Target identification methods, devices, storage media and electronics | |
CN111830953B (en) | Vehicle self-positioning method, device and system | |
CN106170828B (en) | External recognition device | |
CN111141311B (en) | Evaluation method and system of high-precision map positioning module | |
EP3821371A1 (en) | Methods and systems for lane line identification | |
US11257231B2 (en) | Camera agnostic depth network | |
CN114359181B (en) | Intelligent traffic target fusion detection method and system based on image and point cloud | |
CN115797454B (en) | Multi-camera fusion sensing method and device under bird's eye view angle | |
CN113903011A (en) | Semantic map construction and positioning method suitable for indoor parking lot | |
CN110796104A (en) | Target detection method and device, storage medium and unmanned aerial vehicle | |
CN113743163A (en) | Traffic target recognition model training method, traffic target positioning method and device | |
CN113223064A (en) | Method and device for estimating scale of visual inertial odometer | |
CN116309773A (en) | Target depth estimation method and device based on monocular vision and vehicle | |
CN114648639B (en) | Target vehicle detection method, system and device | |
CN113902047B (en) | Image element matching method, device, equipment and storage medium | |
WO2023283929A1 (en) | Method and apparatus for calibrating external parameters of binocular camera | |
Du et al. | Validation of vehicle detection and distance measurement method using virtual vehicle approach | |
CN112818837A (en) | Aerial photography vehicle weight recognition method based on attitude correction and difficult sample perception | |
CN115063594B (en) | Feature extraction method and device based on automatic driving | |
CN116740681B (en) | Target detection method, device, vehicle and storage medium | |
Jain et al. | Deep Learning-Based AI model for Road Detection | |
CN117994748A (en) | Road side aerial view target detection method and device, computing equipment and storage medium | |
CN116596981A (en) | Indoor depth estimation method based on combination of event stream and image frame | |
CN117203678A (en) | Target detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||