CN113468955A - Method, device and storage medium for estimating distance between two points in traffic scene - Google Patents

Method, device and storage medium for estimating distance between two points in traffic scene

Info

Publication number
CN113468955A
Authority
CN
China
Prior art keywords
distance
feature
weight
image
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110556836.XA
Other languages
Chinese (zh)
Other versions
CN113468955B (en)
Inventor
萧允治
王礼闻
许伟林
伦栢江
李永智
肖顺利
陆允
曾国强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hong Kong Productivity Council
Original Assignee
Hong Kong Productivity Council
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hong Kong Productivity Council filed Critical Hong Kong Productivity Council
Priority to CN202110556836.XA priority Critical patent/CN113468955B/en
Publication of CN113468955A publication Critical patent/CN113468955A/en
Application granted granted Critical
Publication of CN113468955B publication Critical patent/CN113468955B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a device and a storage medium for estimating a distance between two points in a traffic scene, wherein the method comprises the following steps: acquiring a video image sequence, wherein the video image sequence comprises a plurality of frames of images collected by a camera associated with a traffic scene; for each frame of image, determining one or more preset physical lengths related to the vehicle in the image by using a first deep learning model; for each preset physical length in the image, determining the distance weight of each corresponding pixel position according to the real length value of the preset physical length; interpolating the distance weights of the pixel positions of the region of interest in the traffic scene by using a second deep learning model to obtain the distance weight of each pixel position of the region of interest; and determining the real distance between any two pixel positions on the image acquired by the camera according to the distance weight of each pixel position of the region of interest. Thereby, accurate distance estimation that does not depend on the extrinsic or intrinsic parameters of the camera is achieved.

Description

Method, device and storage medium for estimating distance between two points in traffic scene
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a method, a device, and a storage medium for estimating a distance between two points in a traffic scene.
Background
Estimating distance from a captured video is a challenging task, since images captured by a camera do not provide distance information for each pixel. Recently, many technologies have provided solutions for distance estimation of road video.
In the related art, some methods calibrate the intrinsic and extrinsic parameters of a camera based on a camera model. The intrinsic parameters include the focal length and the center position of the camera, while the extrinsic parameters consist of a rotation matrix and a translation matrix. For example, some papers estimate the calibration parameters of the camera by finding three vanishing points. However, the vanishing points are detected from the driving trajectories of vehicles, and the vehicles must not change lanes during calibration. Meanwhile, this assumes that the road is straight and flat, which is a considerable limitation in practical applications. Another approach estimates the extrinsic parameters of the camera by mapping feature points in the image to real-world coordinates. It requires that the intrinsic parameters are known, and its accuracy depends heavily on the required feature points. The method detects 10 key points on 10 cars and labels the real-world coordinates of these key points. Using the two-dimensional positions in the image and the three-dimensional real-world coordinates, the calibration is cast as a PnP (Perspective-n-Point) problem, from which the extrinsic parameters of the camera can be estimated. Other methods use lane markings of known width to estimate the calibration parameters. However, lane markings differ from road to road, which requires additional marking effort.
Camera calibration is a challenging task whose goal is to find a mapping function (i.e., a calibration matrix) from the two-dimensional (i.e., image) space to the three-dimensional (i.e., real-world coordinate) space. These methods assume that the scene lies on a plane and estimate the calibration matrix from an accurate vehicle model, or they rely on strict assumptions (e.g., accurate labeling information).
In the related art, patent US 2007/0154068 A1 describes a method for estimating the distance between a moving vehicle and a preceding vehicle using an onboard camera; it requires that the camera be mounted parallel to the road surface and that the focal length be known, and it estimates the distance between the camera and the vehicle ahead from the detected width of that vehicle. Patents EP 1005234 A2 and US 6172601 B1 disclose estimating the distance from the travel distance of the host vehicle. However, these methods are designed for in-vehicle cameras and are intended to calculate the distance to a preceding vehicle or obstacle. Furthermore, these methods have strict assumptions, e.g., the focal length and the camera height must be known, which limits their use.
Disclosure of Invention
To solve the above technical problem or at least partially solve the above technical problem, a method, an apparatus, and a storage medium for estimating a distance between two points in a traffic scene are provided.
In a first aspect, the present application provides a method for estimating a distance between two points in a traffic scene, comprising: acquiring a video image sequence, wherein the video image sequence comprises a plurality of frames of images collected by a camera associated with a traffic scene; for each frame of image in the video image sequence, determining one or more preset physical lengths related to the vehicle in the image by using a first deep learning model; for each preset physical length in the image, determining a distance weight of each pixel position corresponding to the preset physical length according to a real length value of the preset physical length, wherein the distance weight represents the real length represented by the pixel position, and the distance weight comprises a horizontal weight and a vertical weight; interpolating distance weights of pixel positions of a region of interest (ROI) in the traffic scene by using a second deep learning model to obtain the distance weights of all the pixel positions of the region of interest; and determining the real distance between any two pixel positions on the image acquired by the camera according to the distance weight of each pixel position of the region of interest.
In some embodiments, interpolating distance weights for pixel locations of a region of interest in a traffic scene using a second deep learning model to obtain distance weights for respective pixel locations of the region of interest, includes: converting the input image from a color space to a feature space through a first convolution layer of a second deep learning model to obtain a first feature map of the input image; extracting a group of features with different scales of the first feature map through a group of feature extraction blocks of the second deep learning model; through a group of upsampling blocks of a second deep learning model, upsampling, fusing and amplifying the group of features with different scales, and outputting a second feature map with the same size as the first feature map; and inputting the second feature map into a distance estimation head of a second deep learning model, and outputting the distance weight of the pixel position in the region of interest.
In some embodiments, for each feature extraction block, extracting features comprises: outputting a first feature of the first input feature through a second convolution layer of the feature extraction block; mapping the first feature back to the first input feature through a deconvolution layer of the feature extraction block to obtain a second input feature; determining a difference between the first input feature and the second input feature, and inputting the difference into a third convolution layer of the feature extraction block; outputting a compensation term through the third convolution layer of the feature extraction block; and determining the output feature of the feature extraction block according to the compensation term and the first feature.
In some embodiments, for each upsampling block, the features are upsampled, fused, and amplified, including: preprocessing the input characteristics of the up-sampling block through a fourth convolution layer of the up-sampling block; performing up-sampling on the preprocessed features through a bilinear interpolation layer of an up-sampling block; and processing the upsampled features through a fifth convolution layer of the upsampling block to obtain the output features of the upsampling block.
In some embodiments, outputting distance weights for pixel locations in the region of interest by the distance estimation head comprises: outputting the distance weights of the pixel positions in the region of interest through a sixth convolutional layer and a seventh convolutional layer, connected in series, of the distance estimation head, wherein the excitation function of the seventh convolutional layer uses a Sigmoid function to compress the output value to the range of 0 to 1.
In some embodiments, the second deep learning model is trained with at least one of the following as a constraint term: a horizontal direction constraint defining a degree to which the distance weights of horizontally adjacent pixel positions are similar, a vertical direction constraint defining a degree to which the distance weight of a pixel position increases as the pixel position rises vertically, and a video consistency constraint defining a degree to which the distance weights of different frame images are similar.
In some embodiments, the constraint term is a weighted average of a horizontal direction constraint, a vertical direction constraint, and a video conformance constraint.
In some embodiments, determining the real distance between any two pixel positions on the image acquired by the camera according to the distance weight of each pixel position of the region of interest includes: determining a reference origin point on an image acquired by a camera; for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is an accumulation of horizontal weights and the vertical coordinate is an accumulation of vertical weights; and determining the real distance between the two pixel positions according to the horizontal coordinate and the vertical coordinate of the two pixel positions.
In a second aspect, the present application provides a computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor; the computer program, when being executed by the processor, realizes the steps of the method for estimating a distance between two points in a traffic scene of any of the above embodiments.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a program for estimating a distance between two points in a traffic scene, the program for estimating a distance between two points in a traffic scene being executed by a processor to implement the steps of the method for estimating a distance between two points in a traffic scene of any of the above embodiments.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: the method provided by the embodiment of the application realizes accurate distance estimation that does not depend on the extrinsic or intrinsic parameters of the camera.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly described below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flowchart of an embodiment of a method for estimating a distance between two points in a traffic scene according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an embodiment of two-point coordinates provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of one embodiment of a preset physical length provided in an example of the present application;
FIG. 4 is a schematic diagram of one implementation of a distance weight map provided in an example of the present application;
FIG. 5a is a schematic diagram of an example of an un-interpolated distance weight map provided by an embodiment of the present application;
FIG. 5b is a diagram illustrating an example of an interpolated distance weight map provided by an embodiment of the present application;
FIG. 6 is a block diagram illustrating an implementation of a second deep learning model (DEN) provided in an embodiment of the present application;
fig. 7 is a block diagram illustrating a structure of an implementation manner of a feature extraction block (FE) provided in an embodiment of the present application;
FIG. 8 is a block diagram of an implementation of an upsampling block (US) provided in an embodiment of the present application;
FIG. 9 is a block diagram of an embodiment of a Distance Estimation Head (DEH) according to an embodiment of the present disclosure;
FIG. 10 is a diagram illustrating an example of estimating a distance using distance weights provided by an embodiment of the present application; and
fig. 11 is a hardware structure diagram of an implementation manner of a computer device according to an embodiment of the present application.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In the following description, suffixes such as "module", "component", or "unit" used to denote elements are used only for the convenience of description of the present application, and have no specific meaning by themselves. Thus, "module", "component" or "unit" may be used mixedly.
The embodiment of the application relates to a distance estimation method, which automatically calculates the distance between two positions in a captured video of a traffic scene. In the embodiment of the application, the distance is estimated by recording and analyzing robust prior information about the vehicles appearing in the video, imitating human perception. The distance between any points in the road area is calculated automatically, which extends existing traffic camera systems: real-world distance information is embedded into the captured video without additional measurement of the camera. The method can provide useful information for various applications, such as vehicle speed estimation, collision warning systems, intelligent traffic control, etc.
The present application provides a method for estimating a distance between two points in a traffic scene, as shown in fig. 1, the method includes steps S102 to S110.
Step S102, a video image sequence is obtained, wherein the video image sequence comprises a plurality of frames of images collected by a camera associated with a traffic scene.
As an example, a camera is fixedly arranged in a position near a traffic scene, by means of which camera an image of the traffic scene is captured.
Step S104, for each frame of image in the video image sequence, determining one or more preset physical lengths related to the vehicle in the image by using the first deep learning model. As one example, the preset physical length includes a wheel base, a length, and the like of the vehicle.
And step S106, for each preset physical length in the image, determining the distance weight of each pixel position corresponding to the preset physical length according to the real length value of the preset physical length. Wherein the distance weight represents a real length represented by the pixel location, the distance weight comprising a horizontal weight and a vertical weight.
Step S108, the distance weight of the pixel position of the interested area in the traffic scene is interpolated by using the second deep learning model, and the distance weight of each pixel position of the interested area is obtained.
Step S110, determining a real distance between any two pixel positions on the image collected by the camera according to the distance weight of each pixel position of the region of interest.
In some embodiments, in step S110, a reference origin point on the image captured by the camera is determined; for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is an accumulation of horizontal weights and the vertical coordinate is an accumulation of vertical weights; and determining the real distance between the two pixel positions according to the horizontal coordinate and the vertical coordinate of the two pixel positions.
In step S110, a reference origin is defined, which may be any point or a point defined by two prominent intersecting lines on the road. It serves as the reference point when building the real-world distance map. In some embodiments, horizontal and vertical lines are found in the video using the Hough transform. Then, the intersections of the horizontal lines and the vertical lines are found, and one of them is selected as the above-mentioned reference origin.
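As a non-limiting illustration, such a reference origin could be located as sketched below. The OpenCV calls, the thresholds and the rule used to classify detected lines as horizontal or vertical are assumptions made for this example and are not specified by the embodiment.

```python
# Illustrative sketch only: finding a reference origin from two prominent
# intersecting lines with a Hough transform. Thresholds and the horizontal/
# vertical classification rule are assumptions, not the embodiment's values.
import cv2
import numpy as np

def find_reference_origin(frame_bgr, angle_tol_deg=10):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=80, maxLineGap=10)
    if lines is None:
        return None
    horizontals, verticals = [], []
    for x1, y1, x2, y2 in lines[:, 0]:
        angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
        if angle < angle_tol_deg or angle > 180 - angle_tol_deg:
            horizontals.append((x1, y1, x2, y2))
        elif abs(angle - 90) < angle_tol_deg:
            verticals.append((x1, y1, x2, y2))
    if not horizontals or not verticals:
        return None
    # Intersection of the first horizontal line with the first vertical line,
    # obtained by solving a 2x2 linear system for the two line parameters.
    hx1, hy1, hx2, hy2 = horizontals[0]
    vx1, vy1, vx2, vy2 = verticals[0]
    a = np.array([[hx2 - hx1, -(vx2 - vx1)],
                  [hy2 - hy1, -(vy2 - vy1)]], dtype=float)
    b = np.array([vx1 - hx1, vy1 - hy1], dtype=float)
    t, _ = np.linalg.solve(a, b)
    return (hx1 + t * (hx2 - hx1), hy1 + t * (hy2 - hy1))
```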
In the above step S110, the distance S_ab between two given points (x_a, y_a) and (x_b, y_b) on the image can be determined. Taking a small area on the road (e.g. the rectangle in fig. 2) as an example, the area is small and can be modeled as a 2D plane. The plane can be described by two orthogonal vectors (the x-axis and the y-axis in fig. 2). The distance (d in fig. 2) between two points (a and b in fig. 2) can be described by two orthogonal vectors (in the x and y directions in fig. 2). In this way, we can store the real-world distance of each pixel position with two orthogonal elements (a horizontal and a vertical distance weight). Mathematically, the assumption can be written as:
S_ab = S_x + S_y    (1)
where S_x and S_y are two orthogonal distance vectors in the x-axis and y-axis directions, and S_ab represents the distance vector between points a and b.
The horizontal coordinate and the vertical coordinate of each pixel position relative to the reference origin are determined, wherein the horizontal coordinate is an accumulation of horizontal weights and the vertical coordinate is an accumulation of vertical weights. Knowing the real coordinates of all points, the distance between any two points (x_a, y_a) and (x_b, y_b) can be obtained:
S_ab = √((x_a - x_b)² + (y_a - y_b)²)    (2)
where (x_a, y_a) and (x_b, y_b) represent the real-world coordinates of a and b.
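By way of non-limiting illustration, the accumulation of weights into real-world coordinates and the two-point distance of equation (2) could be computed as sketched below; the cumulative-sum formulation and the (row, column) indexing convention are assumptions made for this example.

```python
# Illustrative sketch: accumulating the horizontal/vertical weight maps into
# real-world coordinates relative to a reference origin, then measuring the
# distance between two pixel positions with equation (2).
import numpy as np

def coordinate_maps(weight_x, weight_y, origin_rc):
    """weight_x, weight_y: HxW maps of horizontal/vertical weights (metres per pixel).
    origin_rc: (row, col) of the reference origin in pixel coordinates."""
    o_row, o_col = origin_rc
    x_coord = np.cumsum(weight_x, axis=1)          # accumulate horizontal weights
    x_coord -= x_coord[:, [o_col]]                 # zero at the origin column
    y_coord = np.cumsum(weight_y, axis=0)          # accumulate vertical weights
    y_coord -= y_coord[[o_row], :]                 # zero at the origin row
    return x_coord, y_coord

def real_distance(x_coord, y_coord, point_a, point_b):
    """point_a, point_b: (row, col) pixel positions; returns the distance in metres."""
    (ra, ca), (rb, cb) = point_a, point_b
    dx = x_coord[ra, ca] - x_coord[rb, cb]
    dy = y_coord[ra, ca] - y_coord[rb, cb]
    return float(np.hypot(dx, dy))                 # equation (2)
```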
In the embodiment of the present application, in step S104, for the captured image, the preset physical lengths of different objects that persistently, repeatedly and consistently appear in the scene, such as the vehicles in fig. 3, are identified using the first deep learning model. The standard width of a vehicle in the real world is 1.6-2.0 m, and the wheelbase length is about 2.6 m, depending on the vehicle type. In step S106, the distances within the image are roughly estimated using this prior knowledge.
For example, in step S104, the vehicle and its wheels can easily be found by detecting the vehicle in the image using Mask R-CNN. The preset physical lengths of the detected object are known, because many lengths occur persistently, repeatedly and consistently in the scene. Vehicles appear repeatedly in the captured video segment; their width is fixed in the real world, while their width in the image varies consistently as the vehicle moves along the road direction. Likewise, the wheelbase length is relatively fixed in the real world, while its length in the image changes continuously as the vehicle moves. These preset physical lengths can therefore be used as a scale to measure real-world lengths in small areas of the image.
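As a non-limiting illustration, vehicle detection with an off-the-shelf Mask R-CNN could be sketched as follows; the torchvision weights, the score threshold and the COCO class id used for "car" are assumptions of this example, not requirements of the embodiment.

```python
# Illustrative sketch: detecting vehicles with torchvision's pretrained Mask R-CNN
# as a stand-in for the "first deep learning model". The weights, the score
# threshold and the COCO class id for "car" are assumptions of this example.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

COCO_CAR_ID = 3  # "car" in the COCO label map used by torchvision's detection models

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_vehicle_widths(frame_rgb, score_thresh=0.7):
    """Return (x1, y1, x2, y2, width_in_pixels) for each confidently detected car."""
    pred = model([to_tensor(frame_rgb)])[0]
    keep = (pred["labels"] == COCO_CAR_ID) & (pred["scores"] > score_thresh)
    return [(float(x1), float(y1), float(x2), float(y2), float(x2 - x1))
            for x1, y1, x2, y2 in pred["boxes"][keep]]
```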
After finding the preset physical lengths associated with the vehicle in the image using the first deep learning model, in step S106 the ratio (referred to herein as the distance weight, denoted δ) between the real-world length (denoted L) and the pixel length (denoted l) is calculated. Mathematically, the distance weight is defined as:
δ = L / l    (3)
according to the distance model in equation (3), each pixel position is regarded as an infinitesimal area. For each pixel position of the image, we define a distance weight δ to represent the ratio between the real world and the pixel length, i.e. the real length represented by each pixel position. For each frame of video, a first deep learning model is first used to identify an object of interest, and a preset physical length of interest (e.g., the width of a vehicle, the wheelbase length of the vehicle, etc.) is found from the image. And calculating the distance weight of the pixel position according to the preset physical length which is fixed and known in the real world in advance.
Taking fig. 4 as an example, a white vehicle is detected in a captured image using the first deep learning model. The width of the vehicle measured at the detected position is 55 pixels. In the real world, the average width of a vehicle is about 1.8 m. A distance weight is calculated according to formula (3); it represents the relationship between the real-world length and the pixel length at the vehicle's position in the image. The weight values are therefore recorded at the vehicle positions to create a map (called a distance weight map, see the right part of fig. 4). Similarly, weights are calculated for all vehicles (or other lengths of interest) in the image by the same procedure and recorded on the distance weight map.
For a video sequence, each frame may be processed. An example is given in fig. 5a, where a distance weight map records the distance weights for a sequence of video images. As shown in fig. 5a, a distance weight map with a set of recorded values is obtained through the preceding processing. However, these values are sparse; as shown in fig. 5a, a large portion of the region of interest has no recorded values. Much of the road area therefore still requires weight information to be obtained by interpolation.
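As a non-limiting illustration, the sparse distance weight map could be accumulated over a video sequence as sketched below; the assumed average vehicle width, the recording row and the averaging of repeated observations are assumptions of this example.

```python
# Illustrative sketch: accumulating a sparse distance-weight map over a video
# sequence by recording delta = real_length / pixel_length (equation (3)) at the
# pixel positions where a known physical length was observed.
import numpy as np

AVERAGE_CAR_WIDTH_M = 1.8  # assumed real-world average vehicle width

def build_sparse_weight_map(frames_detections, height, width):
    """frames_detections: per-frame lists of (x1, y1, x2, y2, width_px) detections."""
    weight_sum = np.zeros((height, width))
    counts = np.zeros((height, width), dtype=np.int64)
    for detections in frames_detections:
        for x1, y1, x2, y2, width_px in detections:
            if width_px <= 0:
                continue
            delta = AVERAGE_CAR_WIDTH_M / width_px               # metres per pixel
            row = min(max(int(round((y1 + y2) / 2)), 0), height - 1)
            c1, c2 = max(int(x1), 0), min(int(x2), width)
            weight_sum[row, c1:c2] += delta                      # record along the width
            counts[row, c1:c2] += 1
    sparse = np.full((height, width), np.nan)                    # NaN = not yet observed
    observed = counts > 0
    sparse[observed] = weight_sum[observed] / counts[observed]
    return sparse  # to be densified by the second deep learning model (the DEN)
```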
After proper training, a convolutional neural network (CNN) can interpolate well and fill in the blanks according to designed rules. Therefore, in step S108 described above, the distance weights of the road region are interpolated using the second deep learning model. After the interpolation process, as shown in fig. 5b, every position of the weight map has a distance weight representing the real-world distance.
In step S108, any deep learning model with interpolation capability is feasible; in addition, a new second deep learning model is proposed in the embodiment of the present application.
An implementation of the second deep learning model according to the embodiment of the present application is described below.
Distance estimation is a high-level video understanding task that requires a fairly large receptive field to analyze context. Localization information is also crucial for refining the details of the distance map. The embodiment of the present application proposes a second deep learning model, called the Distance Estimation Network (DEN), which combines the context information of global features with the localization information of local features to facilitate understanding of the input image. In order to reduce the demand on computing power in practical applications, it has a lightweight structure and is easy to deploy.
Distance Estimation Network (DEN)
As shown in fig. 6, the DEN is composed of an upper half path and a lower half path. The upper half path includes a set of Feature Extraction blocks (FE), configured to downsample the input image step by step, extracting more and more global features for use in the subsequent process; as illustrated in fig. 6, these are the feature extraction blocks 201, 202 and 203. The lower half path includes a set of Up-sampling blocks (US), configured to receive the extracted features at different scales and to successively process, fuse and magnify the features to restore them to the original size; as shown in fig. 6, these are the upsampling blocks 301, 302 and 303. Finally, there are two Distance Estimation Heads (DEH) 401 and 402: the distance estimation head 401 is configured to predict the distance weights Δ̂_x in the horizontal direction (horizontal weights), and the distance estimation head 402 is configured to predict the distance weights Δ̂_y in the vertical direction (vertical weights). As shown in fig. 6, the upper half path further includes a convolution layer 501 located before the feature extraction blocks, configured to process the input image 101 and convert it from the color space to the feature space, obtaining a feature map 102; between the upper half path and the lower half path there is also a convolutional layer 502 configured to process the features output by the feature extraction block 203.
Mathematically, the function of the DEN is described as:
(Δ̂_x, Δ̂_y) = f(I)    (4)
where f(·) represents the DEN, the symbol I ∈ R^(W×H×3) represents an input image having three (i.e., RGB) channels, and the outputs Δ̂_x and Δ̂_y are the estimated distance weight maps for the horizontal and vertical directions, respectively.
In the above step S108, the input image 101 (with a scale of 640 × 360 × 3, as an example) is converted from the color space to the feature space by the convolution layer 501, obtaining the feature map 102 (with a scale of 640 × 360 × 16, as an example). The feature extraction blocks 201, 202 and 203 extract a set of features of different scales from the feature map 102; as shown in fig. 6, these include the features 103, 104 and 105, whose scales are, as an example, 360 × 180 × 32, 180 × 90 × 64 and 90 × 45 × 128, respectively. The feature map 109, which has the same size as the feature map 102, is output by the upsampling blocks 301, 302 and 303, where the upsampling block 301 processes the feature map 106 (output by the convolutional layer 502, with a scale of 90 × 45 × 256), the upsampling block 302 processes the feature map 107 (composed of the feature map 104 and the output of the upsampling block 301, with a scale of 180 × 90 × (64+64)), and the upsampling block 303 processes the feature map 108 (composed of the feature map 103 and the output of the upsampling block 302, with a scale of 360 × 180 × (32+32)). The feature map 109 (composed of the feature map 102 and the output of the upsampling block 303, with a scale of 640 × 360 × (16+16)) is input to the distance estimation heads 401 and 402, which output the distance weights Δ̂_x and Δ̂_y of the pixel positions in the region of interest.
feature extraction block (Feature) Extraction Block,FE)
The quality of the extracted features has a large influence on the performance of the DEN; therefore, efficiently extracting useful information from the input image is one of the most important parts of the DEN. In order to extract features effectively for the distance estimation task, the embodiment of the present application provides a feature extraction block. As shown in fig. 7, the input is a feature map of size 640 × 360 × 16, denoted X. First, X enters the convolutional layer E1, which extracts the features Ŷ with more global information (of size 360 × 180 × 32). To evaluate the extracted features Ŷ, a deconvolution layer D acting on Ŷ maps them back to the original size, giving the estimate X̂ in the input domain. By calculating R_X = X - X̂, the difference between the original X and the estimate X̂ is obtained. This difference represents the quality of the extracted features Ŷ, because good features Ŷ should give a small estimation error R_X in the input domain. Based on the measured difference R_X, a compensation term R_Y is estimated by the convolutional layer E2. The compensation term R_Y enhances the extracted features Ŷ. Mathematically, the process of the feature extraction block is described as:
Ŷ = E1(X), X̂ = D(Ŷ), R_X = X - X̂, R_Y = E2(R_X), Y = Ŷ + R_Y    (5)
the feature extraction block sequentially extracts multi-scale features from the input image, which extracts four different-scale features from the top half of the path, as shown in fig. 6. Features of different scales contain different semantic level information.
In some embodiments, for each feature extraction block, extracting features comprises: outputting the features Ŷ of the input features X through the convolutional layer E1 of the feature extraction block; mapping the features Ŷ back to the input features X through the deconvolution layer D of the feature extraction block to obtain the features X̂; determining the difference R_X between the features X and X̂, and inputting the difference R_X into the convolutional layer E2 of the feature extraction block; outputting the compensation term R_Y through the convolutional layer E2; and determining the output features of the feature extraction block according to the compensation term R_Y and the features Ŷ.
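As a non-limiting illustration, the feature extraction block could be sketched as follows; the kernel sizes, the stride-2 convolution/deconvolution used for the scale change, the activation function and the additive combination of Ŷ and R_Y are assumptions of this example.

```python
# Illustrative sketch of the feature extraction block of fig. 7 following equation (5).
# The kernel sizes, the stride-2 convolution/deconvolution for the scale change, the
# activation and the additive combination of Y_hat and R_Y are assumptions.
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.e1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)          # extract global features
        self.d = nn.ConvTranspose2d(out_ch, in_ch, 4, stride=2, padding=1)  # map back to the input domain
        self.e2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)          # estimate the compensation term
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y_hat = self.act(self.e1(x))    # Y_hat = E1(X)
        x_hat = self.d(y_hat)           # X_hat = D(Y_hat), estimate of the input
        r_x = x - x_hat                 # R_X, estimation error in the input domain
        r_y = self.e2(r_x)              # R_Y = E2(R_X), compensation term
        return y_hat + r_y              # enhanced output features

# fe = FeatureExtractionBlock(16, 32)
# out = fe(torch.randn(1, 16, 360, 640))   # -> 1x32x180x320 under these assumptions
```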
Up-sampling Block (US)
The lower half path uses a set of upsampling blocks to process, fuse and amplify the features step by step, restoring them to the original size. As shown in fig. 8, an upsampling block (US) includes: a convolutional layer E3 that preprocesses the features, a bilinear interpolation layer B that upsamples the features, and another convolutional layer E4 that processes the upsampled features.
In some embodiments, for each upsampling block, upsampling, fusing and amplifying the features includes: preprocessing the input features 81 of the upsampling block (for example, with a scale of 360 × 180 × (32+32)) through its convolution layer E3; upsampling the preprocessed features 82 (illustratively, with a scale of 360 × 180 × 32) through the bilinear interpolation layer B of the upsampling block; and processing the upsampled features 83 (illustratively, with a scale of 640 × 360 × 32) through the convolution layer E4 of the upsampling block, obtaining the output features 84 (illustratively, with a scale of 640 × 360 × 16) of the upsampling block.
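As a non-limiting illustration, such an upsampling block could be sketched as follows; the 3 × 3 kernels, the activation function and the explicit target size passed to the bilinear interpolation are assumptions of this example.

```python
# Illustrative sketch of the upsampling block of fig. 8: convolution E3 to preprocess,
# bilinear interpolation layer B to upsample, convolution E4 to process the result.
# Kernel sizes, the activation and the explicit target size are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsamplingBlock(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.e3 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)   # preprocess the input features
        self.e4 = nn.Conv2d(mid_ch, out_ch, 3, padding=1)  # process the upsampled features

    def forward(self, x, out_size):
        x = F.relu(self.e3(x))
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)  # layer B
        return F.relu(self.e4(x))

# us = UpsamplingBlock(32 + 32, 32, 16)
# out = us(torch.randn(1, 64, 180, 360), out_size=(360, 640))  # -> 1x16x360x640
```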
Distance Estimation Head (DEH)
Through the feature extraction and upsampling processes, the obtained feature map contains rich distance estimation information. The DEH is designed to predict the distance weight maps from the obtained features. As shown in fig. 9, each DEH has convolutional layers E5 and E6, where the excitation function of the second convolutional layer E6 uses a Sigmoid function to compress the output values to the range of 0 to 1. By way of example, as shown in FIG. 9, the scale of the input features 91 is 640 × 360 × (16+16), the scale of the features 92 output by the convolutional layer E5 is 640 × 360 × 32, and the scale of the features 93 output by the convolutional layer E6 is 640 × 360 × 1. The reasons for using the Sigmoid function are: (a) the distance weight represents the ratio between the real-world distance and the pixel length and is always positive; (b) excessively large values are not useful in the application: if an object is too far away, it is difficult to detect and measure, resulting in large errors; (c) the labeled distance maps (Δ_x and Δ_y) are discrete, and the training process mainly measures the differences at the labeled positions. Because the labeled positions differ between frames, the losses of different frames differ greatly, making the gradients unstable during training. With the Sigmoid function, when the estimated value is too small or too large the gradient changes little (it is close to zero), thereby filtering gradient noise and stabilizing the training process.
In some embodiments, outputting the distance weights of the pixel positions in the region of interest by the distance estimation head comprises: outputting the distance weights of the pixel positions in the region of interest through the two convolutional layers E5 and E6 of the distance estimation head connected in series, where the excitation function of the second convolutional layer E6 uses a Sigmoid function to compress the output values to the range of 0 to 1.
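As a non-limiting illustration, the distance estimation head could be sketched as follows; the kernel sizes and the intermediate channel count are assumptions of this example.

```python
# Illustrative sketch of the distance estimation head of fig. 9: two stacked
# convolutions E5 and E6, with a Sigmoid after E6 compressing the output to (0, 1).
# Kernel sizes and the intermediate channel count are assumptions.
import torch
import torch.nn as nn

class DistanceEstimationHead(nn.Module):
    def __init__(self, in_ch=16 + 16, mid_ch=32):
        super().__init__()
        self.e5 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)
        self.e6 = nn.Conv2d(mid_ch, 1, 3, padding=1)

    def forward(self, x):
        x = torch.relu(self.e5(x))
        return torch.sigmoid(self.e6(x))   # one-channel distance weight map in (0, 1)

# head = DistanceEstimationHead()
# weights = head(torch.randn(1, 32, 360, 640))   # -> 1x1x360x640
```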
Constraint conditions
In some embodiments, the second deep learning model is trained with at least one of the following as a constraint term: a horizontal direction constraint defining a degree to which the distance weights of horizontally adjacent pixel positions are similar, a vertical direction constraint defining a degree to which the distance weight of a pixel position increases as the pixel position rises vertically, and a video consistency constraint defining a degree to which the distance weights of different frame images are similar.
In some embodiments, the constraint term is a weighted average of a horizontal direction constraint, a vertical direction constraint, and a video conformance constraint.
Prior knowledge of the traffic video is used to assist the training of the deep learning model. The embodiment of the application defines a constraint term that consists of three parts: the horizontal direction constraint Ω_h, the vertical direction constraint Ω_v and the video consistency constraint Ω_vid. The constraint term is used as part of the loss function to optimize the deep learning model during the training phase. Mathematically, it is defined as:
Ω = λ_1·Ω_h + λ_2·Ω_v + λ_3·Ω_vid    (6)
where λ_1, λ_2 and λ_3 are coefficients that balance the three terms.
Horizontal direction constraint Ω_h: the embodiment of the application considers that a traffic camera always faces the road and that the road direction is mainly longitudinal. The distance weight (i.e., the ratio of the measured distance to the pixel distance) varies significantly along the road direction (the vertical direction), while horizontally adjacent positions have similar weights. In order to restrict the weight change in the horizontal direction, the embodiment of the present application defines the horizontal constraint term Ω_h as follows:
Ω_h = (1 / (W·H)) · Σ_i Σ_j | δ̂_(i+1,j) - δ̂_(i,j) |    (7)
where W and H represent the width and height of the video frame, δ̂ represents the estimated distance weight, i and j are the indices in the horizontal and vertical directions, and the origin of the pixel index is located in the upper left corner of the image.
Vertical direction constraint Ω_v: in one frame of a traffic video, the sky (if present) is always at the top, while the ground is at the bottom. For pixels in the same column, the ratio of the physical length to the pixel length at a top position is always larger than at a bottom position, so the distance weight δ increases as the position rises vertically. Based on this property, the constraint term in the vertical direction Ω_v is defined as follows:
Ω_v = (1 / (W·H)) · Σ_i Σ_j max(0, δ̂_(i,j+1) - δ̂_(i,j))    (8)
video consistency constraint omegavid: the position of the road camera is fixed. The captured video always contains a phaseThe same scenario. Therefore, video frames from the same video sequence should share the same distance weight map Δ. For each iteration of the training process, the distance estimation network of the embodiment of the present application optimizes the gradients of a batch of video frames from the same video sequence. Estimated weight map for different frames
Figure BDA0003077589320000142
It should be the same that the embodiments of the present application define the constraints as follows:
Figure BDA0003077589320000143
where L ∈ {1, 2.,. L } represents a sample index in a batch of video frames.
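As a non-limiting illustration, one possible realization of the three constraint terms is sketched below; the concrete penalty forms (an L1 penalty on horizontal differences, a hinge penalty on the vertical ordering, and the deviation from the batch mean) and the balancing coefficients are assumptions of this example.

```python
# Illustrative sketch of the three training constraints, assuming one possible
# realization: an L1 penalty on horizontal differences (Omega_h), a hinge penalty
# on the vertical ordering (Omega_v) and the deviation from the batch mean
# (Omega_vid). These concrete penalty forms and the coefficients are assumptions.
import torch

def constraint_term(pred, lambda_h=1.0, lambda_v=1.0, lambda_vid=1.0):
    """pred: Lx1xHxW estimated weight maps for a batch of frames from one sequence."""
    # Horizontal constraint: weights of horizontally adjacent pixels should be close.
    omega_h = (pred[..., :, 1:] - pred[..., :, :-1]).abs().mean()
    # Vertical constraint: with the pixel origin at the top-left, the weight should
    # not increase from an upper pixel (row j) to the pixel below it (row j + 1).
    omega_v = torch.relu(pred[..., 1:, :] - pred[..., :-1, :]).mean()
    # Video consistency: frames of the same sequence should share one weight map.
    omega_vid = (pred - pred.mean(dim=0, keepdim=True)).abs().mean()
    return lambda_h * omega_h + lambda_v * omega_v + lambda_vid * omega_vid
```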
Fig. 10 gives an example of the proposed method. In this example, a short video sequence is collected from a scene and two distance weight maps, in the horizontal and vertical directions, are created with our method, as shown in (a) and (b) of fig. 10. With one corner of the stop line as the reference point (see the dot in (c) of fig. 10), the distance from it is calculated. Each circle in the figure represents ten meters. The distance from the reference point is indicated by "contour lines". The estimated distance increases radially, and the "contours" form regular ellipses. As shown in (d) of fig. 10, sample distances between several pairs of points are calculated, which indicates that the embodiment of the present application can accurately estimate the distance between any points in the road region.
In the embodiment of the present application, each pixel position is regarded as an infinitesimal area, and the road in a local area can be modeled as a small two-dimensional plane; an undulating road can be described by a set of such two-dimensional planes. A reference origin for the scene is determined. One to several preset physical lengths in the scene are determined; they appear persistently, repeatedly and consistently in the scene, so that their statistical length remains unchanged. The obtained preset physical lengths are mapped onto the pixels of the image obtained by the camera, forming a sketch of the actual distances on the map. A deep learning method is used to find a pair of weight values for each pixel of the image within the region of interest, forming a weight map of the scene. Any deep learning model with interpolation capability can be used to construct the weight map of the scene, which is also an advantage of this method. However, a special deep learning architecture (the DEN network) is also proposed, whose loss function takes care of the consistency of the x- and y-coordinate values. In order to train the deep learning network for distance estimation, it uses a novel set of constraints, forcing the network to interpolate uncovered points using natural reasoning. Using the weight map, the coordinates of all pixels in the RoI can be estimated with reference to the origin, and the distance between two points is then calculated from their coordinates with high precision.
The embodiment of the application also provides computer equipment. Fig. 11 is a schematic hardware structure diagram of an implementation manner of a computer device provided in an embodiment of the present application, and as shown in fig. 11, the computer device 10 according to the embodiment of the present application at least includes, but is not limited to: a memory 11 and a processor 12 communicatively coupled to each other via a system bus. It should be noted that fig. 11 only shows a computer device 10 with components 11-12, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the memory 11 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 11 may be an internal storage unit of the computer device 10, such as a hard disk or a memory of the computer device 10. In other embodiments, the memory 11 may also be an external storage device of the computer device 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 10. Of course, the memory 11 may also include both internal and external storage devices of the computer device 10. In this embodiment, the memory 11 is generally used for storing an operating system and various types of software installed in the computer device 10. Further, the memory 11 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 12 is generally operative to control overall operation of the computer device 10. In this embodiment, the processor 12 is configured to execute program code stored in the memory 11 or to process data, such as a method for estimating a distance between two points in a traffic scene.
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored, which when executed by a processor implements corresponding functions. The computer-readable storage medium of the present embodiments is for storing program code for a method for estimating a distance between two points in a traffic scene, which when executed by a processor implements the method for estimating a distance between two points in a traffic scene.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
While the present embodiments have been described with reference to the accompanying drawings, it is to be understood that the invention is not limited to the precise embodiments described above, which are meant to be illustrative and not restrictive, and that various changes may be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for estimating a distance between two points in a traffic scene, comprising:
acquiring a video image sequence, wherein the video image sequence comprises a plurality of frames of images collected by a camera associated with the traffic scene;
for each frame of image in the video image sequence, determining one or more preset physical lengths related to the vehicle in the image by using a first deep learning model; for each preset physical length in the image, determining a distance weight of each pixel position corresponding to the preset physical length according to a real length value of the preset physical length, wherein the distance weight represents the real length represented by the pixel position, and the distance weight comprises a horizontal weight and a vertical weight;
interpolating the distance weight of the pixel position of the interest area in the traffic scene by using a second deep learning model to obtain the distance weight of each pixel position of the interest area;
and determining the actual distance between any two pixel positions on the image acquired by the camera according to the distance weight of each pixel position of the region of interest.
2. The method of claim 1, wherein interpolating distance weights for pixel locations of a region of interest in the traffic scene using a second deep learning model to obtain distance weights for respective pixel locations of the region of interest comprises:
converting an input image from a color space to a feature space through a first convolution layer of the second deep learning model to obtain a first feature map of the input image;
extracting a set of features of different scales of the first feature map through a set of feature extraction blocks of the second deep learning model;
through a group of upsampling blocks of the second deep learning model, upsampling, fusing and amplifying the group of features with different scales, and outputting a second feature map with the same size as the first feature map;
and inputting the second feature map into a distance estimation head of the second deep learning model, and outputting a distance weight of a pixel position in the region of interest.
3. The method of claim 2, wherein for each feature extraction block, extracting features comprises:
outputting a first feature of a first input feature through a second convolution layer of the feature extraction block;
mapping the first feature back to the first input feature through the deconvolution layer of the feature extraction block to obtain a second input feature;
determining a difference between the first input feature and the second input feature, the difference being input into a third convolution layer of the feature extraction block;
outputting a compensation term through a third convolution layer of the feature extraction block;
determining an output feature of the feature extraction block according to the compensation term and the first feature.
4. The method of claim 2, wherein upsampling, fusing, and amplifying features for each upsampled block comprises:
preprocessing input features of an upsampling block through a fourth convolution layer of the upsampling block;
performing upsampling on the preprocessed features through a bilinear interpolation layer of the upsampling block;
and processing the upsampled features through a fifth convolution layer of the upsampling block to obtain the output features of the upsampling block.
5. The method of claim 2, wherein outputting, by the distance estimation head, distance weights for pixel locations in the region of interest comprises:
outputting distance weights for pixel locations in the region of interest by a sixth convolutional layer and a seventh convolutional layer, connected in series, of the distance estimation head, wherein an excitation function of the seventh convolutional layer compresses an output value to a range of 0 to 1 using a Sigmoid function.
6. The method according to any one of claims 1 to 5, wherein the second deep learning model is trained with at least one of the following as a constraint term: a horizontal direction constraint defining a degree to which the distance weights of horizontally adjacent pixel positions are similar, a vertical direction constraint defining a degree to which the distance weight of a pixel position increases as the pixel position rises vertically, and a video consistency constraint defining a degree to which the distance weights of different frame images are similar.
7. The method of claim 6, wherein the constraint term is a weighted average of the horizontal direction constraint, the vertical direction constraint, and the video conformance constraint.
8. The method according to any one of claims 1 to 5, wherein determining the real distance between any two pixel positions on the image acquired by the camera according to the distance weight of each pixel position of the region of interest comprises:
determining a reference origin on an image acquired by the camera;
for each of any two pixel positions, determining a horizontal coordinate and a vertical coordinate of the pixel position relative to a reference origin, wherein the horizontal coordinate is an accumulation of horizontal weights and the vertical coordinate is an accumulation of vertical weights;
and determining the real distance between the two pixel positions according to the horizontal coordinate and the vertical coordinate of the two pixel positions.
9. A computer device, characterized in that the computer device comprises:
a memory, a processor, and a computer program stored on the memory and executable on the processor;
the computer program implementing the steps of the method for estimating a distance between two points in a traffic scene as claimed in any one of claims 1 to 8 when executed by the processor.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a program for estimating a distance between two points in a traffic scene, which program, when being executed by a processor, carries out the steps of the method for estimating a distance between two points in a traffic scene as set forth in any one of the claims 1 to 8.
CN202110556836.XA 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene Active CN113468955B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110556836.XA CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110556836.XA CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Publications (2)

Publication Number Publication Date
CN113468955A true CN113468955A (en) 2021-10-01
CN113468955B CN113468955B (en) 2024-02-02

Family

ID=77871028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110556836.XA Active CN113468955B (en) 2021-05-21 2021-05-21 Method, device and storage medium for estimating distance between two points in traffic scene

Country Status (1)

Country Link
CN (1) CN113468955B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663497A (en) * 2022-03-24 2022-06-24 智道网联科技(北京)有限公司 Distance measuring method, device and equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3027432A1 (en) * 2014-10-16 2016-04-22 Valeo Schalter & Sensoren Gmbh DISTANCE ESTIMATION OF A PIETON BY AN IMAGING SYSTEM ON A MOTOR VEHICLE
CN109325393A (en) * 2017-08-01 2019-02-12 苹果公司 Using the face detection of single network, Attitude estimation and away from the estimation of camera distance
JP2019211391A (en) * 2018-06-07 2019-12-12 株式会社東芝 Distance measuring device, voice processing device, vibration measuring device, computed tomograph for industry, and distance measuring method
CN111024040A (en) * 2018-10-10 2020-04-17 三星电子株式会社 Distance estimation method and apparatus
US20200258249A1 (en) * 2017-11-15 2020-08-13 Google Llc Unsupervised learning of image depth and ego-motion prediction neural networks
US20200302234A1 (en) * 2019-03-22 2020-09-24 Capital One Services, Llc System and method for efficient generation of machine-learning models
US20200320462A1 (en) * 2019-04-03 2020-10-08 International Business Machines Corporation Calculating online social network distance between entities of an organization

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR3027432A1 (en) * 2014-10-16 2016-04-22 Valeo Schalter & Sensoren Gmbh DISTANCE ESTIMATION OF A PIETON BY AN IMAGING SYSTEM ON A MOTOR VEHICLE
CN109325393A (en) * 2017-08-01 2019-02-12 苹果公司 Using the face detection of single network, Attitude estimation and away from the estimation of camera distance
US20200258249A1 (en) * 2017-11-15 2020-08-13 Google Llc Unsupervised learning of image depth and ego-motion prediction neural networks
JP2019211391A (en) * 2018-06-07 2019-12-12 株式会社東芝 Distance measuring device, voice processing device, vibration measuring device, computed tomograph for industry, and distance measuring method
CN111024040A (en) * 2018-10-10 2020-04-17 三星电子株式会社 Distance estimation method and apparatus
US20200302234A1 (en) * 2019-03-22 2020-09-24 Capital One Services, Llc System and method for efficient generation of machine-learning models
US20200320462A1 (en) * 2019-04-03 2020-10-08 International Business Machines Corporation Calculating online social network distance between entities of an organization

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663497A (en) * 2022-03-24 2022-06-24 智道网联科技(北京)有限公司 Distance measuring method, device and equipment

Also Published As

Publication number Publication date
CN113468955B (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN108564874B (en) Ground mark extraction method, model training method, device and storage medium
CN110163930B (en) Lane line generation method, device, equipment, system and readable storage medium
CN112528878A (en) Method and device for detecting lane line, terminal device and readable storage medium
CN104766058B (en) A kind of method and apparatus for obtaining lane line
CN111179152B (en) Road identification recognition method and device, medium and terminal
US8437501B1 (en) Using image and laser constraints to obtain consistent and improved pose estimates in vehicle pose databases
CN105667518A (en) Lane detection method and device
CN115717894B (en) Vehicle high-precision positioning method based on GPS and common navigation map
CN113819890A (en) Distance measuring method, distance measuring device, electronic equipment and storage medium
WO2022134996A1 (en) Lane line detection method based on deep learning, and apparatus
AliAkbarpour et al. Fast structure from motion for sequential and wide area motion imagery
CN111539484A (en) Method and device for training neural network
CN114359181A (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN113340312A (en) AR indoor live-action navigation method and system
CN117576652B (en) Road object identification method and device, storage medium and electronic equipment
CN111256693A (en) Pose change calculation method and vehicle-mounted terminal
CN113449692A (en) Map lane information updating method and system based on unmanned aerial vehicle
CN114677435A (en) Point cloud panoramic fusion element extraction method and system
CN116452852A (en) Automatic generation method of high-precision vector map
CN116246119A (en) 3D target detection method, electronic device and storage medium
CN113468955B (en) Method, device and storage medium for estimating distance between two points in traffic scene
CN109115232B (en) Navigation method and device
CN114972470B (en) Road surface environment obtaining method and system based on binocular vision
CN109657556B (en) Method and system for classifying road and surrounding ground objects thereof
CN117011481A (en) Method and device for constructing three-dimensional map, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant