CN117635730A - Binocular distance measurement method and system based on depth stereo matching algorithm - Google Patents


Info

Publication number
CN117635730A
CN117635730A (application CN202311617480.1A)
Authority
CN
China
Prior art keywords
feature
parallax
live
action
camera calibration
Prior art date
Legal status
Pending
Application number
CN202311617480.1A
Other languages
Chinese (zh)
Inventor
孔群
张桂文
王付祥
王雪
Current Assignee
Qilu Institute of Technology
Original Assignee
Qilu Institute of Technology
Priority date
Filing date
Publication date
Application filed by Qilu Institute of Technology filed Critical Qilu Institute of Technology
Priority to CN202311617480.1A
Publication of CN117635730A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Image Processing (AREA)

Abstract

The application discloses a binocular distance measurement method and system based on a depth stereo matching algorithm. Checkerboard image pairs and live-action stereo image pairs of the same scene are acquired at different angles and distances; error analysis is performed on the checkerboard image pairs to obtain camera calibration parameters; the live-action stereo image pair is rectified with these parameters so that its two images lie in the same plane and are parallel to each other; a disparity map of the rectified live-action stereo image pair is obtained by constructing a stereo matching network; and scene depth information is computed from the disparity map to obtain the measured distance. In the stereo matching stage, multi-scale feature fusion is performed separately on the two images of the input live-action stereo image pair, and self-attention and cross-attention mechanisms are introduced to improve the matching quality of weak-texture regions; during disparity computation, a gated recurrent unit performs iterative disparity updates, yielding a more accurate disparity map.

Description

Binocular distance measurement method and system based on depth stereo matching algorithm
Technical Field
The application relates to the technical field of computer vision, in particular to a binocular distance measuring method and system based on a depth stereo matching algorithm.
Background
At present, the prevailing ranging technologies on the market are based on physical equipment such as ultrasonic sensors and lidar. They can achieve long-range measurement at very high precision, but their cost is high, they suit only specific fields such as the military, and their utilization in daily life is low. Moreover, although ultrasonic ranging is simple to operate, its error is large, and lidar ranging has poor robustness. In contrast, binocular vision ranging can recover the three-dimensional information of a target object from only a pair of left and right stereo images captured at the same moment; its ranging precision and measurement error depend only on the algorithm used, and it places no excessive demands on hardware. After many years of development, binocular vision ranging has found wide application in daily life owing to its low cost and simple implementation.
Binocular ranging comprises five steps: image acquisition, camera calibration, stereo rectification, stereo matching, and depth estimation. Depth information is closely tied to the stereo matching step: the lower the mismatch rate of the disparity map, the better the converted depth map, and the closer the final measured distance is to the actual distance. Stereo matching is therefore the core link and also the most challenging one, and the quality of its result directly influences the subsequent depth estimation.
With the development of deep learning, algorithms represented by convolutional neural networks (Convolutional Neural Networks, CNNs) have been widely applied to binocular stereo matching, improving disparity estimation accuracy and significantly surpassing conventional methods. However, dense matching based on CNNs has a limited receptive field and may fail to distinguish ambiguous regions.
Disclosure of Invention
In order to solve the technical problems, the application provides the following technical scheme:
In a first aspect, an embodiment of the present application provides a binocular ranging method based on a depth stereo matching algorithm, comprising:
acquiring checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances;
performing error analysis on the checkerboard image pair to obtain camera calibration parameters;
correcting the live-action stereoscopic image pair through the camera calibration parameters so that two images in the live-action stereoscopic image pair are positioned on the same plane and are parallel to each other;
obtaining a parallax image of the corrected live-action stereoscopic image pair by constructing a stereoscopic matching network;
and calculating scene depth information of the parallax map to obtain a measurement distance.
In one possible implementation manner, the performing error analysis on the checkerboard image pair to obtain a camera calibration parameter includes:
determining a camera calibration tool, wherein the camera calibration tool adopts the binocular camera calibration program provided by the MATLAB platform;
performing camera calibration on the checkerboard image pairs with the camera calibration tool, and screening out checkerboard images whose reprojection error is greater than 0.1, to obtain the camera calibration parameters;
the camera calibration parameters comprise the intrinsic parameters of the camera, including the focal length (fx, fy) and the imaging origin (cx, cy); the extrinsic parameters of the camera, including the rotation matrix R and the translation vector T; and the distortion parameters of the camera, including radial distortion (K1, K2, K3) and tangential distortion (P1, P2).
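The reprojection-error screening step described above can be sketched in numpy; the function names, the per-image error aggregation (mean Euclidean distance over corners), and the example values are illustrative assumptions, not the patent's exact implementation:

```python
import numpy as np

REPROJ_THRESHOLD = 0.1  # images with error above this are discarded, per the method

def reprojection_error(projected, detected):
    """Mean Euclidean distance between reprojected and detected corner points."""
    projected = np.asarray(projected, dtype=float)
    detected = np.asarray(detected, dtype=float)
    return float(np.mean(np.linalg.norm(projected - detected, axis=-1)))

def screen_image_pairs(errors, threshold=REPROJ_THRESHOLD):
    """Return indices of checkerboard image pairs kept for calibration."""
    return [i for i, e in enumerate(errors) if e <= threshold]
```

In practice the per-image errors would come from the calibration tool itself; the screening is then a simple threshold filter over those values.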
In one possible implementation, the correcting the pair of live-action stereoscopic images by the camera calibration parameter so that two images in the pair of live-action stereoscopic images are located on the same plane and parallel to each other includes:
and based on the camera calibration parameters, using a rectifyStereoImageS () function in Opencv to eliminate rotation and projection distortion between the live-action stereoscopic image pair, so that two images in the live-action stereoscopic image pair are aligned in the horizontal direction.
In one possible implementation, the stereo matching network includes: the device comprises a feature extraction unit, a feature conversion and feature fusion unit, a similarity calculation unit and a parallax iteration update unit;
the feature extraction unit is used for respectively extracting initial features of different scales from the corrected live-action stereo image pairs, and then acquiring a first feature map and a second feature map by using a residual dense block RDB;
the feature conversion and feature fusion unit is used for introducing the position encoding and attention mechanisms of the Transformer, converting the first and second feature maps into first and second features that depend on context and position, and then realizing feature fusion across different scales through the fusion block FB to obtain first and second cross-scale features;
the similarity calculation unit is used for performing feature similarity calculation on the first and second trans-scale features through inner product operation to construct a feature correlation body and further construct a multi-layer correlation pyramid;
the disparity iterative update unit is used for introducing the gated recurrent unit GRU to perform iterative disparity updates to obtain the final disparity map.
In one possible implementation manner, the extracting initial features of different scales from the corrected pair of live-action stereo images respectively, and then obtaining the first and second feature maps by using the residual dense block RDB includes:
extracting initial features of different scales from the corrected live-action stereo image pair by using convolution layers with different convolution kernel sizes;
using different numbers of RDBs for the initial features of different scales to obtain a first feature map $F_L^i$ and a second feature map $F_R^i$; the formulas for the first and second feature maps are:

$F_L^i = C_{RDB}^i\big(Conv_{(2i+1)\times(2i+1)}(I_L)\big), \quad F_R^i = C_{RDB}^i\big(Conv_{(2i+1)\times(2i+1)}(I_R)\big)$

where $C_{RDB}^i$ denotes the operation of the cascaded RDBs on the i-th branch (for i = 1, 2, 3 the number of RDBs is 3, 2 and 1, respectively), $Conv_{(2i+1)\times(2i+1)}$ denotes the first convolution layer on the i-th branch with kernel size (2i+1) × (2i+1), and $I_L$, $I_R$ are the corrected left and right images.
In one possible implementation, the introducing of the position encoding and attention mechanisms of the Transformer, converting the first and second feature maps into first and second features that depend on context and position, and then realizing feature fusion across different scales through the fusion block FB to obtain first and second cross-scale features, includes:

adding position encoding and attention mechanisms to the first feature map $F_L^i$ and the second feature map $F_R^i$ to form the first feature $\tilde{F}_L^i$ and the second feature $\tilde{F}_R^i$;

converting the first feature $\tilde{F}_L^i$ and the second feature $\tilde{F}_R^i$ into the first cross-scale feature $\hat{F}_L^i$ and the second cross-scale feature $\hat{F}_R^i$;

the conversion formula and the fusion formula of the feature conversion and feature fusion unit are, respectively:

$\tilde{F}^i = T^i(F^i), \quad \hat{F}^i = FB^i(\tilde{F}^1, \tilde{F}^2, \tilde{F}^3)$

where $T^i$ denotes the operation of adding the attention mechanism and position encoding on the i-th branch, and $FB^i$ denotes the operation of the i-th branch FB.
In one possible implementation manner, the calculating the feature similarity of the first and second trans-scale features through inner product operation, constructing a feature correlation body, and further constructing a multi-layer correlation pyramid, including:
performing feature similarity calculation on the first cross-scale feature $\hat{F}_L$ and the second cross-scale feature $\hat{F}_R$ through an inner product operation to construct the feature correlation volume C, with the formula:

$C(i, j, k) = \sum_{h} \hat{F}_L(h, i, j) \cdot \hat{F}_R(h, i, k)$

where h indexes the feature channels, i the image row, and j, k the column positions in the left and right feature maps;

constructing a multi-layer correlation pyramid based on C through a Concat operation, wherein the correlation pyramid provides displacement information about pixels;
using the current disparity estimate, pixels are retrieved from each level of the correlation pyramid to generate a feature map using RAFT lookup rules in a correlation lookup.
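The inner-product correlation volume and the pyramid construction can be sketched with numpy; `correlation_volume` and `correlation_pyramid` are hypothetical names, and average pooling along the disparity axis is assumed for the pyramid levels (RAFT-style), which the patent does not spell out:

```python
import numpy as np

def correlation_volume(f_left, f_right):
    """All-pairs correlation along each scanline.

    f_left, f_right: (C, H, W) rectified feature maps.
    Returns C of shape (H, W, W) with C[i, j, k] = <f_left[:, i, j], f_right[:, i, k]>.
    """
    return np.einsum('chj,chk->hjk', f_left, f_right)

def correlation_pyramid(corr, levels=4):
    """Average-pool the last (matching-position) dimension by a factor of 2 per level."""
    pyramid = [corr]
    for _ in range(levels - 1):
        c = pyramid[-1]
        w = c.shape[-1] - c.shape[-1] % 2          # drop an odd tail before pooling
        c = c[..., :w].reshape(*c.shape[:-1], w // 2, 2).mean(axis=-1)
        pyramid.append(c)
    return pyramid
```

Because the images are rectified, matches are sought only along the same row, so the volume is (H, W, W) rather than a full 4-D cost volume.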
In a possible implementation manner, the introducing the gate control loop unit GRU to perform parallax iterative update to obtain a final parallax map includes:
the GRU starts from an initial disparity d_0 = 0 and estimates the disparity sequence {d_1, d_2, ..., d_n}; an update direction Δd is generated in each iteration, fed back into the correlation pyramid in the next iteration to perform a lookup, and applied to the current disparity estimate: d_{k+1} = d_k + Δd; the last update yields the final disparity;
the parallax estimation is to input the first feature map, the feature similarity and the parallax estimated last time into a GRU, the GRU updates a hidden state, and a new parallax is estimated through the updated hidden state;
calculating feature similarity at each scale by cascading the similarity calculation and the disparity update, combining the context features, and finally performing the correlation lookup and update at the highest of the multiple GRU resolutions;

the GRUs are cascaded by using each other's hidden states as input, feature mapping is performed at different resolutions, information is passed between GRUs of adjacent resolutions by an upsampling operation, the lookup is performed from the correlation pyramid at the highest resolution, and then the next disparity estimate is made;
the whole network is trained using an L_1 loss function; the target loss function of the disparity estimation is:

$L = \sum_{i=1}^{n} \gamma^{\,n-i} \left\lVert d_{gt} - d_i \right\rVert_1$

where d_i belongs to the disparity prediction sequence {d_1, d_2, ..., d_n} of the iterative process, d_gt is the ground-truth disparity map, and γ is set to 0.9 so that the weight increases exponentially toward later iterations.
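The iterative update rule d_{k+1} = d_k + Δd and the exponentially weighted L_1 sequence loss with γ = 0.9 can be sketched in numpy; the function names are illustrative, and the GRU that produces each Δd is abstracted away as a given list of updates:

```python
import numpy as np

def iterate_disparity(d0, updates):
    """Apply d_{k+1} = d_k + delta_d_k; returns the sequence {d_1, ..., d_n}."""
    seq, d = [], d0
    for delta in updates:
        d = d + delta
        seq.append(d)
    return seq

def sequence_l1_loss(d_seq, d_gt, gamma=0.9):
    """L = sum_i gamma^(n-i) * ||d_gt - d_i||_1; later iterations weigh more."""
    n = len(d_seq)
    return float(sum(gamma ** (n - i) * np.abs(d_gt - d).sum()
                     for i, d in enumerate(d_seq, start=1)))
```

With γ < 1 the weight γ^(n-i) is largest at i = n, so the final disparity estimate dominates the loss, matching the "exponentially increasing" weighting described above.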
In one possible implementation manner, the calculating scene depth information of the disparity map to obtain a measurement distance includes:
the depth estimation formula, from similar triangles, is:

$\frac{B - (x_L - x_R)}{B} = \frac{Z - f}{Z}$

since the disparity d = x_L − x_R, the formula can be rearranged as:

$Z\,d = f\,B$

so the depth distance Z is:

$Z = \frac{f B}{d}$

where f is the focal length of the binocular camera, B is the baseline distance between the two cameras, and d is the disparity of the object point between the left and right cameras.
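The conversion Z = fB/d can be sketched directly; the function name, the handling of zero disparity (returned as infinite depth), and the example numbers are illustrative assumptions:

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Z = f * B / d, with disparity and focal length in pixels and
    baseline in metres, so Z comes out in metres; d <= eps maps to inf."""
    d = np.asarray(disparity, dtype=float)
    z = np.full_like(d, np.inf)
    valid = d > eps
    z[valid] = focal_px * baseline_m / d[valid]
    return z
```

Applied element-wise to a disparity map, this yields the depth map from which the measured distance is read off.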
In a second aspect, an embodiment of the present application provides a binocular ranging system based on a depth stereo matching algorithm, including:
the image acquisition module is used for acquiring checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances;
the camera calibration module is used for carrying out error analysis on the checkerboard image pairs to obtain camera calibration parameters;
the stereoscopic correction module is used for correcting the real stereoscopic image pair through the camera calibration parameters so that two images in the real stereoscopic image pair are positioned on the same plane and are parallel to each other;
the stereoscopic matching module is used for obtaining a parallax image of the corrected live-action stereoscopic image pair by constructing a stereoscopic matching network;
and the depth calculation module is used for calculating scene depth information of the parallax map and obtaining a measurement distance.
In the embodiments of the present application, the stereo matching stage performs multi-scale feature fusion separately on the two images of the input live-action stereo image pair and introduces self-attention and cross-attention mechanisms, so that multi-scale single-view information can be captured and semantic information between the two images obtained, effectively improving the matching quality of weak-texture regions; during disparity computation, the gated recurrent unit performs iterative disparity updates, yielding a more accurate disparity map and thereby achieving high-precision ranging.
Drawings
Fig. 1 is a schematic diagram of a binocular ranging method based on a depth stereo matching algorithm according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an RDB module structure according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a stereo matching network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a cross-scale connection provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of similarity calculation according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a related search structure according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a parallax iterative updating unit provided in an embodiment of the present application;
fig. 8 is a schematic diagram of a binocular distance measurement system based on a depth stereo matching algorithm according to an embodiment of the present application.
Detailed Description
The present invention is described below with reference to the drawings and the detailed description.
Fig. 1 is a schematic flow chart of a binocular ranging method based on a depth stereo matching algorithm provided in an embodiment of the present application, referring to fig. 1, the binocular ranging method based on the depth stereo matching algorithm in the embodiment includes:
s101, checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances are obtained.
In this embodiment, pictures of real scenes are captured with a Hikvision (Haikang) IC6 vision camera; the optical axes of the left and right cameras are kept parallel during setup to reduce error. The checkerboard calibration board is made of glass and contains 11 × 8 corner points, with a square size of 45 mm and an accuracy error of ±0.01 mm; the calibration board should cover the whole image area as much as possible to obtain more accurate camera parameters.
The training sample stereo image pair of the stereo matching network is a Scene Flow data set, the migration sample stereo image pair is a live-action stereo image pair photographed by a binocular camera, and all stereo image pairs are corrected, namely, only have offset in the horizontal direction and have no offset in the vertical direction, so as to improve the matching efficiency of the feature points.
S102, performing error analysis on the checkerboard image pair to obtain camera calibration parameters.
Camera calibration analyzes the errors of the checkerboard image pairs with a camera calibration tool; after calibration, the obtained camera calibration parameters are stored for later use. In this embodiment, the calibration tool adopts the binocular camera calibration program of the MATLAB platform, which derives from Zhengyou Zhang's checkerboard calibration method. The checkerboard image pairs are imported, the checkerboard size is entered, camera calibration is performed, and checkerboard images with a reprojection error greater than 0.1 are screened out during calibration, yielding the camera calibration parameters.
The intrinsic parameters of the camera refer to parameters related to the characteristics of the camera, including the focal length (fx, fy) of the camera and the imaging origin (cx, cy); the external parameters of the camera refer to parameters in the world coordinate system, including a rotation matrix R and a translation vector T; the distortion parameters of the camera include radial distortion (K1, K2, K3) and tangential distortion (P1, P2).
S103, correcting the real stereoscopic image pair through the camera calibration parameters, so that two images in the real stereoscopic image pair are positioned on the same plane and are parallel to each other.
Taking the camera calibration parameters as input, the rectifyStereoImages() function in OpenCV is used to eliminate rotation and projective distortion between the live-action stereo image pair, so that the two images of the pair are aligned in the horizontal direction.
And S104, obtaining a parallax image of the corrected live-action stereoscopic image pair by constructing a stereoscopic matching network.
The three-dimensional matching network is constructed by the following steps: the device comprises a feature extraction unit, a feature conversion and feature fusion unit, a similarity calculation unit and a parallax iteration update unit.
Feature extraction unit: considering that view features are rich in multi-scale information, branch structures of different scales are adopted. First, initial features at three scales are extracted from the corrected live-action stereo image pair using convolution layers with kernel sizes of 3×3, 5×5 and 7×7; then different numbers of RDB modules are applied to the initial features of each scale to obtain the first feature map $F_L^i$ and the second feature map $F_R^i$, so that different receptive-field features are obtained at different scales. Referring to FIG. 2, the RDB module consists of four convolution layers with kernel size 3×3 and LReLU activation functions, with a channel growth rate of 16. In the RDB module, each layer takes as input the output of the previous layer together with the features of all preceding layers; the output is a convolved feature, and finally the features of all layers are connected by a concat operation to form the final output. The formulas for the first and second feature maps are:

$F_L^i = C_{RDB}^i\big(Conv_{(2i+1)\times(2i+1)}(I_L)\big), \quad F_R^i = C_{RDB}^i\big(Conv_{(2i+1)\times(2i+1)}(I_R)\big)$

where $C_{RDB}^i$ denotes the operation of the cascaded RDBs on the i-th branch (for i = 1, 2, 3 the number of RDBs is 3, 2 and 1, respectively), and $Conv_{(2i+1)\times(2i+1)}$ denotes the first convolution layer on the i-th branch with kernel size (2i+1) × (2i+1).
Feature conversion and feature fusion unit: considering that features at different scales have different receptive fields, and that large-scale features guide the retrieval of small-scale features, this unit realizes cross-scale interaction. Referring to FIG. 3, position encoding (giving each element unique position information in sinusoidal form) and attention mechanisms are added to the first feature map $F_L^i$ and the second feature map $F_R^i$ to form the first feature $\tilde{F}_L^i$ and the second feature $\tilde{F}_R^i$. Referring to FIG. 4, the first feature $\tilde{F}_L^i$ and the second feature $\tilde{F}_R^i$ are further converted into the first cross-scale feature $\hat{F}_L^i$ and the second cross-scale feature $\hat{F}_R^i$. In the feature fusion unit, the FB uses a channel attention module to better fuse features of inconsistent semantics and scale, finally yielding the cross-scale connected features $\hat{F}_L^i$ and $\hat{F}_R^i$. The conversion formula and the fusion formula of the feature conversion and feature fusion unit are, respectively:

$\tilde{F}^i = T^i(F^i), \quad \hat{F}^i = FB^i(\tilde{F}^1, \tilde{F}^2, \tilde{F}^3)$

where $T^i$ denotes the operation of adding the attention mechanism and position encoding on the i-th branch, and $FB^i$ denotes the operation of the i-th branch FB.
Similarity calculation unit: referring to FIG. 5, feature similarity between the first cross-scale feature $\hat{F}_L$ and the second cross-scale feature $\hat{F}_R$ is computed through an inner product operation to construct a complete feature correlation volume (Correlation volume, C), with the formula:

$C(i, j, k) = \sum_{h} \hat{F}_L(h, i, j) \cdot \hat{F}_R(h, i, k)$

where h indexes the feature channels, i the image row, and j, k the column positions in the left and right feature maps.
the method is characterized in that a multi-layer correlation pyramid is further constructed through a Concat operation based on C, displacement information about pixels is provided in the correlation pyramid, and pixels are searched from each layer of the correlation pyramid to generate a feature map by using current parallax estimation in correlation lookup through RAFT lookup rules. Referring to fig. 6, given the current disparity estimation (d 1 ,d 2 ) F is to F L The pixel x= (u, v) in (b) is mapped to F R ,F R The pixel x' in (a) is represented by the following formula:
x'={x+dx,||dx|| 1 ≤R}
where x' is the search offset and R is a constant, representing the search range. Searching from all levels of the related pyramid using search grid x', indexing each C k And (3) using linear interpolation in the layer process, and finally connecting the search value of each level into the feature map through a Concat operation.
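A minimal sketch of the per-level lookup with linear interpolation, under the assumption that each pixel indexes a 1-D correlation row and that out-of-range samples are clamped to the row boundary; the function name and the clamping policy are illustrative:

```python
import numpy as np

def lookup_correlation(corr_row, disparity, radius):
    """Sample a 1-D correlation row at positions disparity + dx, |dx| <= radius,
    using linear interpolation between the two nearest integer positions."""
    offsets = np.arange(-radius, radius + 1)
    pos = np.clip(disparity + offsets, 0.0, len(corr_row) - 1.0)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, len(corr_row) - 1)
    frac = pos - lo
    return (1.0 - frac) * corr_row[lo] + frac * corr_row[hi]
```

Running this against every pyramid level and concatenating the results gives the 2·R+1 correlation samples per level that feed the update unit.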
Disparity iterative update unit: referring to FIG. 7, the gated recurrent unit (Gated Recurrent Unit, GRU) starts from an initial disparity d_0 = 0 and estimates the disparity sequence {d_1, d_2, ..., d_n}. An update direction Δd is generated in each iteration, fed back into the correlation pyramid in the next iteration to perform a lookup, and applied to the current disparity estimate: d_{k+1} = d_k + Δd; the last update yields the final disparity, and the disparities of all pixels are combined and visualized as a disparity map.

In the disparity estimation, the first feature map, the feature similarity, and the previously estimated disparity are input into the GRU; the GRU updates its hidden state and estimates a new disparity from the updated hidden state. If the update strategy operates only at a fixed resolution, the receptive field is limited, too little semantic information is obtained, and details of weak-texture regions can be neither described nor predicted well; therefore, a cascade of similarity calculation and disparity update is used to compute feature similarity at each scale, the context features are combined, and the correlation lookup and update are finally performed at the highest of the multiple GRU resolutions. The GRUs are cascaded by using each other's hidden states as input, feature mapping is performed at different resolutions, information is passed between GRUs of adjacent resolutions by an upsampling operation, the lookup is performed from the correlation pyramid at the highest resolution, and then the next disparity estimate is made.

For end-to-end supervised training of the model, the whole network is trained with an L_1 loss function; the target loss function of the disparity estimation is:

$L = \sum_{i=1}^{n} \gamma^{\,n-i} \left\lVert d_{gt} - d_i \right\rVert_1$

where d_i belongs to the disparity prediction sequence {d_1, d_2, ..., d_n} of the iterative process, d_gt is the ground-truth disparity map, and γ is set to 0.9 so that the weight increases exponentially toward later iterations.
S105, calculating scene depth information of the parallax map, and obtaining a measurement distance.
Based on the disparity map, scene depth information is calculated according to the triangle similarity principle to obtain the measured distance. The depth estimation formula is:

$\frac{B - (x_L - x_R)}{B} = \frac{Z - f}{Z}$

Since the disparity d = x_L − x_R, the formula can be rearranged as:

$Z\,d = f\,B$

so the depth distance Z is:

$Z = \frac{f B}{d}$

where f denotes the focal length of the binocular camera, B denotes the baseline distance between the two cameras, and d is the disparity of the object point between the left and right cameras. Accordingly, the depth distance Z depends only on the focal length f, the baseline distance B, and the disparity d. Since the camera lenses commonly used in binocular vision ranging are fixed-focus, the focal length f is fixed; likewise, the baseline distance B between the left and right cameras is constant once the binocular camera is mounted. Therefore, after the disparity value d of each point is calculated, the depth distance Z can be obtained, completing the task of binocular vision ranging.
Corresponding to the binocular ranging method based on the depth stereo matching algorithm provided in the above embodiment, the present application also provides an embodiment of a binocular ranging system based on the depth stereo matching algorithm.
Referring to fig. 8, a binocular ranging system 20 based on a depth stereo matching algorithm in an embodiment of the present application includes:
the image acquisition module 201 is configured to acquire checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances;
the camera calibration module 202 is configured to perform error analysis on the checkerboard image pair to obtain a camera calibration parameter;
the stereo correction module 203 is configured to correct the pair of live-action stereo images through the camera calibration parameter, so that two images in the pair of live-action stereo images are located on the same plane and are parallel to each other;
a stereo matching module 204, configured to obtain a parallax map of the corrected live-action stereo image pair by constructing a stereo matching network;
the depth calculation module 205 is configured to perform scene depth information calculation on the disparity map, and obtain a measurement distance.
In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between objects and indicates three possible relationships: for example, "A and/or B" may mean that A exists alone, that A and B both exist, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b and c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or plural.
The foregoing is merely specific embodiments of the present application, and any person skilled in the art may easily conceive of changes or substitutions within the technical scope of the present application, which should be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A binocular distance measurement method based on a depth stereo matching algorithm, characterized by comprising:
acquiring checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances;
performing error analysis on the checkerboard image pair to obtain camera calibration parameters;
correcting the live-action stereoscopic image pair through the camera calibration parameters so that two images in the live-action stereoscopic image pair are positioned on the same plane and are parallel to each other;
obtaining a parallax image of the corrected live-action stereoscopic image pair by constructing a stereoscopic matching network;
and calculating scene depth information of the parallax map to obtain a measurement distance.
2. The binocular ranging method based on the depth stereo matching algorithm according to claim 1, wherein the performing error analysis on the checkerboard image pair to obtain camera calibration parameters comprises:
determining a camera calibration tool, wherein the camera calibration tool adopts the binocular camera calibration program provided by the MATLAB platform;
according to the camera calibration tool, carrying out camera calibration on the checkerboard image pairs, and screening out the checkerboard images with the re-projection errors larger than 0.1 to obtain the camera calibration parameters;
the camera calibration parameters include the intrinsic parameters of the camera, comprising the focal lengths (fx, fy) and the imaging origin (cx, cy); the extrinsic parameters of the camera, comprising the rotation matrix R and the translation vector T; and the distortion parameters of the camera, comprising the radial distortion coefficients (K1, K2, K3) and the tangential distortion coefficients (P1, P2).
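The screening step of claim 2 can be sketched independently of the calibration tool itself; a minimal illustration assuming per-image-pair mean re-projection errors have already been computed (the threshold 0.1 comes from the claim; the error values below are hypothetical):

```python
def screen_calibration_images(errors, threshold=0.1):
    """Return the indices of checkerboard image pairs to keep:
    those whose mean re-projection error does not exceed the threshold."""
    return [i for i, e in enumerate(errors) if e <= threshold]

# Hypothetical per-image-pair mean re-projection errors.
errors = [0.04, 0.15, 0.08, 0.27, 0.06]
kept = screen_calibration_images(errors)
print(kept)  # [0, 2, 4] — pairs 1 and 3 are screened out
```

Calibration is then re-run on the surviving image pairs to obtain the final parameters.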
3. The binocular ranging method based on the depth stereo matching algorithm according to claim 1, wherein the correcting the pair of live-action stereo images by the camera calibration parameter such that two images of the pair of live-action stereo images are located on the same plane and parallel to each other comprises:
based on the camera calibration parameters, eliminating rotation and projection distortion between the images of the live-action stereoscopic image pair by using the rectifyStereoImages() function of OpenCV, so that the two images of the live-action stereoscopic image pair are aligned in the horizontal direction.
4. The binocular ranging method based on the depth stereo matching algorithm according to claim 1, wherein the stereo matching network comprises a feature extraction unit, a feature conversion and feature fusion unit, a similarity calculation unit and a parallax iterative update unit;
the feature extraction unit is used for extracting initial features of different scales from the corrected live-action stereo image pair, and then acquiring a first feature map and a second feature map by using residual dense blocks (RDBs);
the feature conversion and feature fusion unit is used for introducing the position encoding and attention mechanism of the Transformer algorithm to convert the first and second feature maps into first and second context- and position-dependent features, and then fusing features of different scales through fusion blocks (FBs) to obtain first and second cross-scale features;
the similarity calculation unit is used for performing feature similarity calculation on the first and second cross-scale features through an inner product operation to construct a feature correlation volume and, from it, a multi-layer correlation pyramid;
the parallax iterative update unit is used for introducing a gated recurrent unit (GRU) to perform parallax iterative updating to obtain a final parallax map.
5. The binocular ranging method based on the depth stereo matching algorithm according to claim 4, wherein the extracting of initial features of different scales from the corrected live-action stereo image pair and the subsequent acquisition of the first and second feature maps using residual dense blocks (RDBs) comprises:
extracting initial features of different scales from the corrected live-action stereo image pair by using convolution layers with different convolution kernel sizes;
applying different numbers of RDBs to the initial features of each scale to obtain the first feature map F_i^1 and the second feature map F_i^2; the formula for acquiring the first and second feature maps is:

F_i^{1,2} = RDB_i(Conv_{(2i+1)×(2i+1)}(I^{1,2}))

where I^1 and I^2 denote the corrected left and right images, RDB_i(·) denotes the operation of the cascaded RDBs on the i-th branch (for i = 1, 2, 3 the number of RDBs is 3, 2 and 1, respectively), and Conv_{(2i+1)×(2i+1)} denotes the first convolution layer with convolution kernel size (2i+1)×(2i+1) on the i-th branch.
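A minimal NumPy sketch of the multi-branch idea in claim 5: each branch i applies a convolution with kernel size (2i+1)×(2i+1), i.e. 3×3, 5×5 and 7×7. The toy single-channel convolution and averaging kernels below only illustrate the kernel-size scheme; the RDB cascades are omitted:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive single-channel 2-D convolution with 'valid' padding."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel)
    return out

img = np.random.default_rng(0).normal(size=(32, 32))
branches = {}
for i in (1, 2, 3):                             # kernel sizes 3x3, 5x5, 7x7
    k = 2 * i + 1
    kernel = np.full((k, k), 1.0 / (k * k))     # stand-in averaging kernel
    branches[i] = conv2d_valid(img, kernel)

print({i: f.shape for i, f in branches.items()})
# {1: (30, 30), 2: (28, 28), 3: (26, 26)}
```

Larger kernels shrink the valid output more, which is why real networks pad; the point here is just the per-branch kernel-size progression.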
6. The binocular ranging method based on the depth stereo matching algorithm according to claim 5, wherein the introducing of the position encoding and attention mechanism of the Transformer algorithm to convert the first and second feature maps into first and second context- and position-dependent features, and the subsequent fusion of features of different scales through fusion blocks (FBs) to obtain first and second cross-scale features, comprises:
adding position encoding and an attention mechanism to the first feature map F_i^1 and the second feature map F_i^2 to form the first feature T_i^1 and the second feature T_i^2;
converting, through the fusion blocks, the first feature T_i^1 and the second feature T_i^2 into the first cross-scale feature G_i^1 and the second cross-scale feature G_i^2;
the conversion formula and the fusion formula of the feature conversion and feature fusion unit are, respectively:

T_i^{1,2} = A_i(F_i^{1,2}),    G_i^{1,2} = FB_i(T_i^{1,2})

where A_i(·) denotes the operation of adding the attention mechanism and position encoding on the i-th branch, and FB_i(·) denotes the operation of the fusion block on the i-th branch.
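The position-encoding and attention step of claim 6 can be illustrated with a minimal NumPy sketch. The sinusoidal encoding and single-head scaled dot-product attention below are the standard Transformer components and stand in for the (unspecified) per-branch operators of the claim; the feature map is flattened to a token sequence for clarity:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """Standard Transformer sinusoidal position encoding, (n_pos, d_model)."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

# A feature map flattened to (n_tokens, channels), with position encoding added.
feat = np.random.default_rng(0).normal(size=(16, 8))
tokens = feat + sinusoidal_pe(16, 8)
out = attention(tokens, tokens, tokens)            # self-attention
print(out.shape)  # (16, 8)
```

Cross-attention between the left and right token sequences (q from one view, k and v from the other) would follow the same pattern.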
7. The binocular ranging method according to claim 6, wherein the computing of the feature similarity of the first and second cross-scale features through an inner product operation to construct a feature correlation volume, and further a multi-layer correlation pyramid, comprises:
performing feature similarity calculation on the first cross-scale feature G^1 and the second cross-scale feature G^2 through an inner product operation to construct the feature correlation volume C, with the formula:

C(h, w, k) = Σ_c G^1(c, h, w) · G^2(c, h, k)

where c indexes the feature channels and w, k index positions within the same row h;
constructing the multi-layer correlation pyramid based on C through a Concat operation, the correlation pyramid providing displacement information about the pixels;
using the current disparity estimate, retrieving pixels from each level of the correlation pyramid to generate a feature map, following the RAFT lookup rules in a correlation lookup.
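The inner-product correlation of claim 7 can be sketched in NumPy. Following the RAFT-Stereo convention that the claim appears to adopt, each left-image pixel is correlated with every pixel in the same row of the right feature map, and coarser pyramid levels are obtained by pooling along the displacement axis (a sketch under these assumptions, not the patented implementation):

```python
import numpy as np

def correlation_volume(f1, f2):
    """f1, f2: (C, H, W) feature maps. Returns (H, W, W): the inner product
    of each left pixel with every right pixel in the same row."""
    return np.einsum('chw,chk->hwk', f1, f2)

def pyramid(corr, levels=3):
    """Average-pool the last (displacement) axis by 2 at each level."""
    pyr = [corr]
    for _ in range(levels - 1):
        c = pyr[-1]
        w = c.shape[-1] // 2
        pyr.append(c[..., :2 * w].reshape(*c.shape[:-1], w, 2).mean(-1))
    return pyr

rng = np.random.default_rng(0)
f1 = rng.normal(size=(8, 4, 16))   # C=8 channels, H=4 rows, W=16 columns
f2 = rng.normal(size=(8, 4, 16))
pyr = pyramid(correlation_volume(f1, f2))
print([p.shape for p in pyr])  # [(4, 16, 16), (4, 16, 8), (4, 16, 4)]
```

The lookup step then samples a small window of each level around the column indicated by the current disparity estimate.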
8. The binocular ranging method based on the depth stereo matching algorithm according to claim 7, wherein the introducing of the gated recurrent unit (GRU) to perform parallax iterative updating to obtain the final parallax map comprises:
the GRU starts from the initial disparity d_0 = 0 and estimates a disparity sequence {d_1, d_2, ..., d_n}; in each iteration an update direction Δd is generated, which is fed back into the correlation pyramid in the next iteration to perform a lookup and is applied to the current disparity estimate: d_{k+1} = d_k + Δd; the last update yields the final disparity;
in each disparity estimation, the first feature map, the feature similarity and the previously estimated disparity are input into the GRU; the GRU updates its hidden state, and a new disparity is estimated from the updated hidden state;
the similarity of each feature is calculated at the different scales by cascading similarity calculation and disparity updating, combined with the context features; the correlation lookup and update are finally performed by the GRU at the highest of the multiple resolutions;
the GRUs are cascaded by using one another's hidden states as input and perform feature mapping at the different resolutions; information is passed between GRUs of adjacent resolutions through an upsampling operation; the lookup from the correlation pyramid is performed at the highest resolution, after which the next disparity estimation is carried out;
the whole network is trained with an L1 loss function; the target loss function of the disparity estimation is:

L = Σ_{i=1}^{n} γ^{n−i} · ‖d_gt − d_i‖_1

where d_i belongs to the disparity prediction sequence {d_1, d_2, ..., d_n} of the iterative process, d_gt is the ground-truth disparity map, and γ is set to 0.9 so that the weight increases exponentially with the iteration index.
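The iterative scheme and loss of claim 8 can be illustrated without the network itself. A toy "GRU" that always moves the estimate a fixed fraction toward the truth still exhibits the update rule d_{k+1} = d_k + Δd and the exponentially weighted L1 loss with γ = 0.9; the update operator here is a hypothetical stand-in:

```python
import numpy as np

def iterate_disparity(d_gt, n_iters=5, step=0.5):
    """Toy stand-in for the GRU updater: each iteration emits an update
    direction delta and applies d_{k+1} = d_k + delta (here delta moves a
    fixed fraction toward the ground truth)."""
    d = np.zeros_like(d_gt)            # d_0 = 0
    seq = []
    for _ in range(n_iters):
        delta = step * (d_gt - d)      # hypothetical update direction
        d = d + delta
        seq.append(d.copy())
    return seq

def sequence_loss(seq, d_gt, gamma=0.9):
    """L = sum_i gamma^(n-i) * |d_gt - d_i|_1 (L1 taken as the pixel mean)."""
    n = len(seq)
    return sum(gamma ** (n - i) * np.abs(d_gt - d).mean()
               for i, d in enumerate(seq, start=1))

d_gt = np.full((4, 4), 8.0)
seq = iterate_disparity(d_gt)
print(round(float(sequence_loss(seq, d_gt)), 4))  # 5.5924
```

Later iterations carry larger weights (γ^0 = 1 for the last), so the loss pushes hardest on the final estimates, which is the intent of the exponential weighting.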
9. The binocular ranging method based on the depth stereo matching algorithm according to claim 1, wherein the calculating of scene depth information from the parallax map to obtain the measurement distance comprises:
from the similar triangles of the rectified binocular geometry, the depth estimation formula is:

(B − (x_L − x_R)) / B = (Z − f) / Z

since the parallax d = x_L − x_R, the formula can be rearranged to give the depth distance Z as:

Z = f · B / d

where f is the focal length of the binocular camera, B is the baseline distance between the two cameras, d is the parallax of the same object point in the left and right cameras, and x_L, x_R are the horizontal image coordinates of that point in the left and right views.
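The depth computation of claim 9 reduces to Z = f·B/d; a minimal sketch with the focal length in pixels and the baseline in metres (the rig values below are hypothetical):

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Z = f * B / d for a rectified binocular pair.
    f_px: focal length in pixels; baseline_m: baseline in metres;
    disparity_px: parallax d = x_L - x_R in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f_px * baseline_m / disparity_px

# Hypothetical rig: f = 700 px, B = 0.12 m; a point with 21 px disparity.
print(depth_from_disparity(700.0, 0.12, 21.0))  # 4.0 metres
```

Note the inverse relationship: nearer objects produce larger disparities, and depth resolution degrades quadratically with distance.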
10. A binocular ranging system based on a depth stereo matching algorithm, comprising:
the image acquisition module is used for acquiring checkerboard image pairs and live-action stereoscopic image pairs of the same scene at different angles and different distances;
the camera calibration module is used for carrying out error analysis on the checkerboard image pairs to obtain camera calibration parameters;
the stereoscopic correction module is used for correcting the live-action stereoscopic image pair through the camera calibration parameters so that the two images of the live-action stereoscopic image pair lie in the same plane and are parallel to each other;
the stereo matching module is used for obtaining a parallax map of the corrected live-action stereoscopic image pair by constructing a stereo matching network;
and the depth calculation module is used for calculating scene depth information from the parallax map to obtain a measurement distance.
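The five modules of claim 10 form a linear pipeline; a skeletal sketch in which every stage function is a placeholder stub (none of them are the patented implementations):

```python
def run_pipeline(acquire, calibrate, rectify, match, depth):
    """Chain the five modules: image acquisition -> camera calibration ->
    stereoscopic correction -> stereo matching -> depth calculation."""
    checkerboard_pairs, live_pairs = acquire()
    params = calibrate(checkerboard_pairs)
    rectified = rectify(live_pairs, params)
    disparity = match(rectified)
    return depth(disparity, params)

# Stub stages standing in for the real modules.
distance = run_pipeline(
    acquire=lambda: (["cb_pair"], ["live_pair"]),
    calibrate=lambda cb: {"f": 700.0, "B": 0.12},   # hypothetical parameters
    rectify=lambda live, p: live,
    match=lambda pairs: 21.0,                       # disparity in pixels
    depth=lambda d, p: p["f"] * p["B"] / d,
)
print(distance)  # 4.0
```

Keeping each stage behind a function boundary mirrors the module decomposition of the claim and makes each stage independently testable.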
CN202311617480.1A 2023-11-30 2023-11-30 Binocular distance measurement method and system based on depth stereo matching algorithm Pending CN117635730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311617480.1A CN117635730A (en) 2023-11-30 2023-11-30 Binocular distance measurement method and system based on depth stereo matching algorithm


Publications (1)

Publication Number Publication Date
CN117635730A true CN117635730A (en) 2024-03-01

Family

ID=90017664




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination