CN115018934A - Three-dimensional image depth detection method combining cross skeleton window and image pyramid - Google Patents


Info

Publication number
CN115018934A
Authority
CN
China
Prior art keywords: image, pixel, points, window, disparity map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210792911.7A
Other languages: Chinese (zh)
Other versions: CN115018934B (en)
Inventor
刘之涛 (Liu Zhitao)
夏越 (Xia Yue)
苏宏业 (Su Hongye)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202210792911.7A
Publication of CN115018934A
Application granted
Publication of CN115018934B
Legal status: Active

Classifications

    • G06T 7/80: Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06V 10/761: Proximity, similarity or dissimilarity measures
    • G06T 2207/20016: Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform
    • G06T 2207/20021: Dividing image into blocks, subimages or windows
    • G06T 2207/20228: Disparity calculation for image-based rendering
    • G06T 2207/30244: Camera pose

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a three-dimensional image depth detection method combining a cross-shaped skeleton window and an image pyramid. Left and right images are acquired in real time by a binocular camera, and an initial disparity map is calculated. Gradient information in the X and Y directions is computed to obtain a descriptor for each pixel. A cross skeleton window is constructed by combining color similarity and distance constraints on the grayscale image, and the shortest of its four branches is taken as the minimum arm length. An image pyramid is built by repeatedly Gaussian-downsampling the stereo image. Using the initial disparity map and the cross skeleton window, each lower layer in turn updates the adjacent higher layer; processing layer by layer upward yields the final disparity map, from which the depth of the stereo image is obtained. The cross skeleton window improves the disparity accuracy of the support points and is easy to parallelize. By combining the image pyramid and exploiting multi-scale information, more support points are obtained in weakly textured areas, which alleviates cross-edge connection. The method also narrows the disparity search range at high resolution, reduces mismatched support points, and effectively shortens the high-resolution processing time.

Description

Three-dimensional image depth detection method combining cross skeleton window and image pyramid
Technical Field
The invention relates to depth estimation and binocular stereo matching, and in particular to a stereo image depth detection method combining a cross skeleton window and an image pyramid.
Background
Depth estimation is one of the most important problems in computer vision. Depth estimation from a binocular stereo image pair is a core topic of low-level vision; its key task is to find the correspondence of spatial pixels between the two images, i.e. stereo matching, after which the three-dimensional geometry and depth of the scene are obtained using the imaging principle and triangulation. Stereo matching determines the pixel coordinates of a target point in the image pair and computes its disparity value; it is the most challenging and central research topic in binocular stereo vision systems. However, inconsistent illumination as well as texture-free, weakly textured and occluded regions greatly degrade matching accuracy, so designing a stereo matching algorithm that can effectively resist such interference remains a great challenge.
A conventional stereo matching algorithm generally consists of 4 steps: matching cost computation, cost aggregation, disparity computation and disparity refinement. Stereo matching algorithms mainly divide into global and local algorithms. Global algorithms typically solve an optimization problem by minimizing a global objective function containing a data term and a smoothing term. Many techniques have been devised to solve this NP-hard problem efficiently, but they remain computationally expensive and are rarely used in real-time systems. Local stereo matching algorithms use pixel information in a neighborhood as a constraint, so their computational cost is low and their efficiency is higher than that of global algorithms. However, local algorithms are susceptible to image noise, and matching ambiguities may occur in weakly textured or repetitively textured areas. Efficient, high-precision stereo matching plays a key role in many practical applications, such as robot navigation, autonomous driving and unmanned aerial vehicles, and efficiently obtaining high-precision disparity on large image pairs remains a challenge. ELAS is an efficient high-resolution stereo method that completes stereo matching or depth estimation in linear time. However, it is prone to cross-edge connection, is strongly affected by mismatched points, and its disparity accuracy in weakly textured regions needs improvement.
Disclosure of Invention
In view of the above, to solve the problems in the background art, the present invention provides CS-ELAS, a stereo image depth detection method combining a cross skeleton window and an image pyramid.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
step 1: acquiring left and right images in real time through a binocular camera, wherein the left and right images form an original stereo image, and calculating a disparity map between the left and right images according to a variation relation between a target and a background in the left and right images and taking the disparity map as an initial disparity map of the stereo image; the disparity map is a map representing the distance relationship between the object and the background in the left and right images.
Step 2: calculating gradient information of the stereoscopic image in the X direction and the Y direction, and obtaining the gradient information of each pixel point in the X direction and the Y direction, wherein the X direction is along the transverse direction of the image, and the Y direction is along the longitudinal direction of the image, so as to obtain a descriptor of the pixel point;
and step 3: constructing a cross skeleton window by combining the color similarity and the distance constraint of the gray level image, wherein the cross skeleton window is provided with four branches, the branch with the minimum length in the four branches is selected from the cross skeleton window, the length of the branch with the minimum length is used as the minimum arm length of the cross skeleton window, and the minimum arm length is used as the radius of a matching window for subsequently calculating the matching similarity of each pair of left and right image pairs;
and 4, step 4: performing multiple Gaussian downsampling on an original stereo image to construct an image pyramid;
and 5: and sequentially updating the stereo images of the adjacent higher layers of the image pyramid by using the initial disparity map and the cross skeleton window and the stereo images of the lower layers of the image pyramid, processing the image pyramid layers upwards, and obtaining a final disparity map so as to obtain the depth of the stereo images.
According to the invention, processing the image pyramid layer by layer with the disparity map and the cross skeleton window yields the depth of the stereo image quickly and accurately.
Step 2 specifically comprises the following steps:
A gradient window is established for the X direction and the Y direction, respectively:
The X-direction gradient window is 7×7; within it, X-direction gradient information of 24 pixel points is selected, comprising 22 points chosen uniformly along the eight directions around the central pixel of the window, plus the central pixel selected twice.
The Y-direction gradient window is 5×5; within it, Y-direction gradient information of 8 pixel points is selected, namely the four corner pixels and the four edge-center pixels of the window.
For each pixel, the X- and Y-direction gradient information obtained through these two windows is combined to form the descriptor of that pixel.
the processing for the stereo image mentioned in the method of the present invention is to perform the same processing for each of the left and right images in the stereo image.
In step 4, the image pyramid comprises the original stereo image and the stereo image after each Gaussian downsampling. The original stereo image has the highest resolution and forms the highest layer of the pyramid. Each Gaussian downsampling reduces the resolution and contributes one layer; the stereo image after the last downsampling has the lowest resolution and forms the lowest layer of the pyramid.
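A minimal sketch of the Gaussian downsampling of step 4, using a separable 5-tap binomial kernel and 2× decimation (the kernel choice and boundary handling are assumptions; the patent does not specify them):

```python
import numpy as np

def gaussian_downsample(img):
    """One pyramid level: blur with a separable 5-tap Gaussian approximation,
    then drop every other row and column."""
    k = np.array([1., 4., 6., 4., 1.]) / 16.0
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, blurred)
    return blurred[::2, ::2]

def build_pyramid(img, levels):
    """Layer list ordered low -> high resolution, matching the convention
    that s = 0 is the lowest-resolution layer."""
    layers = [img]
    for _ in range(levels - 1):
        layers.append(gaussian_downsample(layers[-1]))
    return layers[::-1]    # reverse so layers[0] has the lowest resolution

pyr = build_pyramid(np.random.rand(64, 64), levels=3)
```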
Step 5 specifically comprises the following steps:
Step 5.1: take the initial disparity map as the disparity map of the lowest layer of the image pyramid; process the stereo image at the lowest layer with this disparity map and the cross skeleton window to obtain the support point set of the lowest layer and the confidence of each support point, which serve as the initial support points and their confidences.
Step 5.2: for the stereo image of the current pyramid layer, perform Delaunay triangulation with the support points whose confidence exceeds a preset first confidence threshold as vertices, establishing a set of triangular disparity planes.
In the current disparity map, linearly interpolate all pixels inside each triangular disparity plane from the plane and its three vertex values to update the disparity map. Then process the updated disparity map with the cross skeleton window to obtain the matching similarity and confidence of each pixel in the stereo image; the confidences of all pixels form the confidence map of the stereo image at the current pyramid layer.
Step 5.3: from the disparity map of the current pyramid layer and the confidence map of the stereo image, obtain a higher-resolution disparity prior map and confidence prior map by nearest-neighbor interpolation. Select the pixels whose confidence in the confidence prior map exceeds a preset second confidence threshold as supplementary support points for the next higher pyramid layer, and assign their disparity values from the disparity prior map to the corresponding pixels of the updated disparity map, thereby obtaining the disparity map of the higher pyramid layer.
Step 5.4: process the stereo image of the higher pyramid layer with its disparity map and the cross skeleton window to obtain the support point set of that layer and the confidence of each support point.
Step 5.5: add the supplementary support points obtained in step 5.3 to the support point set obtained in step 5.4, forming their union.
Step 5.6: return to step 5.2 and repeat steps 5.2 to 5.5 iteratively until the highest pyramid layer, i.e. the highest resolution, is reached; its disparity map is the final disparity map, from which the depth of the stereo image is obtained.
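The coarse-to-fine loop of steps 5.2 to 5.6 can be outlined as follows; `refine` is a stand-in for the per-layer triangulation, interpolation and support-point update, which are not reproduced here, and a fixed 2× scale between consecutive layers is assumed:

```python
import numpy as np

def nn_upsample2x(a):
    """Nearest-neighbour 2x upsampling (the interpolation named in step 5.3)."""
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

def coarse_to_fine(num_layers, init_disp, refine):
    """Skeleton of steps 5.2-5.6: start from the lowest-resolution disparity
    map and lift it one pyramid layer at a time."""
    disp = init_disp
    for s in range(1, num_layers):
        prior = nn_upsample2x(disp)   # disparity prior for layer s (step 5.3)
        disp = refine(prior, s)       # steps 5.2, 5.4, 5.5 at layer s
    return disp

# Toy run with 3 layers; the stand-in refinement simply keeps the prior.
final = coarse_to_fine(3, np.zeros((4, 4)), lambda prior, s: prior)
```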
In step 5.2, the processing with the updated disparity map and the cross skeleton window to obtain the matching similarity and confidence of each pixel in the stereo image is specifically:
5.2.1: take one image of the stereo pair as the current image and the other as the reference image, and traverse each pixel of the current image as the current pixel as follows:
use the updated disparity map to find the pixel in the reference image corresponding to the current pixel's disparity; compute the matching similarity between the current pixel and that corresponding pixel from their descriptors, taking it as the matching similarity of the current pixel; then compute the confidence of the current pixel from its matching similarity.
5.2.2: interchange the current image and the reference image and repeat step 5.2.1, obtaining the matching similarity and confidence of every pixel of both images of the stereo pair.
Steps 5.1 and 5.4 both proceed as follows:
Either image of the stereo pair is taken as the current image and the other as the reference image.
Uniformly sample the current image along the horizontal and vertical coordinates with a fixed pixel step to obtain candidate support points, and traverse each candidate support point as follows:
In the current image, first establish a matching window neighborhood with the candidate support point as the central pixel and the minimum arm length of the cross skeleton window as the radius, and select 9 key points in this neighborhood: its four corner pixels, four edge-center pixels and central pixel.
Then use the disparity map to find, in the reference image, the pixel corresponding to each key point's disparity; compute the matching similarity between each key point and its corresponding pixel from their descriptors; and add up the matching similarities of all key points to obtain the matching similarity of the candidate support point.
Then compare the matching similarity of the candidate support point with a preset similarity threshold: if it is greater, retain the candidate as a support point; otherwise discard it and do not take it as a support point.
Traversing all candidate support points yields all support points, which form the support point set; the confidence of each support point is computed from its matching similarity.
The disparity correspondence is determined by the disparity relation in the disparity map.
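Steps 5.1/5.4 can be sketched as follows; `sim_at` stands in for the per-key-point descriptor matching against the disparity-shifted reference pixel, and all parameter values are illustrative:

```python
def support_points(sim_at, grid_step, shape, a, sim_threshold):
    """Sample candidate support points on a regular grid, score each by
    summing the matching similarity of the 9 key points of its matching
    window (radius a = minimum arm length), and keep the candidates whose
    score exceeds the threshold."""
    h, w = shape
    key_offsets = [(-a, -a), (-a, a), (a, -a), (a, a),   # four corners
                   (-a, 0), (a, 0), (0, -a), (0, a),     # four edge centres
                   (0, 0)]                               # centre
    kept = []
    for v in range(a, h - a, grid_step):
        for u in range(a, w - a, grid_step):
            sim = sum(sim_at(u + du, v + dv) for du, dv in key_offsets)
            if sim > sim_threshold:
                kept.append((u, v))
    return kept

# Toy run: a constant similarity of 1.0 per key point gives every candidate
# a score of 9, so all grid candidates pass a threshold of 8.
pts = support_points(lambda u, v: 1.0, grid_step=5, shape=(20, 20),
                     a=2, sim_threshold=8.0)
```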
In steps 5.1 and 5.4, when computing the matching similarity of a candidate support point centered at pixel (u_n, v_n), the matching similarities of all the key points are added: descriptors of 9 pixel points are selected in the (2a+1)×(2a+1) neighborhood of the candidate support point according to the following formula, and maximizing the similarity is equivalent to minimizing the cost function:

E(u_n, v_n, d_n) = Σ || f(u^(l), v^(l)) − f(u^(r), v^(r)) ||_1,  with u^(r) = u^(l) − d_n and v^(r) = v^(l),

where a is the minimum arm length of the cross skeleton window obtained in step 3; n is the pixel index of the image; u_n and v_n are the horizontal and vertical coordinates of the n-th pixel; d_n is the disparity value traversed for that pixel; E(u_n, v_n, d_n) is the matching cost of pixel (u_n, v_n) at disparity d_n; u^(l) and v^(l) are the horizontal and vertical coordinates of a selected key point in the current image l; u^(r) and v^(r) are the pixel coordinates of the corresponding key point in the reference image r; and ||·||_1 denotes the L1 norm.
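Assuming rectified images and dense per-pixel descriptor arrays (consistent with the cost function above, though the array layout here is an implementation choice), the L1 matching cost for one key point can be computed as:

```python
import numpy as np

def matching_cost(desc_l, desc_r, u, v, d):
    """L1 cost between key point (u, v) in the current (left) image and its
    disparity-shifted correspondence (u - d, v) in the reference (right)
    image; desc_l/desc_r are (H, W, 32) descriptor arrays."""
    return np.abs(desc_l[v, u] - desc_r[v, u - d]).sum()

H, W = 8, 16
desc_l = np.zeros((H, W, 32))
desc_r = np.zeros((H, W, 32))
desc_l[4, 10] = 1.0    # one descriptor differs by 1 in every channel
c = matching_cost(desc_l, desc_r, u=10, v=4, d=3)
```

Summing this cost over the 9 key points of a candidate's matching window gives E(u_n, v_n, d_n).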
In step 4, the image is downsampled to obtain images of different resolutions, constructing the image pyramid; the lowest-resolution image obtained by downsampling is taken as the layer s = 0.
The disparity map and confidence map of a lower-resolution pyramid layer are upsampled by nearest-neighbor interpolation and used as priors for the next higher-resolution layer.
First, the disparity values of the lower-resolution layer narrow the disparity search range when computing support points at higher resolution, reducing the computational cost of the higher-resolution layer. Second, for pixels whose confidence in the lower-resolution layer is higher than in the higher-resolution layer, the lower-resolution disparity value is taken as the disparity of that pixel in the higher-resolution layer, and the support point set is updated accordingly.
The concrete expression is as follows:
If P̂^(s)(u_m, v_m) > P^(s+1)(u_m, v_m), then D^(s+1)(u_m, v_m) = D̂^(s)(u_m, v_m) and (u_m, v_m) ∈ S,

where S denotes the support point set; m is the index of a support point in the set; (u_m, v_m) is the pixel coordinate of the m-th support point, u_m the abscissa and v_m the ordinate; s denotes the s-th layer of the image pyramid, with image resolution increasing as s increases; P̂^(s)(u_m, v_m) is the confidence value at the m-th support point pixel after the layer-s confidence map is upsampled once by nearest neighbor; P^(s+1)(u_m, v_m) is the confidence value at the m-th support point pixel of the layer-(s+1) image; D̂^(s)(u_m, v_m) is the disparity value at the m-th support point pixel after the layer-s disparity map is upsampled by nearest neighbor; and D^(s+1)(u_m, v_m) is the disparity value at the m-th support point pixel of the layer-(s+1) image.
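The update rule above can be sketched directly with NumPy; the array names are illustrative, and the 2× nearest-neighbor upsampling assumes consecutive layers differ by a factor of two:

```python
import numpy as np

def nn_up2x(a):
    return np.repeat(np.repeat(a, 2, axis=0), 2, axis=1)

def propagate_supports(conf_lo, disp_lo, conf_hi, disp_hi):
    """Wherever the upsampled layer-s confidence exceeds the layer-(s+1)
    confidence, copy the upsampled layer-s disparity into the layer-(s+1)
    disparity map and mark the pixel as a support point."""
    conf_up, disp_up = nn_up2x(conf_lo), nn_up2x(disp_lo)
    mask = conf_up > conf_hi
    disp_hi = disp_hi.copy()
    disp_hi[mask] = disp_up[mask]
    return disp_hi, np.argwhere(mask)   # updated map + new support pixels

conf_lo = np.array([[0.9]]); disp_lo = np.array([[5.0]])
conf_hi = np.full((2, 2), 0.5); disp_hi = np.zeros((2, 2))
new_disp, supports = propagate_supports(conf_lo, disp_lo, conf_hi, disp_hi)
```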
Weakly textured regions of the image have few support points and are prone to cross-edge connection, so obvious triangular regions appear in the disparity map. If a mismatched support point occurs, the error of the disparity values obtained by triangulation and linear interpolation increases markedly.
The invention adopts the cross skeleton window to improve the disparity accuracy of the support points, and the computation parallelizes easily. Combining the image pyramid and its multi-scale information yields more support points in weakly textured areas and alleviates cross-edge connection. The high-resolution disparity search range is also narrowed, effectively shortening the high-resolution processing time.
The invention has the beneficial effects that:
(1) The method constructs a cross skeleton window for each pixel and adaptively expands the window used when computing the matching similarity, which senses edge information better and improves the disparity accuracy of the support points. At the same time, the number of matching support points is reduced, and mismatched support points occur less often.
(2) The invention combines information of different resolutions coarse-to-fine through an image pyramid. First, the disparity map and confidence map obtained at low resolution are nearest-neighbor upsampled and used as the disparity-range prior and confidence prior when computing support points at high resolution, which narrows the disparity search range and reduces computation. At the same time, for pixels whose confidence is higher at lower resolution than at higher resolution, the low-resolution disparity value is used as the disparity of that pixel in the higher pyramid layer and the support point set is updated, so that more support points are obtained in weakly textured regions.
Drawings
FIG. 1 is a construction diagram of a descriptor of the present invention.
FIG. 2 is a graph comparing the numbers of support points according to the present invention.
FIG. 3 is a system flow diagram of the present invention.
Detailed Description
The invention will be further described with reference to the following figures and specific examples, but the scope of the invention is not limited thereto.
As shown in fig. 3, an embodiment of the present invention is as follows:
Step 1: acquire left and right images in real time with a binocular camera; the left and right images form the original stereo image. Calculate the disparity map between the left and right images from the displacement of the target relative to the background between the two images, and take it as the initial disparity map of the stereo image.
Step 2: filter the epipolar-rectified left and right stereo images to obtain gradient images in the X and Y directions, respectively. For each central pixel (u_n, v_n), select 24 X-direction gradient values in its 7×7 neighborhood and 8 Y-direction gradient values in its 5×5 neighborhood to construct the descriptor f(u_n, v_n) of the feature point; the selection is shown in FIG. 1. As FIG. 1 shows, gradient information is selected in every direction around the central pixel, so changes in different directions are sensed better and matching accuracy improves.
Step 3: construct a cross skeleton window for each pixel. Extend from the central pixel in the four directions up, down, left and right, with color similarity and distance constraints as the rules, obtaining the arm lengths in the four directions; take the minimum arm length as a. When computing the matching similarity of a central pixel (u_n, v_n), the descriptors of 9 pixel points selected in its (2a+1)×(2a+1) neighborhood are used to compute the similarity of the matching pair. The matching energy function is:
E(u_n, v_n, d_n) = Σ || f(u^(l), v^(l)) − f(u^(r), v^(r)) ||_1,  with u^(r) = u^(l) − d_n and v^(r) = v^(l),

where a is the minimum arm length of the cross skeleton window of the central pixel; n is the pixel index of the image, i.e. the n-th pixel; u_n and v_n are the horizontal and vertical coordinates of the n-th pixel; d_n is the disparity value traversed for that pixel; E(u_n, v_n, d_n) is the matching cost of pixel (u_n, v_n) at disparity d_n; u^(l) and v^(l) are the pixel coordinates of a key point selected near the central pixel in the current image l; u^(r) and v^(r) are the pixel coordinates of the corresponding key point in the reference image r; and ||·||_1 denotes the L1 norm.
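The arm-length extension of step 3 can be sketched as follows; the gray-level similarity threshold `tau` and the maximum arm length `max_arm` are illustrative parameters, since the text only names color-similarity and distance constraints:

```python
import numpy as np

def min_arm_length(gray, u, v, tau=10, max_arm=17):
    """Extend an arm up/down/left/right from (u, v) while the grey-level
    difference to the centre stays below tau, the arm stays shorter than
    max_arm, and the image border is not crossed; return the minimum of
    the four arm lengths (the radius a used for the matching window)."""
    h, w = gray.shape
    c = int(gray[v, u])
    arms = []
    for du, dv in [(0, -1), (0, 1), (-1, 0), (1, 0)]:
        n = 0
        while n + 1 < max_arm:
            uu, vv = u + du * (n + 1), v + dv * (n + 1)
            if not (0 <= uu < w and 0 <= vv < h):
                break
            if abs(int(gray[vv, uu]) - c) >= tau:
                break
            n += 1
        arms.append(n)
    return min(arms)

# On a uniform image the arms hit only the distance limit.
flat = np.full((40, 40), 128, dtype=np.uint8)
a = min_arm_length(flat, 20, 20)
```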
In a specific implementation, the parameters are kept consistent when computing arm lengths for images of different resolutions. Similarity maximization is converted into an energy-function minimization problem. Computing support points with the cross-skeleton adaptively expanded matching window was compared against a fixed window on all pictures of the Middlebury 2014 full-resolution training set; the numbers of support points are shown in FIG. 2, and the evaluation results are shown in Table 1:
Table 1. Comparison of support points
Algorithm               bad1.0    avgerr
Fixed window            0.1204    4.6272
Cross skeleton window   0.0916    3.7828
Here bad1.0 is the percentage of pixels, over the full image, whose experimental disparity differs from the true disparity by more than 1.0, and avgerr is the mean absolute difference between the experimental and true disparity values. In FIG. 2 the horizontal axis is the image name in the data set and the vertical axis is the number of support points. As Table 1 and FIG. 2 show, compared with a fixed-size matching window, the cross skeleton window yields slightly fewer support points, a lower error rate and higher disparity accuracy.
Step 5: upsample the low-resolution disparity map and confidence map by nearest-neighbor interpolation and use them as high-resolution priors. First, the disparity values obtained from the lower-resolution image narrow the disparity search range when computing support points on the higher-resolution image, reducing its computational cost. Second, for pixels whose confidence at lower resolution is higher than at higher resolution, the low-resolution disparity value is taken as that pixel's disparity at high resolution, and the support point set is updated.
In a specific implementation, the disparity search range R_m^{s+1} of a higher-resolution (layer s+1) candidate support point (u_m, v_m) is calculated as:

R_m^{s+1} = [max(D_min, d̂_m^s − t), min(D_max, d̂_m^s + t)] when the upsampled confidence ĉ_m^s is reliable, and [D_min, D_max] otherwise,

wherein [D_min, D_max] is the disparity search range set in advance for disparity calculation, D_min being the set minimum disparity and D_max the set maximum disparity, set to [0, 790] in the experiments; t is a set constant, set to 10 in the experiments; s denotes the s-th layer of the image pyramid, with image resolution increasing as s increases; m is the index of a support point in the support point set, and (u_m, v_m) are the pixel coordinates of the m-th support point, u_m being the abscissa and v_m the ordinate; ĉ_m^s denotes the confidence value at the m-th support point pixel after nearest-neighbor upsampling of the layer-s confidence map, and d̂_m^s denotes the disparity value at the m-th support point pixel after nearest-neighbor upsampling of the layer-s disparity map.
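The range computation described here can be sketched as follows; the confidence-validity threshold `c_valid` is a hypothetical parameter not given in the text:

```python
D_MIN, D_MAX = 0, 790   # preset disparity search range from the experiments
T = 10                  # preset constant t from the experiments

def search_range(d_up, c_up, c_valid=0.5):
    """Disparity search range for a layer-(s+1) candidate support point,
    given the nearest-neighbor-upsampled layer-s disparity d_up and
    confidence c_up at that point. A reliable low-resolution prior narrows
    the search; otherwise the full preset range is used."""
    if c_up >= c_valid:
        return (max(D_MIN, d_up - T), min(D_MAX, d_up + T))
    return (D_MIN, D_MAX)

print(search_range(120, 0.9))   # (110, 130): narrowed around the prior
print(search_range(120, 0.1))   # (0, 790): full preset range
```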
Pixel points whose confidence in the low-resolution layer exceeds their confidence at the higher resolution are taken as support points to update the support point set obtained from the high-resolution image:

d_m^{s+1} = d̂_m^s and (u_m, v_m) ∈ S,  if ĉ_m^s > c_m^{s+1}

wherein S denotes the support point set, c_m^{s+1} denotes the confidence value at the m-th support point pixel of the layer-(s+1) image, and d_m^{s+1} denotes the disparity value at the m-th support point pixel of the layer-(s+1) image.
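The confidence-guided update of the support point set can be sketched as below; the dict containers keyed by point index are an illustrative choice, not from the patent:

```python
def update_support(support, conf_hi, disp_hi, conf_up, disp_up):
    """For each candidate point m, adopt the upsampled low-resolution
    disparity whenever its confidence beats the high-resolution confidence,
    and add the point to the support set S."""
    for m in conf_up:
        if conf_up[m] > conf_hi.get(m, 0.0):
            disp_hi[m] = disp_up[m]   # take the low-resolution disparity
            support.add(m)            # (u_m, v_m) joins the support set S
    return support, disp_hi

support, disp = update_support(
    {0},                       # current support set
    {0: 0.6, 1: 0.3},          # layer-(s+1) confidence
    {0: 50, 1: 70},            # layer-(s+1) disparity
    {0: 0.4, 1: 0.9},          # upsampled layer-s confidence
    {0: 48, 1: 72})            # upsampled layer-s disparity
print(support, disp)           # {0, 1} {0: 50, 1: 72}
```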
As shown in fig. 2, after the image pyramid is combined with the cross skeleton, the number of computed support points is further reduced, and the probability of mismatched points occurring decreases.
The overall flow of the invention is shown in fig. 3, and the algorithm is evaluated according to it. Table 2 gives the evaluation and comparison results: on the Middlebury and KITTI public binocular stereo data sets, the algorithm improves disparity accuracy and precision on the one hand, and retains its high speed and efficiency on the other.
Table 2 data set evaluation results and comparisons
(Table 2 is provided as an image in the original publication.)
Wherein bad1.0 denotes the percentage of pixels whose experimentally obtained disparity differs from the ground-truth disparity by more than 1.0, and bad2.0 the percentage of pixels whose difference exceeds 2.0; averr denotes the mean absolute difference between the experimentally obtained disparity and the ground truth; time denotes the elapsed time. Out-Noc denotes the percentage of pixels in non-occluded regions whose disparity error exceeds the threshold 3, and Out-All the corresponding percentage over all regions; Avg-Noc denotes the mean absolute disparity error in non-occluded regions, Avg-All the mean absolute disparity error over all regions, and px denotes the pixel unit.
Table 3 and table 4 give the evaluation results of each algorithm for dense disparity and sparse disparity on the Middlebury 2014 data set, where H denotes evaluation on half-resolution images and F denotes evaluation on full-resolution images. The results of the proposed algorithm, shown in bold, improve on both the dense-disparity and the sparse-disparity evaluations.
Table 3 Middlebury2014 training dataset dense disparity map non-occlusion region bad1.0
(Table 3 is provided as an image in the original publication.)
Table 4 Middlebury2014 training dataset sparse disparity map non-occlusion region bad1.0
Algorithm ArtL Motor Piano Pipes Playrm PlaytP Recyc Shelvs Teddy Vintge
SGM(F) 3.59 17 8.63 7.9 13.6 7.17 8.15 15.5 3.6 7.24
SNCC(H) 8.63 20.1 8.36 10.5 17.2 10.1 12.1 19.7 4.99 10.4
CS-ELAS(F) 5.97 15.4 13.2 9.54 16.5 12.2 13.3 20.9 5.49 14.8
SGM-Forest(H) 8.23 8.4 17.2 8.23 21.7 15.1 14 39.1 6.06 26
DISCO(H) 7.36 14.5 15.8 17.4 22.3 12.3 11.9 24 9.17 29.2
LS-ELAS(F) 4.94 21.6 21.3 13.4 28.4 15.7 25.2 31.5 6.49 29.7
ELAS(F) 5.94 24.3 33.2 18.5 37.7 19.7 36.3 51.3 11.3 27.2
DecStereo(F) 27.4 25.3 24.7 26.1 30.2 22.8 30.8 45.1 13 40.9
All of the evaluation results show that the proposed cross skeleton not only reduces the number of image support points and the probability of mismatched points, but also lowers the mean absolute disparity error and improves the disparity accuracy at the support points; the proposed image pyramid combines high-resolution and low-resolution image information, increases the number of support points in weakly textured regions, improves the overall disparity, and yields better results on the public binocular stereo data sets.

Claims (7)

1. A three-dimensional image depth detection method combining a cross skeleton window and an image pyramid is characterized by comprising the following steps:
step 1: acquiring left and right images in real time through a binocular camera, wherein the left and right images form an original stereo image, and calculating a disparity map between the left and right images according to a change relation between a target and a background in the left and right images and using the disparity map as an initial disparity map of the stereo image;
step 2: calculating gradient information of the three-dimensional image in the X direction and the Y direction to obtain descriptors of pixel points;
step 3: constructing a cross skeleton window by combining the color similarity and the distance constraint of the gray-level image, selecting the branch with the minimum length among the four branches of the cross skeleton window, and taking the length of that branch as the minimum arm length of the cross skeleton window;
step 4: performing multiple Gaussian downsamplings on the original stereo image to construct an image pyramid;
step 5: sequentially updating the stereo images of adjacent higher layers of the image pyramid by using the initial disparity map, the cross skeleton window, and the stereo images of lower layers of the image pyramid, processing the image pyramid layer by layer upwards, and obtaining a final disparity map, thereby obtaining the depth of the stereo image.
2. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 1, wherein: the step 2 specifically comprises the following steps:
a gradient window is established for the X-direction and the Y-direction, respectively:
the gradient window in the X direction is 7×7; gradient information in the X direction is selected at 24 pixel points within the gradient window, the 24 pixel points comprising 22 pixel points uniformly selected along eight directions in the neighborhood of the central pixel of the gradient window, plus the central pixel point counted twice;
the gradient window in the Y direction is 5×5; gradient information in the Y direction is selected at 8 pixel points within the gradient window, the 8 pixel points comprising the four corner pixel points and the four edge-center pixel points of the gradient window;
and aiming at each pixel point, obtaining gradient information in the X direction and gradient information in the Y direction through gradient windows in the X direction and the Y direction, and forming a descriptor of the pixel point by the gradient information in the X direction and the gradient information in the Y direction of the pixel point.
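As an illustration of the descriptor construction described above, a sketch assuming Sobel gradients (the claim itself does not name a gradient operator) and showing the 8-point Y-window sampling:

```python
import numpy as np

# Sample offsets within the 5x5 Y-direction window: the four corners and
# the four edge centres, as described in the claim.
Y_OFFSETS = [(-2, -2), (-2, 2), (2, -2), (2, 2),
             (-2, 0), (2, 0), (0, -2), (0, 2)]

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)

def grad(img, kernel):
    """Dense gradient image by 3x3 correlation with edge padding."""
    p = np.pad(img, 1, mode='edge')
    h, w = img.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    return out

def descriptor_y(gy, u, v):
    """Y-part of the descriptor of pixel (u, v): the Y gradient sampled at
    the 8 points of the 5x5 window."""
    return np.array([gy[u + du, v + dv] for du, dv in Y_OFFSETS])

img = np.tile(np.arange(7.0), (7, 1))   # horizontal intensity ramp
gx = grad(img, SOBEL_X)
print(gx[3, 3])                          # 8.0: constant interior X gradient
```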
3. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 1, wherein:
in the step 4, the image pyramid comprises an original stereo image and a stereo image after each Gaussian downsampling; the original stereo image has the highest resolution, and the highest layer of the image pyramid is formed; the resolution of the stereo image after each Gaussian down-sampling is reduced, the stereo image after each Gaussian down-sampling forms one layer of an image pyramid, and the resolution of the stereo image after the last Gaussian down-sampling is the lowest, so that the lowest layer of the image pyramid is formed.
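The pyramid construction described above can be sketched with a separable binomial blur as the Gaussian approximation; the exact kernel and number of levels are not specified in the claim:

```python
import numpy as np

K = np.array([1., 4., 6., 4., 1.]) / 16.0   # binomial Gaussian approximation

def blur(img):
    """Separable 5-tap Gaussian-like blur with edge padding."""
    f = lambda r: np.convolve(np.pad(r, 2, mode='edge'), K, mode='valid')
    return np.apply_along_axis(f, 1, np.apply_along_axis(f, 0, img))

def build_pyramid(img, n=3):
    """Image pyramid by repeated Gaussian blur + 2x downsampling. The list
    is ordered coarsest first, so its last element (the original image) is
    the highest layer, matching the layer numbering used here."""
    levels = [img]
    for _ in range(n - 1):
        levels.append(blur(levels[-1])[::2, ::2])
    return levels[::-1]

pyr = build_pyramid(np.full((16, 16), 5.0), n=3)
print([lvl.shape for lvl in pyr])   # [(4, 4), (8, 8), (16, 16)]
```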
4. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 1, wherein: the step 5 specifically comprises the following steps:
step 5.1:
aiming at the stereo image at the lowest layer of the image pyramid, taking the initial disparity map as the disparity map of the lowest layer of the image pyramid, and processing according to that disparity map and the cross skeleton window to obtain the support point set and confidence of the lowest layer of the image pyramid;
step 5.2:
for the stereo image of the current layer of the image pyramid, performing Delaunay triangulation by taking a support point with a confidence coefficient higher than a preset first confidence coefficient threshold value as a vertex of the triangulation to establish a plurality of triangular parallax planes;
in the current disparity map, performing linear interpolation on all pixels in the triangular disparity plane through the triangular disparity plane and three vertexes thereof to update the disparity map, simultaneously performing processing according to the updated disparity map and a cross skeleton window to obtain matching similarity and confidence coefficient of each pixel in the stereo image, forming a confidence coefficient map by the confidence coefficients of all the pixels, and obtaining a confidence coefficient map of the stereo image at the current layer of the image pyramid;
step 5.3:
according to the disparity map of the current layer of the image pyramid and the confidence map of the stereo image, obtaining a higher-resolution disparity prior map and a higher-resolution confidence prior map by respective nearest-neighbor interpolation; selecting pixel points in the confidence prior map whose confidence values are higher than a preset second confidence threshold as supplementary support points of the higher layer of the image pyramid, and assigning the disparity values of the supplementary support points in the disparity prior map to the updated disparity map as the disparity values of the pixel points corresponding to the supplementary support points, thereby obtaining the disparity map of the higher layer of the image pyramid;
step 5.4:
processing the stereo image of the higher layer of the image pyramid according to the disparity map of the higher layer of the image pyramid and the cross skeleton window to obtain a support point set of the higher layer of the image pyramid and a confidence coefficient of the support point set;
step 5.5: supplementing the supplementary support points obtained in the step 5.3 into the support point set obtained in the step 5.4 to obtain a union set;
step 5.6: returning to step 5.2 and repeating steps 5.2 to 5.5 for iterative updating, each round yielding the disparity map of the next higher layer of the image pyramid, until the highest layer of the image pyramid is reached; the disparity map of the highest layer is taken as the final disparity map, and the depth of the stereoscopic image is obtained from the final disparity map.
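The per-triangle linear interpolation of step 5.2 amounts to evaluating the plane through the triangle's three support-point vertices; a sketch:

```python
import numpy as np

def plane_from_vertices(p0, p1, p2):
    """Coefficients (a, b, c) of the disparity plane d = a*u + b*v + c
    through three triangulation vertices, each given as (u, v, d)."""
    A = np.array([[p[0], p[1], 1.0] for p in (p0, p1, p2)])
    d = np.array([p[2] for p in (p0, p1, p2)])
    return np.linalg.solve(A, d)

# One Delaunay triangle whose vertices are high-confidence support points:
a, b, c = plane_from_vertices((0, 0, 10.0), (4, 0, 18.0), (0, 4, 10.0))
# Linear interpolation of any pixel inside that triangle:
u, v = 2, 1
print(a * u + b * v + c)   # 14.0 (up to floating-point rounding)
```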
5. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 4, wherein: in the step 5.2, processing is performed according to the updated disparity map and the cross skeleton window to obtain the matching similarity and the confidence of each pixel in the stereo image, specifically:
5.2.1, taking one of the stereo images as a current image and the other image as a reference image, and traversing each pixel point in the current image as a current pixel point according to the following modes:
a pixel point corresponding to the disparity of the current pixel point is first found in the reference image using the updated disparity map; the matching similarity between the current pixel point and its disparity-corresponding pixel point is calculated from their descriptors and taken as the matching similarity of the current pixel point; the confidence of the current pixel point is then calculated from its matching similarity;
and 5.2.2, interchanging the current image and the reference image, and repeating the step 5.2.1 to obtain the matching similarity and the confidence coefficient of each pixel of the two images in the stereo image.
6. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 4, wherein: step 5.1 and step 5.4 each specifically comprise the following:
any one of the stereo images is used as a current image, and the other one is used as a reference image; uniformly sampling the current image along the horizontal and vertical coordinates by a fixed step length to obtain candidate support points, and traversing each candidate support point according to the following modes: in the current image, firstly, a matching window neighborhood is established by taking a candidate support point as a central pixel and taking the minimum arm length of a cross framework window as the radius of the neighborhood, and 9 key points are selected in the matching window neighborhood; then, a disparity map is utilized to find pixel points corresponding to the disparity of the key points in the reference image, descriptors of the key points and the pixel points corresponding to the disparity of the key points are utilized to calculate to obtain matching similarity between the key points and the pixel points corresponding to the disparity of the key points, the matching similarity is used as the matching similarity of the key points, and the matching similarity of all the key points is added to obtain the matching similarity of the candidate support points;
then, judging the matching similarity of the candidate support points:
if the matching similarity is larger than a preset similarity threshold, reserving the candidate support points as support points;
if the matching similarity is not greater than a preset similarity threshold, discarding the candidate support points, and not taking the candidate support points as the support points;
and traversing each candidate support point to obtain all support points, constructing a support point set by all the support points, and calculating the confidence of the support points according to the matching similarity of the support points.
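The candidate sampling and thresholding described above can be sketched as follows; the step size and the similarity function used here are placeholders, since the claim leaves both unspecified:

```python
def candidate_grid(h, w, step):
    """Uniformly sample candidate support points along both image axes with
    a fixed step."""
    return [(u, v) for u in range(0, h, step) for v in range(0, w, step)]

def filter_supports(candidates, similarity, thresh):
    """Keep candidates whose summed keypoint matching similarity exceeds
    the preset threshold; discard the rest."""
    return [p for p in candidates if similarity(p) > thresh]

cands = candidate_grid(10, 10, step=5)
print(cands)   # [(0, 0), (0, 5), (5, 0), (5, 5)]
kept = filter_supports(cands, lambda p: p[0] + p[1], thresh=4)
print(kept)    # [(0, 5), (5, 0), (5, 5)]
```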
7. The method for detecting the depth of the stereoscopic image by combining the cross skeleton window and the image pyramid as claimed in claim 4, wherein: in said step 5.1 and step 5.4, when the matching similarity of the candidate support point with central pixel (u_n, v_n) is calculated, the matching similarities of all the keypoints are added to obtain the matching similarity of the candidate support point, specifically according to the following formula:

E(u_n, v_n, d_n) = Σ ||f(u^(l), v^(l)) − f(u^(r), v^(r))||_1

where the sum runs over the 9 keypoints with u^(l) ∈ {u_n − a, u_n, u_n + a} and v^(l) ∈ {v_n − a, v_n, v_n + a}, and the corresponding keypoint in the reference image is given by u^(r) = u^(l) − d_n, v^(r) = v^(l);

wherein a is the minimum arm length of the cross skeleton window obtained in step 3; n denotes the pixel index of the image; u_n and v_n respectively denote the abscissa and the ordinate of the n-th pixel of the image; d_n is the disparity value being traversed for that pixel; E(u_n, v_n, d_n) denotes the matching cost of pixel (u_n, v_n) at disparity d_n; u^(l) and v^(l) respectively denote the horizontal and vertical coordinates of a selected keypoint in the current image, l denoting the current image; u^(r) and v^(r) respectively denote the pixel abscissa and ordinate of the corresponding keypoint in the reference image, r denoting the reference image; f(·) denotes the descriptor of a pixel; and ||·||_1 denotes the L1 norm.
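A sketch of this matching-cost computation, with descriptors stored in a lookup table and a 3×3 keypoint layout as our reading of the formula:

```python
import numpy as np

def match_cost(f_cur, f_ref, u, v, d, a):
    """E(u_n, v_n, d_n): summed L1 difference between descriptors at the 9
    keypoints of the matching window (centre (u, v), radius a) and their
    disparity-shifted counterparts in the reference image. f_cur / f_ref
    map pixel coordinates to descriptor vectors."""
    cost = 0.0
    for du in (-a, 0, a):
        for dv in (-a, 0, a):
            ul, vl = u + du, v + dv        # keypoint in the current image
            ur, vr = ul - d, vl            # matching pixel in the reference
            cost += float(np.sum(np.abs(f_cur[(ul, vl)] - f_ref[(ur, vr)])))
    return cost

# Toy descriptors: the reference image is the current image shifted by 2,
# so the cost vanishes at the true disparity d = 2.
f_cur = {(u, v): np.array([float(u + v)]) for u in range(10) for v in range(10)}
f_ref = {(u - 2, v): desc for (u, v), desc in f_cur.items()}
print(match_cost(f_cur, f_ref, 5, 5, 2, a=1))   # 0.0
print(match_cost(f_cur, f_ref, 5, 5, 1, a=1))   # 9.0
```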
CN202210792911.7A 2022-07-05 2022-07-05 Stereoscopic image depth detection method combining cross skeleton window and image pyramid Active CN115018934B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210792911.7A CN115018934B (en) 2022-07-05 2022-07-05 Stereoscopic image depth detection method combining cross skeleton window and image pyramid


Publications (2)

Publication Number Publication Date
CN115018934A true CN115018934A (en) 2022-09-06
CN115018934B CN115018934B (en) 2024-05-31

Family

ID=83079485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210792911.7A Active CN115018934B (en) 2022-07-05 2022-07-05 Stereoscopic image depth detection method combining cross skeleton window and image pyramid

Country Status (1)

Country Link
CN (1) CN115018934B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132761A (en) * 2023-08-25 2023-11-28 京东方科技集团股份有限公司 Target detection method and device, storage medium and electronic equipment

Citations (9)

Publication number Priority date Publication date Assignee Title
CN106355570A (en) * 2016-10-21 2017-01-25 昆明理工大学 Binocular stereoscopic vision matching method combining depth characteristics
WO2018098891A1 (en) * 2016-11-30 2018-06-07 成都通甲优博科技有限责任公司 Stereo matching method and system
CN110070557A (en) * 2019-04-07 2019-07-30 西北工业大学 A kind of target identification and localization method based on edge feature detection
CN110378916A (en) * 2019-07-03 2019-10-25 浙江大学 A kind of TBM image based on multitask deep learning is slagged tap dividing method
CN112991420A (en) * 2021-03-16 2021-06-18 山东大学 Stereo matching feature extraction and post-processing method for disparity map
CN113362457A (en) * 2021-08-10 2021-09-07 成都信息工程大学 Stereoscopic vision measurement method and system based on speckle structured light
CN113421230A (en) * 2021-06-08 2021-09-21 浙江理工大学 Vehicle-mounted liquid crystal display light guide plate defect visual detection method based on target detection network
US20210390725A1 (en) * 2019-01-23 2021-12-16 Shanghaitech University Adaptive stereo matching optimization method and apparatus, device and storage medium
US20220198694A1 (en) * 2020-01-10 2022-06-23 Dalian University Of Technology Disparity estimation optimization method based on upsampling and exact rematching


Non-Patent Citations (4)

Title
CHENGLONG XU等: "Accurate and Efficient Stereo Matching by Log-Angle and Pyramid-Tree", 《IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY》, 15 December 2020 (2020-12-15) *
YUE XIA等: "Efficient Large Scale Stereo Matching based on Cross-Scale", 《2022 17TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV)》, 10 January 2023 (2023-01-10) *
唐灿; 唐亮贵; 刘波: "A survey of image feature detection and matching methods", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 03, 28 May 2020 (2020-05-28) *
陈卉; 胡立坤; 黄钰雯: "Stereo matching algorithm using Gaussian mixture model and tree structure", Computer Engineering and Applications, no. 20, 15 October 2017 (2017-10-15) *


Also Published As

Publication number Publication date
CN115018934B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN108776989B (en) Low-texture planar scene reconstruction method based on sparse SLAM framework
CN110211169B (en) Reconstruction method of narrow baseline parallax based on multi-scale super-pixel and phase correlation
CN112435262A (en) Dynamic environment information detection method based on semantic segmentation network and multi-view geometry
CN108876861B (en) Stereo matching method for extraterrestrial celestial body patrolling device
CN111340922A (en) Positioning and mapping method and electronic equipment
CN113592026A (en) Binocular vision stereo matching method based on void volume and cascade cost volume
EP3293700A1 (en) 3d reconstruction for vehicle
CN113963117B (en) Multi-view three-dimensional reconstruction method and device based on variable convolution depth network
CN102903111B (en) Large area based on Iamge Segmentation low texture area Stereo Matching Algorithm
CN111553845A (en) Rapid image splicing method based on optimized three-dimensional reconstruction
CN115018934B (en) Stereoscopic image depth detection method combining cross skeleton window and image pyramid
CN113887624A (en) Improved feature stereo matching method based on binocular vision
CN113313740B (en) Disparity map and surface normal vector joint learning method based on plane continuity
CN111429571A (en) Rapid stereo matching method based on spatio-temporal image information joint correlation
CN107122782B (en) Balanced semi-dense stereo matching method
CN111179327B (en) Depth map calculation method
Le Besnerais et al. Dense height map estimation from oblique aerial image sequences
CN114998412B (en) Shadow region parallax calculation method and system based on depth network and binocular vision
Nefian et al. A Bayesian formulation for sub-pixel refinement in stereo orbital imagery
CN112348859A (en) Asymptotic global matching binocular parallax acquisition method and system
Mondal et al. Performance review of the stereo matching algorithms
Unger et al. Efficient stereo matching for moving cameras and decalibrated rigs
CN110414337B (en) Target attitude detection system and detection method thereof
Sanfourche et al. Height estimation using aerial side looking image sequences
WO2011080669A1 (en) System and method for reconstruction of range images from multiple two-dimensional images using a range based variational method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant