CN110533710B

CN110533710B - Method and processing device for binocular matching algorithm based on GPU

Info

Publication number: CN110533710B
Application number: CN201910779546.4A
Authority: CN
Inventors: 符强; 罗鑫禹; 孙希延; 纪元法; 任风华; 严素清; 付文涛
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2019-08-22
Filing date: 2019-08-22
Publication date: 2023-07-14
Anticipated expiration: 2039-08-22
Also published as: CN110533710A

Abstract

The embodiments of the invention disclose a method and a processing device for a binocular matching algorithm based on a graphics processing unit (GPU), which improve the operating efficiency of the image matching algorithm in binocular vision and the real-time performance of binocular depth perception. The method of the embodiments comprises the following steps: acquiring first picture data and second picture data which are respectively acquired by different cameras; performing cost calculation according to the first picture data and the second picture data to obtain cost values; synchronously performing cost aggregation calculation in a first direction, a second direction and a third direction according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; performing cost aggregation calculation in a fourth direction according to the cost values to obtain a cost aggregation value in the fourth direction; and determining a disparity value according to the cost aggregation values in the first direction, the second direction, the third direction and the fourth direction.

Description

Method and processing device for binocular matching algorithm based on GPU

Technical Field

The invention relates to the field of image processing, in particular to a method and a processing device for a binocular matching algorithm based on a GPU.

Background

With the development of science and technology, unmanned robots are being widely used in many fields, and they share a common requirement: sensing distance. Current ranging methods fall into two main types: active ranging, such as ultrasonic ranging and infrared ranging, and passive ranging, such as binocular vision ranging. Although active ranging is simple in principle and offers good real-time performance, it is easily affected by the reflecting surface of the object, the ambient light and the like, and is therefore not used as the main ranging method in the unmanned machine field.

Binocular vision ranging obtains images of a scene through two cameras, calculates the disparity from the different imaging positions of the scene in the two cameras, and then calculates the final distance from the estimated disparity. Existing binocular vision ranging algorithms involve a large amount of computation in the image matching stage, so real-time performance is difficult to guarantee and binocular vision technology cannot be applied well to unmanned robots.

Disclosure of Invention

The embodiments of the invention provide a method and a processing device for a binocular matching algorithm based on a graphics processing unit (GPU), which improve the operating efficiency of the image matching algorithm in binocular vision and the real-time performance of binocular depth perception.

In view of this, a first aspect of the present invention provides a method for a GPU-based binocular matching algorithm, which may include:

acquiring first picture data and second picture data, wherein the first picture data and the second picture data are respectively acquired by different cameras;

according to the first picture data and the second picture data, cost calculation is carried out, and cost value is obtained;

synchronously carrying out cost aggregation calculation in the first direction, the second direction and the third direction according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction;

performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction;

and determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction.

Optionally, in some embodiments of the present invention, the calculating the cost according to the first picture data and the second picture data to obtain a cost value includes:

and according to the first picture data and the second picture data, performing cost calculation through blocks with different arrangements to obtain a cost value, wherein each thread in each block correspondingly processes one pixel.

Optionally, in some embodiments of the present invention, the performing, according to the cost value, cost aggregation calculation in the first direction, the second direction, and the third direction to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction, and a cost aggregation value in the third direction includes:

according to the cost value, synchronously performing cost aggregation calculation in the first direction, the second direction and the third direction through an SGM binocular image matching algorithm to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction;

and performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction, wherein the cost aggregation calculation comprises the following steps:

and according to the cost value, performing cost aggregation calculation in the fourth direction through an SGM binocular image matching algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the present invention, the performing cost aggregation calculation in the first direction, the second direction and the third direction according to the cost value to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction, and performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction includes:

according to the cost values, determining cost values in a first direction, a second direction, a third direction and a fourth direction;

according to the cost value of the first direction, the cost value of the second direction and the cost value of the third direction, synchronously performing cost aggregation calculation of the first direction, the second direction and the third direction through a butterfly sequencing algorithm to obtain a cost aggregation value of the first direction, a cost aggregation value of the second direction and a cost aggregation value of the third direction;

and according to the cost value in the fourth direction, performing cost aggregation calculation in the fourth direction through a butterfly sequencing algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the present invention, the determining the disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction, and the cost aggregation value in the fourth direction includes:

Accumulating the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction under the condition of different parallax values to obtain accumulated aggregation values corresponding to the different parallax values;

and determining the minimum value in the accumulated aggregate values corresponding to the different parallax values as the parallax value through a butterfly ordering algorithm.

A second aspect of the present invention provides a processing apparatus, which may include:

the acquisition module is used for acquiring first picture data and second picture data, wherein the first picture data and the second picture data are acquired by different cameras respectively;

the processing module is used for carrying out cost calculation according to the first picture data and the second picture data to obtain cost value; synchronously carrying out cost aggregation calculation in the first direction, the second direction and the third direction according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction; and determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module is specifically configured to perform cost calculation through blocks arranged differently according to the first picture data and the second picture data, so as to obtain a cost value, where each thread in each block processes a pixel correspondingly.

Optionally, in some embodiments of the invention,

the processing module is specifically configured to synchronously perform cost aggregation calculation in a first direction, a second direction and a third direction according to the cost value through an SGM binocular image matching algorithm, so as to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; and according to the cost value, performing cost aggregation calculation in the fourth direction through an SGM binocular image matching algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module is specifically configured to determine a cost value in a first direction, a cost value in a second direction, a cost value in a third direction, and a cost value in a fourth direction according to the cost value; according to the cost value of the first direction, the cost value of the second direction and the cost value of the third direction, synchronously performing cost aggregation calculation of the first direction, the second direction and the third direction through a butterfly sequencing algorithm to obtain a cost aggregation value of the first direction, a cost aggregation value of the second direction and a cost aggregation value of the third direction; and according to the cost value in the fourth direction, performing cost aggregation calculation in the fourth direction through a butterfly sequencing algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module is specifically configured to accumulate the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction, and the cost aggregation value in the fourth direction under the condition of different parallax values, so as to obtain accumulated aggregation values corresponding to the different parallax values; and determining the minimum value in the accumulated aggregate values corresponding to the different parallax values as the parallax value through a butterfly ordering algorithm.

A third aspect of the present invention provides a processing apparatus, which may include:

the device comprises a transceiver, a processor and a memory, wherein the transceiver, the processor and the memory are connected through a bus;

the memory is used for storing operation instructions;

the transceiver is used for acquiring first picture data and second picture data, wherein the first picture data and the second picture data are respectively acquired by different cameras;

the processor is configured to invoke the operation instruction to perform the steps of the method of the GPU-based binocular matching algorithm according to the first aspect of the present invention and any optional implementation of the first aspect.

A fourth aspect of the present invention provides a readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of a method of GPU-based binocular matching algorithms as described in the first aspect and any of the alternative implementations of the first aspect of the present invention.

From the above technical solutions, the embodiment of the present invention has the following advantages:

In the embodiments of the invention, first picture data and second picture data are acquired, wherein the first picture data and the second picture data are respectively acquired by different cameras; cost calculation is performed according to the first picture data and the second picture data to obtain cost values; cost aggregation calculation in the first direction, the second direction and the third direction is performed synchronously according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; cost aggregation calculation in the fourth direction is performed according to the cost values to obtain a cost aggregation value in the fourth direction; and a disparity value is determined according to the cost aggregation values in the first direction, the second direction, the third direction and the fourth direction. The method exploits the fact that the GPU operates in parallel and is suited to large-scale computation, introduces the GPU into the binocular matching algorithm, improves the operating efficiency of the image matching algorithm in binocular vision, and improves the real-time performance of the binocular matching algorithm.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments and the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings.

FIG. 1 is a schematic diagram of a fusion of cost aggregation and parallax computation in an embodiment of the present invention;

FIG. 2 is a schematic diagram of one embodiment of a method of a GPU-based binocular matching algorithm in an embodiment of the present invention;

FIG. 3A is a timing diagram of the streams of a GPU according to an embodiment of the present invention;

FIG. 3B is a schematic diagram of cost calculation according to an embodiment of the present invention;

FIG. 3C is a schematic diagram of a cost aggregation design according to an embodiment of the present invention;

FIG. 3D is a schematic diagram of cost calculation optimization in an embodiment of the present invention;

FIG. 3E is a diagram illustrating a length of an array shared_base storing reference pixels according to an embodiment of the present invention;

FIG. 3F is a diagram of a butterfly ordering algorithm in accordance with an embodiment of the invention;

FIG. 4 is a schematic diagram of an embodiment of a processing device according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art may better understand the present invention, the technical solutions in the embodiments of the invention are described below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. Other embodiments obtained on the basis of the embodiments of the present invention shall fall within the scope of protection of the present invention.

The invention exploits the fact that the graphics processing unit (GPU) operates in parallel and is suited to large-scale computation, and introduces the GPU into the binocular matching algorithm to improve its real-time performance. For the binocular matching algorithm to perform better, however, its implementation on the GPU is redesigned; that is, a series of GPU optimization schemes is designed for the operation flow of the binocular matching algorithm. First, in the overall algorithm architecture, cost aggregation and parallax computation are fused together to improve the data reuse rate, as shown in FIG. 1, which is a schematic diagram of the fusion of cost aggregation and parallax computation in the embodiment of the invention.

In the following, by way of example, the technical solution of the present invention is further described, as shown in fig. 2, which is a schematic diagram of an embodiment of a method for implementing a GPU-based binocular matching algorithm in an embodiment of the present invention, and may include:

201. Acquire first picture data and second picture data, wherein the first picture data and the second picture data are respectively acquired by different cameras.

It will be appreciated that, since the first picture data and the second picture data are acquired by different cameras, the first picture data may be acquired by a left camera and the second picture data by a right camera. Different GPU resource allocation schemes can be designed for the different operation flows of different binocular matching algorithms. FIG. 3A is a timing diagram of the streams of a GPU according to an embodiment of the present invention. The invention runs 3 streams in parallel to improve the operating efficiency of the matching algorithm. As shown in FIG. 3A, in the cost aggregation stage the GPU is allowed to compute the costs of 3 directions simultaneously. Because the invention fuses cost aggregation with disparity computation, the disparity computation can be executed only after the computation of the other directions in the cost aggregation is complete; tasks in the same stream are executed sequentially, which ensures that the cost aggregation of the first three directions has been completed.

In the cost calculation stage, a single cost calculation step is simple and easy to parallelize, so the design is simple, practical and efficient: each thread is responsible for one pixel. In this way a whole picture can be processed in one pass. A two-dimensional block is first designed, with each block containing 32 threads in the x-direction and 32 threads in the y-direction; 32 is chosen because it is exactly the number of threads in a thread bundle (warp). Next, a two-dimensional thread grid (grid) is set: the x-direction of the grid contains cols/blockDim.x blocks, and the y-direction of the grid contains rows/blockDim.y blocks. FIG. 3B is a schematic diagram of cost calculation according to an embodiment of the present invention.

In the cost aggregation stage, each block is designed to compute the optimal disparity of two pixels. To improve the efficiency of data interaction, the fact that threads in the same warp can share data directly with one another is exploited, and all disparities of one pixel in a given direction are computed by one warp. One warp contains at most 32 threads, and the invention uses one warp to compute all the disparity costs of one pixel, so processing two pixels requires two warps and one block contains 64 threads. For 32 threads to cover all disparity costs, each thread is designed to handle MAX_DISPARITY/32 disparities. The overall idea is similar to that of the cost calculation: each block processes two rows of pixels, and a for loop inside the block steps through the pixels of a row. The resulting resource allocation is shown in FIG. 3C, which is a schematic design diagram of cost aggregation in the embodiment of the invention.

It should be noted that the GPU architecture also differs for different directions, but the overall idea is to process one row or one column per grid and two pixels per block. With this GPU-based optimization, the operating efficiency of the binocular vision matching algorithm is remarkably improved. For example, the image processing speed reaches 42 FPS on an NVIDIA TX2 processor, so the method can be applied to an unmanned aerial vehicle obstacle avoidance system.

202. Perform cost calculation according to the first picture data and the second picture data to obtain cost values.

The calculating the cost according to the first picture data and the second picture data to obtain a cost value may include: and according to the first picture data and the second picture data, performing cost calculation through blocks with different arrangements to obtain a cost value, wherein each thread in each block correspondingly processes one pixel.

A brief description of the cost calculation follows. In the embodiment of the present invention, the cost calculation may use a center-symmetric census transform, which reduces memory use to some extent while resisting illumination changes as well as the traditional census transform. The invention exploits the parallelism of the GPU to optimize the center-symmetric census transform; the specific optimization approach is as follows:

First, a two-dimensional thread block (block) is designed, each block containing 32 threads in the x-direction and 32 threads in the y-direction. It will be appreciated that 32 is chosen because it is exactly the number of threads in one warp. Next, a two-dimensional grid is set: the x-direction of the grid contains cols/blockDim.x blocks, and the y-direction of the grid contains rows/blockDim.y blocks.

For the center-symmetric census transform, a single calculation step is simple and easy to parallelize, so the design is simple, practical and efficient: each thread is responsible for one pixel. In this way a whole picture can be processed in one pass. The resource structure of the GPU is illustrated in FIG. 3B, and the resolution of the image acquired by the camera is 640 × 480. A two-dimensional block is first designed, with each block containing 32 threads in the x-direction and 32 threads in the y-direction; 32 is chosen because it is exactly the number of threads in a warp. Next, a two-dimensional grid is set: the x-direction of the grid contains cols/blockDim.x blocks and the y-direction contains rows/blockDim.y blocks. Taking 640 × 480 as an example, the x-direction of the grid contains cols/32 = 20 blocks and the y-direction contains rows/32 = 15 blocks. In this way, each thread processes exactly one pixel.
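The block and grid arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `launch_config` and `pixel_of_thread` are hypothetical, and the ceiling division is an assumption added so that image sizes that are not multiples of 32 would still be covered (the text only gives the exact-multiple case).

```python
def launch_config(cols, rows, block_x=32, block_y=32):
    """Return (grid_x, grid_y) so that every pixel is assigned one thread."""
    # Ceiling division (assumption): also covers sizes not divisible by 32.
    grid_x = (cols + block_x - 1) // block_x
    grid_y = (rows + block_y - 1) // block_y
    return grid_x, grid_y


def pixel_of_thread(block_idx, thread_idx, block_dim=(32, 32)):
    """Map a (block, thread) pair to the (x, y) pixel it is responsible for."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    return x, y


print(launch_config(640, 480))                 # (20, 15)
print(pixel_of_thread((19, 14), (31, 31)))     # (639, 479), the last pixel
```

With a 640 × 480 image the last thread of the last block lands on pixel (639, 479), so no thread falls outside the image in this case.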

The cost calculation is then carried out. From binocular vision theory, the whole matching cost space is W × H × D, and the invention is designed accordingly. FIG. 3D is a schematic diagram of cost calculation optimization according to an embodiment of the present invention.

Since the conventional cost space is W × H × D, while the disparity cost of a pixel in a given direction is in fact a single pixel cost, a conventional implementation of the cost calculation carries a large amount of memory redundancy. To maximize data reuse and reduce the amount of stored data, the invention does not let each thread process one pixel as in the census transform; instead, each block processes one row of pixels, a for loop inside the block steps through the pixels of the row, and all disparities of each pixel are computed by the threads at once. That is, H blocks are allocated per grid, D threads are allocated per block, and the for loop in the block repeats W times.

For example, 480 blocks may be designed, with 128 threads in each block, so that each block is responsible for one row of pixels. The variable data structures therefore also differ from the conventional scheme: the array shared_base storing the reference pixels has length D, and the array shared_match storing the pixels to be compared has length 2D. The data structure is shown in FIG. 3E, which is a schematic diagram of the length of the array shared_base storing the reference pixels in the embodiment of the present invention. The first D entries of shared_match store the costs of the previous 128 pixels, and the following D entries store the costs of the next 128 pixels.
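As a CPU-side illustration of the cost calculation, the sketch below computes a center-symmetric census descriptor and the matching costs of one pixel over all disparities. Several details are assumptions not stated in the text: a 3 × 3 window, the comparison `a > b` for each mirrored pair, the Hamming distance as the cost, the left image as the reference with the candidate at column x - d of the right image, and the maximum cost for candidates outside the image. The function names are hypothetical.

```python
def cs_census(img, y, x, radius=1):
    """Center-symmetric census descriptor of pixel (y, x), packed into an int."""
    bits = 0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if (dy, dx) >= (0, 0):
                continue  # skip the centre and the mirrored half of the window
            a = img[y + dy][x + dx]      # pixel on one side of the centre
            b = img[y - dy][x - dx]      # its mirror image about the centre
            bits = (bits << 1) | (1 if a > b else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


def pixel_costs(left, right, y, x, max_disp, radius=1):
    """Matching costs C(p, d) for d = 0 .. max_disp - 1 at pixel p = (y, x)."""
    n_bits = ((2 * radius + 1) ** 2 - 1) // 2   # 4 bits for a 3 x 3 window
    base = cs_census(left, y, x, radius)
    costs = []
    for d in range(max_disp):
        xm = x - d                               # candidate column in the right image
        if xm - radius < 0:
            costs.append(n_bits)                 # outside the image: worst cost (assumption)
        else:
            costs.append(hamming(base, cs_census(right, y, xm, radius)))
    return costs
```

In the scheme described above, the loop over `d` would be spread across the D threads of a block and the loop over the pixels of a row would be the for loop inside the block; the sketch keeps both sequential for readability.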

203. Synchronously perform cost aggregation calculation in the first direction, the second direction and the third direction according to the cost values, to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction.

The step of synchronously performing cost aggregation calculation in the first direction, the second direction and the third direction according to the cost value to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction may include: according to the cost value, the cost aggregation calculation of the first direction, the second direction and the third direction is synchronously carried out through an SGM binocular image matching algorithm, and the cost aggregation value of the first direction, the cost aggregation value of the second direction and the cost aggregation value of the third direction are obtained.

It should be noted that, for cost aggregation, the invention adopts the aggregation theory of the SGM (semi-global matching) binocular image matching algorithm, in which cost aggregation must be performed in multiple directions. To further improve the operating efficiency, multiple streams can therefore take part in the cost aggregation. Illustratively, only four directions are aggregated in this example: the cost aggregation value of the first direction is processed by stream0, that of the second direction by stream1, and that of the third direction by stream2. To increase data utilization, however, the cost aggregation value of the fourth direction is not assigned to a separate stream3 but is processed by stream1. The reason is that, to increase memory utilization, the optimal disparity is computed at the same time as the fourth direction is aggregated, so the computation of the other directions must be complete before this operation can run, and tasks in the same stream are executed sequentially. This yields the design in which 3 directions are computed in parallel, and the cost aggregation of the fourth direction together with the disparity optimization is performed after they finish.

Since the SGM algorithm needs to complete the following formula calculation:

$$L_r(p, d) = C(p, d) + \min\Big( L_r(p-r, d),\; L_r(p-r, d-1) + P_1,\; L_r(p-r, d+1) + P_1,\; \min_i L_r(p-r, i) + P_2 \Big) - \min_k L_r(p-r, k)$$

wherein, in the above formula, the meaning indicated by each parameter is as follows:

L_r(p, d): the cost aggregation value of matching point p at disparity d along direction r;

C(p, d): the matching cost value of matching point p at disparity d;

L_r(p-r, d): the cost aggregation value of the previous matching point along direction r at the same disparity d;

L_r(p-r, d-1): the cost aggregation value of the previous matching point at the disparity minus one;

L_r(p-r, d+1): the cost aggregation value of the previous matching point at the disparity plus one;

min_i L_r(p-r, i): the minimum cost aggregation value over all disparities of the previous matching point (the term in k is the same);

P1, P2: adjustable parameters that penalize disparity changes and are used to fine-tune the algorithm.

The cost of one pixel involves the adjacent pixel: the costs of the adjacent pixel at the neighbouring disparities, its cost at the current disparity, and the minimum of its costs over all disparities must be compared, so there is considerable data reuse and data interaction. If each thread processed one disparity cost as in the traditional scheme, data communication between different threads would reduce the parallel efficiency. The invention therefore provides a more efficient processing scheme with different architecture design strategies for different directions; the design strategy is shown in FIG. 3C.
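To make the data dependencies of the recurrence concrete, the following sequential Python sketch aggregates costs along a single path. It uses the usual form of the SGM recurrence, including the P2 term and the subtraction of the previous pixel's minimum, which is assumed here; the function name `aggregate_path` and the toy numbers are hypothetical, and nothing in it reflects the GPU layout of the invention.

```python
def aggregate_path(costs, p1, p2):
    """Aggregate matching costs along one path.

    costs[i][d] is the matching cost C of the i-th pixel on the path at
    disparity d; the result has the same shape and holds L_r.
    """
    n_disp = len(costs[0])
    prev = list(costs[0])            # the first pixel has no predecessor
    out = [prev]
    for c in costs[1:]:
        prev_min = min(prev)         # minimum over all disparities of the previous pixel
        cur = []
        for d in range(n_disp):
            best = prev[d]                               # same disparity
            if d > 0:
                best = min(best, prev[d - 1] + p1)       # disparity minus one
            if d + 1 < n_disp:
                best = min(best, prev[d + 1] + p1)       # disparity plus one
            best = min(best, prev_min + p2)              # any larger jump
            cur.append(c[d] + best - prev_min)
        out.append(cur)
        prev = cur
    return out


path = [[5, 0, 5], [5, 5, 0], [0, 5, 5]]
print(aggregate_path(path, p1=1, p2=3))
# [[5, 0, 5], [6, 5, 1], [3, 6, 5]]
```

Each output value needs the previous pixel's costs at d - 1, d and d + 1 plus its overall minimum, which is the data sharing between disparities that the text assigns to the threads of one warp.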

204. Perform cost aggregation calculation in the fourth direction according to the cost values to obtain a cost aggregation value in the fourth direction.

The performing the cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction may include: and according to the cost value, performing cost aggregation calculation in the fourth direction through an SGM binocular image matching algorithm to obtain a cost aggregation value in the fourth direction.

205. Determine a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction.

The determining the disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction, and the cost aggregation value in the fourth direction may include: accumulating the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction under the condition of different parallax values to obtain accumulated aggregation values corresponding to the different parallax values; and determining the minimum value in the accumulated aggregate values corresponding to the different parallax values as the parallax value through a butterfly ordering algorithm.
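The accumulation over directions and the minimum search can be sketched on the CPU as follows. The sketch simulates one warp of 32 lanes, each holding one accumulated cost, and a butterfly exchange in which lane i reads from lane i XOR offset for offsets 1, 2, 4, 8 and 16. This models the general warp-shuffle reduction pattern and is an assumption about how the butterfly ordering is organized; the names `accumulate` and `butterfly_argmin` are hypothetical.

```python
WARP_SIZE = 32


def accumulate(directions):
    """Sum the aggregated costs of all directions, disparity by disparity."""
    return [sum(vals) for vals in zip(*directions)]


def butterfly_argmin(values):
    """Return (minimum cost, disparity) using a simulated XOR butterfly."""
    assert len(values) == WARP_SIZE
    # Each lane starts with its own (cost, disparity) pair.
    lanes = [(v, d) for d, v in enumerate(values)]
    offset = 1
    while offset < WARP_SIZE:
        # Every lane reads the pair held by lane (i XOR offset) and keeps
        # the smaller one; the new list is built from the old one, so all
        # exchanges of a round happen "simultaneously".
        lanes = [min(lanes[i], lanes[i ^ offset]) for i in range(WARP_SIZE)]
        offset *= 2
    return lanes[0]   # after 5 rounds every lane holds the same result


# Toy data: four directions whose summed cost is smallest at disparity 7.
dirs = [[(d - 7) ** 2 + k for d in range(WARP_SIZE)] for k in range(4)]
print(butterfly_argmin(accumulate(dirs)))   # (6, 7)
```

Only log2(32) = 5 exchange rounds are needed. When more than 32 disparities are searched, each lane would first reduce its own disparities locally before the butterfly starts.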

It can be understood that each block is designed to compute the optimal disparity of two pixels. To improve the efficiency of data interaction, all disparities of one pixel in a given direction are computed by a single warp, exploiting the fact that threads in the same warp can share data with each other directly. Since a warp on the Pascal architecture of the current TX2 platform contains at most 32 threads, the invention uses one warp to compute all the disparity costs of one pixel; two warps are therefore needed to process two pixels, so one block contains 64 threads. So that 32 threads cover all the disparity costs, each thread is designed to handle MAX_DISPARITY/32 disparities. The overall idea is similar to that of the cost computation: each block processes two rows of pixels, and a for loop in the block processes the other pixels of a row step by step. A one-dimensional grid is used, containing rows/2 blocks. Taking a 640 × 480 image as an example, the grid contains 480/2 = 240 blocks, each block contains 64 threads, and each thread handles the cost of 128/32 = 4 disparities.
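The launch dimensions described above can be checked with a few lines of Python. This is only an arithmetic sketch; the function name `aggregation_launch_config` and its parameter names are hypothetical, and it assumes that the image height and the maximum disparity divide evenly.

```python
def aggregation_launch_config(rows, max_disparity, warp_size=32, rows_per_block=2):
    """Derive grid and block sizes for the aggregation kernel (illustrative).

    Assumes one warp per pixel and one pixel from each of the block's rows
    being processed at a time, as described in the text.
    """
    if rows % rows_per_block != 0:
        raise ValueError("rows must be a multiple of rows_per_block")
    if max_disparity % warp_size != 0:
        raise ValueError("max_disparity must be a multiple of warp_size")
    return {
        # One-dimensional grid: one block per pair of rows.
        "blocks": rows // rows_per_block,
        # One warp per pixel, two pixels in flight per block.
        "threads_per_block": warp_size * rows_per_block,
        # Each thread of a warp covers an equal share of the disparity range.
        "disparities_per_thread": max_disparity // warp_size,
    }


config = aggregation_launch_config(rows=480, max_disparity=128)
```

For the 640 × 480 example with 128 disparities this gives 240 blocks of 64 threads, with 4 disparities per thread.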

According to the SGM algorithm, it is necessary to calculate the accumulated aggregation value $S(p, d)$, obtained by summing the cost aggregation values of all directions for pixel $p$ at disparity $d$, and $d^{*} = \arg\min_{d} S(p, d)$. The conventional method obtains the minimum by sorting the array with an ordinary sorting method; bubble sort, for example, requires on the order of $n \times (n-1)/2$ comparisons. A GPU can achieve higher efficiency: the butterfly ordering algorithm designed here markedly reduces the number of comparisons by using the __shfl_xor_sync instruction, which lets threads in the same warp share data with each other directly. First, data is exchanged between adjacent threads, and each thread compares its old value with the newly received value; this finds the minimum of each pair of adjacent threads and halves the amount of data. Next, data is exchanged between threads that are one thread apart, and this round of comparison finds the minimum among every 4 consecutive threads, again halving the amount of data. The comparison continues until the data of the 32 threads has been reduced to a single value. Because each shift halves the number of values still to be compared, the butterfly ordering operation needs only $\log_2 n$ rounds of comparison, which greatly reduces the number of comparisons. FIG. 3F is a schematic diagram of a butterfly ordering algorithm according to an embodiment of the invention.

Illustratively, the minimum matching cost among 128 disparities is determined as follows. Since one thread handles the matching costs of 4 disparities, each thread first finds the minimum of its own 4 values, which requires 3 comparisons in total. After this step the 128 values are reduced to 32 values that still need to be compared, held in 32 different threads. The exchange between adjacent threads (a lane mask of 1) leaves 16 candidate values. Then __shfl_xor_sync is applied to val with a lane mask of 2, that is, the value of thread 2 is delivered to thread 0, skipping one thread; comparing the value of thread 2 with the value of thread 0 gives the minimum over threads 0, 1, 2 and 3. The other threads are processed in the same way, so the number of val values to be compared is halved again and 8 val values are left. Similarly, after shifts of 4, 8 and 16, the values of all 32 threads have been compared and the minimum is obtained.
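The butterfly reduction walked through above can be emulated on the CPU in Python. In the sketch below a list stands in for the 32 lanes of a warp, and reading lane `lane ^ offset` stands in for the data exchange that __shfl_xor_sync performs on the GPU. The function name `butterfly_min` and the assignment of consecutive disparities to each lane are assumptions; the text does not state which disparities a thread owns.

```python
def butterfly_min(costs, warp_size=32):
    """Emulate a warp-level butterfly (XOR shuffle) minimum search.

    costs -- accumulated cost per disparity; its length must be a multiple of
             warp_size. Lane t is assumed to own a run of consecutive
             disparities (hypothetical mapping).
    Returns (min_cost, best_disparity, shuffle_rounds).
    """
    if warp_size <= 0 or warp_size & (warp_size - 1):
        raise ValueError("warp_size must be a power of two")
    per_lane = len(costs) // warp_size
    if per_lane == 0 or per_lane * warp_size != len(costs):
        raise ValueError("len(costs) must be a positive multiple of warp_size")

    # Step 1: every lane reduces its own disparities locally.
    lanes = []
    for lane in range(warp_size):
        start = lane * per_lane
        lanes.append(min((costs[d], d) for d in range(start, start + per_lane)))

    # Step 2: butterfly exchange with lane masks 1, 2, 4, ... warp_size / 2.
    rounds = 0
    offset = 1
    while offset < warp_size:
        # Each lane keeps the smaller of its own value and its partner's value.
        lanes = [min(lanes[lane], lanes[lane ^ offset])
                 for lane in range(warp_size)]
        offset <<= 1
        rounds += 1

    min_cost, best_disparity = lanes[0]
    return min_cost, best_disparity, rounds
```

Carrying the disparity index alongside the cost means the reduction yields the disparity itself, not only the minimum cost. With 128 costs and a warp of 32 lanes the loop runs 5 times, once for each of the lane masks 1, 2, 4, 8 and 16.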

In the embodiment of the invention, a series of deeply optimized GPU-based architecture design schemes is provided for the binocular vision matching algorithm. For the problem of finding the minimum matching cost in the SGM algorithm, a butterfly ordering algorithm is provided, which needs only $\log_2 n$ rounds of comparison over $n$ values to find the optimal disparity. The invention also redesigns the cost data structure so that data reuse is ensured to the greatest extent.

The embodiment of the invention provides a novel binocular vision image matching method based on a graphics processing unit (GPU), which can greatly improve operation efficiency and reduce processing time while keeping the matching precision unchanged. It improves the operation efficiency of the image matching algorithm in binocular vision and the real-time performance of binocular depth perception technology.

As shown in FIG. 4, which is a schematic diagram of an embodiment of a processing apparatus according to an embodiment of the present invention, the processing apparatus may include:

an obtaining module 401, configured to obtain first picture data and second picture data, where the first picture data and the second picture data are obtained by different cameras respectively;

a processing module 402, configured to perform cost calculation according to the first picture data and the second picture data, so as to obtain a cost value; synchronously carrying out cost aggregation calculation in the first direction, the second direction and the third direction according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction; and determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module 402 is specifically configured to perform cost calculation through blocks with different arrangements according to the first picture data and the second picture data, so as to obtain a cost value, where each thread in each block processes a pixel correspondingly.

Optionally, in some embodiments of the invention,

the processing module 402 is specifically configured to synchronously perform cost aggregation calculation in a first direction, a second direction, and a third direction according to the cost value through an SGM binocular image matching algorithm, so as to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction, and a cost aggregation value in the third direction; and according to the cost value, performing cost aggregation calculation in the fourth direction through an SGM binocular image matching algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module 402 is specifically configured to determine a cost value in a first direction, a cost value in a second direction, a cost value in a third direction, and a cost value in a fourth direction according to the cost value; according to the cost value of the first direction, the cost value of the second direction and the cost value of the third direction, synchronously performing cost aggregation calculation of the first direction, the second direction and the third direction through a butterfly sequencing algorithm to obtain a cost aggregation value of the first direction, a cost aggregation value of the second direction and a cost aggregation value of the third direction; and according to the cost value in the fourth direction, performing cost aggregation calculation in the fourth direction through a butterfly sequencing algorithm to obtain a cost aggregation value in the fourth direction.

Optionally, in some embodiments of the invention,

the processing module 402 is specifically configured to accumulate the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction, and the cost aggregation value in the fourth direction under different parallax values, to obtain accumulated aggregation values corresponding to the different parallax values; and determining the minimum value in the accumulated aggregate values corresponding to the different parallax values as the parallax value through a butterfly ordering algorithm.

As shown in fig. 5, which is a schematic diagram of an embodiment of a processing apparatus according to an embodiment of the present invention, the processing apparatus may include:

a transceiver 501, a processor 502, and a memory 503, wherein the transceiver 501, the processor 502, and the memory 503 are connected by a bus; it will be appreciated that the transceiver 501 may be an image capturer.

A memory 503 for storing operation instructions;

a transceiver 501, configured to acquire first picture data and second picture data, where the first picture data and the second picture data are acquired by different cameras respectively;

the processor 502 is configured to call the operation instruction, and perform the following steps:

Optionally, in some embodiments of the present invention, the processor 502 is configured to call the operation instruction, and perform the following steps:

accumulating the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction under the condition of different parallax values to obtain accumulated aggregation values corresponding to the different parallax values; and determining the minimum value in the accumulated aggregate values corresponding to the different parallax values as the parallax value through a butterfly ordering algorithm.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)).

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of a GPU-based binocular matching algorithm, comprising:

according to the first picture data and the second picture data, performing cost calculation through blocks with different arrangements to obtain a cost value, wherein each block processes one row of pixels, the other pixels in the row are processed step by step by a for loop in the block, and all disparities of each pixel are obtained by the threads at one time, that is, each grid is allocated H blocks, each block is allocated D threads, and the for loop in the block is repeated W times;

determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction;

synchronously performing cost aggregation calculation in a first direction, a second direction and a third direction according to the cost value to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction, performing cost aggregation calculation in a fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction, including:

According to the cost value in the fourth direction, performing cost aggregation calculation in the fourth direction through a butterfly sequencing algorithm to obtain a cost aggregation value in the fourth direction;

the determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction includes:

2. The method of claim 1, wherein the step of synchronously performing cost aggregation calculation in the first direction, the second direction, and the third direction according to the cost value to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction, and a cost aggregation value in the third direction includes:

3. A processing apparatus, comprising:

the processing module is used for carrying out cost calculation through blocks with different arrangements according to the first picture data and the second picture data to obtain cost values, wherein each block processes one row of pixels, other pixels in one row are processed step by using for loops in the blocks, all parallaxes of each pixel are obtained by threads at one time, namely each grid is distributed with H blocks, each block is distributed with D threads, and for loops in the blocks are repeated for W times; synchronously carrying out cost aggregation calculation in the first direction, the second direction and the third direction according to the cost values to obtain a cost aggregation value in the first direction, a cost aggregation value in the second direction and a cost aggregation value in the third direction; performing cost aggregation calculation in the fourth direction according to the cost value to obtain a cost aggregation value in the fourth direction; determining a disparity value according to the cost aggregation value in the first direction, the cost aggregation value in the second direction, the cost aggregation value in the third direction and the cost aggregation value in the fourth direction;

The processing module is specifically configured to determine a cost value in a first direction, a cost value in a second direction, a cost value in a third direction, and a cost value in a fourth direction according to the cost value; according to the cost value of the first direction, the cost value of the second direction and the cost value of the third direction, synchronously performing cost aggregation calculation of the first direction, the second direction and the third direction through a butterfly sequencing algorithm to obtain a cost aggregation value of the first direction, a cost aggregation value of the second direction and a cost aggregation value of the third direction; according to the cost value in the fourth direction, performing cost aggregation calculation in the fourth direction through a butterfly sequencing algorithm to obtain a cost aggregation value in the fourth direction;

4. A processing apparatus according to claim 3, wherein,

5. A processing apparatus, comprising:

the memory is used for storing operation instructions;

the processor is configured to invoke the operation instruction and perform the steps of the method of the GPU-based binocular matching algorithm according to claim 1 or 2.

6. A readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the method of the GPU-based binocular matching algorithm of claim 1 or 2.