CN116740161B - Binocular stereo matching aggregation method - Google Patents

Binocular stereo matching aggregation method

Info

Publication number
CN116740161B
CN116740161B
Authority
CN
China
Prior art keywords
scale
feature map
map
feature
parallax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311013015.7A
Other languages
Chinese (zh)
Other versions
CN116740161A (en)
Inventor
戴齐飞
曾鹏程
钱刃
杨文帮
赵勇
李福池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Aipeike Technology Co., Ltd.
Original Assignee
Dongguan Aipeike Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Aipeike Technology Co., Ltd.
Priority to CN202311013015.7A
Publication of CN116740161A
Application granted
Publication of CN116740161B
Legal status: Active

Classifications

    • G06T 7/55 — Depth or shape recovery from multiple images
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06V 10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52 — Scale-space analysis, e.g. wavelet analysis
    • G06V 10/757 — Matching configurations of points or features
    • G06V 10/82 — Arrangements for image or video recognition or understanding using neural networks
    • G06T 2207/20228 — Disparity calculation for image-based rendering

Abstract

An aggregation method for binocular stereo matching, relating to the field of stereo matching. The method comprises the following steps: performing feature extraction on the left view and the right view to generate a pyramid cost volume, and correspondingly determining a first scale feature map, a second scale feature map and a third scale feature map from the pyramid cost volume; performing rearrangement slicing on the first scale feature map and the second scale feature map respectively and then performing inter-scale aggregation to obtain a parallax feature map for the first scale feature map; performing rearrangement slicing on the second scale feature map and the third scale feature map respectively and then performing inter-scale aggregation to obtain a parallax feature map for the second scale feature map; performing intra-scale aggregation on the parallax feature maps of the first and second scale feature maps to generate the corresponding parallax prediction maps; performing intra-scale aggregation on the third scale feature map to generate its parallax prediction map; and generating a parallax map from the parallax prediction maps corresponding to the three scale feature maps.

Description

Binocular stereo matching aggregation method
Technical Field
The application relates to the field of stereo matching, in particular to an aggregation algorithm for binocular stereo matching.
Background
Binocular vision recovers depth information of a three-dimensional scene by computing the disparity between left and right views, and the introduction of neural networks has allowed binocular estimation to reach higher accuracy. However, current binocular stereo matching techniques face several limitations. On the one hand, accuracy and speed are difficult to balance: high-accuracy stereo matching algorithms adopt complex network structures with a large amount of redundant computation and cannot meet the real-time requirements of intelligent driving deployment, while real-time stereo matching algorithms are often degraded by ill-posed regions such as weakly textured and occluded areas and therefore lack accuracy. On the other hand, constrained by the limited scale of real intelligent-driving datasets and by the sensitivity of RGB cameras to illumination, stereo matching algorithms struggle with complex extreme scenes, so how to cope with domain shift and improve the generalization capability of the algorithm is an urgent problem.
Disclosure of Invention
The application mainly solves the technical problem of providing a binocular stereo matching aggregation method capable of identifying ill-posed regions more accurately.
According to a first aspect, in one embodiment, there is provided an aggregation method for binocular stereo matching, including:
feature extraction is carried out on the left view and the right view to generate a pyramid cost volume; the pyramid cost volume comprises a first resolution cost volume, a second resolution cost volume and a third resolution cost volume; wherein the resolution of the first resolution cost volume is greater than that of the second resolution cost volume, which in turn is greater than that of the third resolution cost volume;
determining a first scale feature map according to the first resolution cost volume, determining a second scale feature map according to the second resolution cost volume, and determining a third scale feature map according to the third resolution cost volume;
respectively carrying out rearrangement slicing on the first scale feature map and the second scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the first scale feature map;
respectively carrying out rearrangement slicing on the second scale feature map and the third scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the second scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate a parallax prediction map corresponding to the first scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map;
performing intra-scale aggregation on the third-scale feature map to generate a parallax prediction map corresponding to the third-scale feature map;
generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map.
In one embodiment, the first scale feature map comprises a 1/4 scale feature map, the second scale feature map comprises a 1/8 scale feature map, and the third scale feature map comprises a 1/16 scale feature map.
In an embodiment, after the first scale feature map and the second scale feature map are respectively rearranged and sliced, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the first scale feature map, which includes:
carrying out rearrangement slicing on the 1/4 scale feature map to obtain a 1/4 slice feature map;
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map;
re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map;
and aggregating the 1/4 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/4 scale feature map.
In an embodiment, after the second scale feature map and the third scale feature map are respectively rearranged and sliced, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the second scale feature map, which includes:
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
carrying out rearrangement slicing on the 1/16 scale feature map to obtain a 1/16 slice feature map;
aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map;
re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map;
and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8 scale feature map.
In one embodiment, the performing intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate the parallax prediction map corresponding to the first scale feature map includes:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map, so as to extract feature information of ill-posed regions at 1/4 scale;
fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
In an embodiment, the performing intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map includes:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale;
fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
In one embodiment, obtaining feature information of different levels of parallax feature maps corresponding to different scale feature maps includes:
acquiring the feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales by using an hourglass convolution.
In an embodiment, the performing intra-scale aggregation on the third scale feature map to generate a parallax prediction map corresponding to the third scale feature map includes:
acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution;
carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale;
fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
In an embodiment, the generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map, and the parallax prediction map corresponding to the third scale feature map includes:
calculating the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map by using softmax to generate the parallax map.
According to a second aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement an aggregation method of binocular stereo matching as described above.
According to the binocular stereo matching aggregation method and the computer-readable storage medium of the above embodiments, feature extraction is performed on the left view and the right view of a binocular camera to generate a pyramid cost volume, and the first scale feature map, the second scale feature map and the third scale feature map are determined from the pyramid cost volume. Inter-scale aggregation is performed on the first scale feature map and the second scale feature map to obtain the parallax feature map corresponding to the first scale feature map, and inter-scale aggregation is performed on the second scale feature map and the third scale feature map to obtain the parallax feature map corresponding to the second scale feature map. Then, intra-scale aggregation is performed on the parallax feature map corresponding to the first scale feature map to obtain the parallax prediction map corresponding to the first scale feature map, intra-scale aggregation is performed on the parallax feature map corresponding to the second scale feature map to obtain the parallax prediction map corresponding to the second scale feature map, and intra-scale aggregation is performed on the third scale feature map to obtain the parallax prediction map corresponding to the third scale feature map. Finally, a parallax map is generated from the three parallax prediction maps. By performing inter-scale aggregation on the scale feature maps, the application can integrate the detail information and semantic information of each scale feature map, perceive a global receptive field, and obtain more accurate and complete feature maps. Intra-scale aggregation is then performed at each scale to screen out the important ill-posed regions, thereby improving the performance of the whole binocular stereo matching.
Drawings
FIG. 1 is a general flow chart of an aggregation algorithm for binocular stereo matching of one embodiment;
FIG. 2 is a network structure used in a cost aggregation stage of a binocular stereo matching aggregation algorithm according to an embodiment;
FIG. 3 is a network structure diagram of inter-scale aggregation in a cost aggregation phase of one embodiment;
FIG. 4 is a sub-flowchart of step S200 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 5 is a sub-flowchart of step S210 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 6 is a sub-flowchart of step S220 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 7 is a diagram of a network structure for intra-scale aggregation in a cost aggregation phase of one embodiment;
FIG. 8 is a sub-flowchart of step S300 in the aggregation algorithm of binocular stereo matching of one embodiment;
FIG. 9 is a sub-flowchart of step S310 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 10 is a sub-flowchart of step S320 in the binocular stereo matching aggregation algorithm of one embodiment;
fig. 11 is a sub-flowchart of step S330 in the aggregation algorithm of binocular stereo matching according to one embodiment.
Detailed Description
The application will be described in further detail below with reference to the drawings by means of specific embodiments, in which like elements in different embodiments bear like reference numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that, in different situations, some of these features may be omitted or replaced by other elements, materials, or methods. In some instances, certain operations related to the application are not shown or described in the specification in order to avoid obscuring its core; a detailed description of these operations is unnecessary for a person skilled in the art, who can fully understand them from the remaining description and from general knowledge in the art.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning.
The application provides a binocular stereo matching aggregation algorithm comprising four parts: feature extraction, cost volume construction, cost aggregation, and parallax refinement. Please refer to fig. 1, which is a flowchart of the aggregation algorithm for binocular stereo matching; it specifically includes the following steps.
In the feature extraction and cost volume construction stage, step S100 is adopted: feature extraction is performed on the left view and the right view to generate a pyramid cost volume.
In some embodiments, feature extraction is performed on left and right views acquired by a binocular stereo matching camera to generate a pyramid cost volume comprising a first resolution cost volume, a second resolution cost volume, and a third resolution cost volume. And determining a first scale feature map according to the first resolution cost volume, determining a second scale feature map according to the second resolution cost volume, and determining a third scale feature map according to the third resolution cost volume.
In some embodiments, the pyramid cost volume includes 1/4, 1/8, and 1/16 resolutions, so that the corresponding generated first scale feature map is a 1/4 scale feature map, the second scale feature map is a 1/8 scale feature map, and the third scale feature map is a 1/16 scale feature map.
The pyramid cost volume in the application means that, in a convolutional neural network, multi-layer convolution operations are performed on the input left view and right view using convolution kernels of different scales, so as to obtain feature maps at different depth levels. These features are then combined into a pyramid shape and a cost volume operation is performed, so as to improve the convolutional neural network's ability to recognize changes of the target object in scale, angle, illumination and the like, and its recognition accuracy.
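Purely as an illustration (the patent itself gives no code), the following PyTorch sketch shows one common way such a pyramid cost volume can be assembled from left and right feature maps at 1/4, 1/8 and 1/16 resolution. The correlation-style matching cost, the function names and the per-level disparity ranges are assumptions, not the patent's exact construction.

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    # left_feat, right_feat: [B, C, H, W] features at one pyramid level.
    # Returns a correlation cost volume of shape [B, max_disp, H, W].
    B, C, H, W = left_feat.shape
    cost = left_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            # Shift the right features by d pixels and correlate the overlap.
            cost[:, d, :, d:] = (left_feat[:, :, :, d:]
                                 * right_feat[:, :, :, :-d]).mean(dim=1)
    return cost

def build_pyramid_cost_volume(left_feats, right_feats, full_max_disp=192):
    # One cost volume per pyramid level; the disparity range shrinks with
    # the resolution (1/4, 1/8 and 1/16 of the full-resolution range).
    volumes = []
    for scale, lf, rf in zip((4, 8, 16), left_feats, right_feats):
        volumes.append(build_cost_volume(lf, rf, full_max_disp // scale))
    return volumes
```

The three scale feature maps of the method would then be determined from these three volumes.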
In some embodiments, the feature maps correspond, from shallow to deep, to the 1/4 scale feature map, the 1/8 scale feature map and the 1/16 scale feature map, respectively. The shallow feature map is used to extract shallow features, which contain more pixel-level information, such as the colors, textures, edges and corners of the image. The deep feature map is used to extract deep features, which contain more semantic information.
In the cost aggregation stage (please refer to fig. 2 for the network structure adopted), inter-scale aggregation is first performed on the first scale feature map, the second scale feature map and the third scale feature map, and intra-scale aggregation is then performed on the results. These steps are developed in detail below.
Please refer to fig. 3, which is a network structure diagram of inter-scale aggregation, the following steps are adopted to perform inter-scale aggregation.
Step S200: inter-scale aggregation is performed on the first scale feature map, the second scale feature map and the third scale feature map.
In some embodiments, please refer to fig. 4, which is a specific flowchart of step S200, when performing step S200 to inter-scale aggregate the first scale feature map, the second scale feature map, and the third scale feature map, the method includes the following steps.
Step S210: and respectively carrying out rearrangement slicing on the first scale feature map and the second scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the first scale feature map.
In some embodiments, when the first scale feature map and the second scale feature map are respectively rearranged and sliced, at this time, a 1/4 scale feature map corresponding to the first scale feature map is a high scale feature map, and a 1/8 scale feature map corresponding to the second scale feature map is a low scale feature map.
In some embodiments, please refer to fig. 5, which is a specific flowchart of step S210, after performing step S210 to reorder and slice the first scale feature map and the second scale feature map, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the first scale feature map, which includes the following steps.
Step S211: and respectively carrying out rearrangement slicing on the 1/4 scale characteristic map and the 1/8 scale characteristic map to correspondingly obtain a 1/4 slice characteristic map and a 1/8 slice characteristic map.
The rearrangement slicing operation cuts and rearranges a higher-scale feature map according to a certain rule to obtain lower-scale feature maps; this operation can help the network learn multi-scale features better. In a convolutional neural network, feature maps of different scales carry different scale and semantic information, so the 1/4 scale feature map and the 1/8 scale feature map are each rearranged and sliced according to a set rule to obtain their respective scale and semantic information.
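As a hedged reading, the rearrangement slicing can be interpreted as a space-to-depth operation: pixels are cut into r×r interleaved slices, lowering the spatial scale without discarding information. The sketch below uses PyTorch's pixel_unshuffle for this; the exact "set rule" in the patent may differ.

```python
import torch
import torch.nn.functional as F

# One plausible reading of "rearrangement slicing": a space-to-depth cut
# that interleaves pixels into r*r slices, halving each spatial dimension
# for r=2 while keeping all information in the channel dimension.
def rearrange_slice(x, r=2):
    # [B, C, H, W] -> [B, C*r*r, H/r, W/r]
    return F.pixel_unshuffle(x, downscale_factor=r)

feat_quarter = torch.randn(1, 32, 96, 160)  # hypothetical 1/4-scale feature map
slices = rearrange_slice(feat_quarter)      # now on the 1/8-scale grid
print(slices.shape)                         # torch.Size([1, 128, 48, 80])
```

Under this reading, rearranging the 1/4 scale map with r=2 places it on the same grid as the 1/8 scale map, which is what makes the token-wise cross-scale aggregation of step S212 possible.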
Step S212: and aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map.
The 1/8 slice feature map is aggregated into the 1/4 slice feature map using the cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map. The cross-scale attention mechanism integrates feature map information of different scales, yielding a more complete and accurate high-scale feature map; this improves the recognition and localization of objects at different scales, and in turn the performance and accuracy of binocular stereo matching. Meanwhile, using the cross-scale attention mechanism at a lower scale effectively saves memory. Unlike common multi-scale fusion methods, the cross-scale attention mechanism can perceive a global receptive field, uses the low-scale feature map to guide the learning of the high-scale feature map, reduces computational complexity by exploiting matrix sparsity, and interleaves features to construct both long-range and short-range features.
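A minimal sketch of such a cross-scale attention step is given below, assuming the sliced high-scale map supplies the queries while the low-scale map supplies keys and values; the module name, the shared channel width and the residual connection are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    # Low-scale features guide the high-scale features: queries come from
    # the (sliced) high-scale map, keys/values from the low-scale map.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)  # 1x1 convolutions, per the description
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, high, low):
        B, C, H, W = high.shape
        q = self.q(high).flatten(2).transpose(1, 2)   # [B, HW, C]
        k = self.k(low).flatten(2).transpose(1, 2)    # [B, hw, C]
        v = self.v(low).flatten(2).transpose(1, 2)    # [B, hw, C]
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return out + high                             # residual aggregation
```

The self-attention of step S214 could reuse the same module with high and low set to the same feature map.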
Step S213: and re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map.
Step S214: and aggregating the 1/4 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/4-scale feature map.
The re-slicing operation results in lower scale feature maps, but these feature maps may lose some significant semantic information. Therefore, the feature map is restored through the self-attention mechanism, namely semantic information in the feature map is re-enhanced and integrated through the self-attention mechanism, so that a more accurate and complete feature map is obtained.
Step S220: and respectively carrying out rearrangement slicing on the second scale feature map and the third scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the second scale feature map.
In some embodiments, when the second scale feature map and the third scale feature map are respectively rearranged and sliced, at this time, a 1/8 scale feature map corresponding to the second scale feature map is a high scale feature map, and a 1/16 scale feature map corresponding to the third scale feature map is a low scale feature map.
In some embodiments, please refer to fig. 6, which is a specific flowchart of step S220, after performing step S220 to reorder and slice the second scale feature map and the third scale feature map, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the second scale feature map, which includes the following steps.
Step S221: and respectively carrying out rearrangement slicing on the 1/8 scale characteristic map and the 1/16 scale characteristic map to correspondingly obtain a 1/8 slice characteristic map and a 1/16 slice characteristic map.
The rearrangement slicing operation cuts and rearranges a higher-scale feature map according to a certain rule to obtain lower-scale feature maps; this operation can help the network learn multi-scale features better. In a convolutional neural network, feature maps of different scales carry different scale and semantic information, so the 1/8 scale feature map and the 1/16 scale feature map are each rearranged and sliced according to a set rule to obtain their respective scale and semantic information.
Step S222: and aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map.
The 1/16 slice feature map is aggregated into the 1/8 slice feature map using the cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map. The cross-scale attention mechanism integrates feature map information of different scales, yielding a more complete and accurate high-scale feature map; this improves the recognition and localization of objects at different scales, and in turn the performance and accuracy of binocular stereo matching. Meanwhile, using the cross-scale attention mechanism at a lower scale effectively saves memory. Unlike common multi-scale fusion methods, the cross-scale attention mechanism can perceive a global receptive field, uses the low-scale feature map to guide the learning of the high-scale feature map, reduces computational complexity by exploiting matrix sparsity, and interleaves features to construct both long-range and short-range features.
Step S223: and re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map.
Step S224: and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8-scale feature map.
The re-slicing operation results in lower scale feature maps, but these feature maps may lose some significant semantic information. Therefore, the feature map is restored through the self-attention mechanism, namely semantic information in the feature map is re-enhanced and integrated through the self-attention mechanism, so that a more accurate and complete feature map is obtained.
In some embodiments, the cross-scale attention mechanism and the self-attention mechanism are calculated using the following formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V
where Attention denotes the attention operation, Q denotes the query features, K denotes the key features and V denotes the value features (each obtained from the input features by a 1×1 convolution), d_k is the feature dimension, T denotes the matrix transpose, and softmax denotes the normalized exponential function.
Please refer to fig. 7, which is a diagram of an intra-scale aggregation network structure, and the following steps are adopted to perform the intra-scale aggregation.
Step S300: and carrying out intra-scale polymerization on the inter-scale polymerization results of the first scale feature map, the second scale feature map and the third scale feature map.
In some embodiments, please refer to fig. 8, which is a specific flowchart of step S300; when performing step S300 to carry out intra-scale aggregation on the first scale feature map, the second scale feature map and the third scale feature map, the method includes the following steps.
Step S310: and carrying out intra-scale aggregation on the parallax characteristic images corresponding to the first scale characteristic images to generate parallax prediction images corresponding to the first scale characteristic images.
In some embodiments, the parallax feature map corresponding to the first scale feature map is a parallax feature map corresponding to the 1/4 scale feature map.
In some embodiments, please refer to fig. 9, which is a specific flowchart of step S310; when performing step S310 to carry out intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate a parallax prediction map corresponding to the first scale feature map, the method includes the following steps.
Step S311: acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map.
In some embodiments, the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
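A minimal 2D sketch of such a gated hourglass block is shown below; real stereo networks often regularize a 4D cost volume with 3D convolutions, and the channel counts, depth and gate design here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GatedHourglass(nn.Module):
    # Top-down encoder-decoder over a feature map (H and W divisible by 4),
    # with a sigmoid spatial gate that adaptively emphasises error-prone
    # (ill-posed) regions.
    def __init__(self, c):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(c, 2 * c, 3, 2, 1), nn.ReLU(True))
        self.down2 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, 2, 1), nn.ReLU(True))
        self.up1 = nn.ConvTranspose2d(4 * c, 2 * c, 4, 2, 1)
        self.up2 = nn.ConvTranspose2d(2 * c, c, 4, 2, 1)
        self.gate = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        e1 = self.down1(x)                  # encode: reduce dimensions
        e2 = self.down2(e1)
        d1 = torch.relu(self.up1(e2) + e1)  # decode with skip connections
        d2 = torch.relu(self.up2(d1) + x)   # reconstruct original resolution
        return d2 * self.gate(d2)           # gated spatial attention
```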
Step S312: and carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature images corresponding to the 1/4 scale feature images so as to extract the feature information of the pathological region under the 1/4 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
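This pooling step can be sketched as the familiar CBAM-style spatial attention: pool across channels with both mean and max, then fit a single-channel weight map back onto the features. The 7×7 kernel and the exact arrangement are assumptions, not necessarily the patent's design.

```python
import torch
import torch.nn as nn

class PoolingSpatialAttention(nn.Module):
    # Mean- and max-pool over the channel dimension, then learn a spatial
    # weight map that highlights feature information of ill-posed regions.
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, feat):
        avg = feat.mean(dim=1, keepdim=True)         # mean pooling
        mx, _ = feat.max(dim=1, keepdim=True)        # maximum pooling
        weight = self.fuse(torch.cat([avg, mx], 1))  # [B, 1, H, W]
        return feat * weight                         # fit back onto the features
```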
Step S313: fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
Step S320: and carrying out intra-scale aggregation on the parallax characteristic images corresponding to the second scale characteristic images to generate parallax prediction images corresponding to the second scale characteristic images.
In some embodiments, the parallax feature map corresponding to the second scale feature map is a parallax feature map corresponding to the 1/8 scale feature map.
In some embodiments, please refer to fig. 10, which is a specific flowchart of step S320; when performing step S320 to carry out intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map, the method includes the following steps.
Step S321: acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map.
In some embodiments, the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
Step S322: carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
Step S323: fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
Step S330: and carrying out intra-scale aggregation on the third-scale feature map to generate a parallax prediction map corresponding to the third-scale feature map.
In some embodiments, the parallax feature map corresponding to the third scale feature map is a parallax feature map corresponding to the 1/16 scale feature map.
In some embodiments, please refer to fig. 11, which is a specific flowchart of step S330; when performing step S330 to carry out intra-scale aggregation on the third scale feature map to generate a parallax prediction map corresponding to the third scale feature map, the method includes the following steps.
Step S331: acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution.
In some embodiments, the feature information of different levels of the 1/16 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
Step S332: carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
Step S333: fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
In the parallax refinement stage, step S400 is employed: generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map.
In some embodiments, the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map are calculated using softmax to generate the parallax map. The parallax prediction map corresponding to the first scale feature map is the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the second scale feature map is the parallax prediction map corresponding to the 1/8 scale feature map, and the parallax prediction map corresponding to the third scale feature map is the parallax prediction map corresponding to the 1/16 scale feature map.
The parallax map is obtained using the following formula:
d̂ = Σ_{d=0}^{d_max−1} d · softmax(c_d)
where d̂ is the parallax map, d is a candidate parallax value in the cost volume, d_max is the upper limit of the parallax range, c_d is the aggregated matching cost at parallax d, the summation runs over the parallax channels, and softmax denotes the normalized exponential function applied along the parallax dimension.
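As a hedged illustration of this regression, the PyTorch sketch below implements the standard soft-argmax: softmax turns the aggregated costs into a per-pixel probability over candidate parallax values, and the expectation yields a sub-pixel parallax map. Whether the costs are negated before the softmax is a convention that varies between networks and is assumed away here.

```python
import torch

def disparity_regression(cost, d_max):
    # cost: [B, d_max, H, W] aggregated matching costs.
    prob = torch.softmax(cost, dim=1)                 # per-pixel distribution
    disp = torch.arange(d_max, device=cost.device,
                        dtype=cost.dtype).view(1, d_max, 1, 1)
    return (prob * disp).sum(dim=1)                   # [B, H, W] parallax map
```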
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disk, a hard disk and the like; the above functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the functions described above are realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and downloaded or copied into the memory of a local device (or installed as a system version update of the local device); the functions are likewise realized when the program in that memory is executed by a processor.
The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the application pertains, based on the idea of the application.

Claims (8)

1. A binocular stereo matching aggregation method, comprising:
feature extraction is carried out on the left view and the right view to generate a pyramid cost volume; the pyramid cost volume comprises a first resolution cost volume, a second resolution cost volume and a third resolution cost volume; the resolution of the first resolution cost volume is larger than that of the second resolution cost volume, and the resolution of the second resolution cost volume is larger than that of the third resolution cost volume;
determining a 1/4 scale feature map according to the first resolution cost volume, determining a 1/8 scale feature map according to the second resolution cost volume, and determining a 1/16 scale feature map according to the third resolution cost volume;
respectively carrying out rearrangement slicing on the 1/4 scale feature map and the 1/8 scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the 1/4 scale feature map;
the 1/4 scale feature map is rearranged and sliced to obtain a 1/4 slice feature map;
re-slicing the 1/8-scale feature map to obtain a 1/8-slice feature map;
aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map; the cross-scale attention mechanism is calculated according to key features, value features and query features, and the key features, the value features and the query features are determined according to 1*1 convolution;
re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map;
aggregating the 1/4 trans-scale aggregation slice feature images by using a self-attention mechanism to obtain parallax feature images corresponding to the 1/4-scale feature images;
respectively carrying out rearrangement slicing on the 1/8-scale feature map and the 1/16-scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the 1/8-scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the 1/4-scale feature map to generate a parallax prediction map corresponding to the 1/4-scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the 1/8-scale feature map to generate a parallax prediction map corresponding to the 1/8-scale feature map;
performing intra-scale aggregation on the 1/16-scale feature map to generate a parallax prediction map corresponding to the 1/16-scale feature map;
generating a parallax image according to the parallax prediction image corresponding to the 1/4 scale feature image, the parallax prediction image corresponding to the 1/8 scale feature image and the parallax prediction image corresponding to the 1/16 scale feature image.
2. The binocular stereo matching aggregation method according to claim 1, wherein performing rearrangement slicing on the 1/8 scale feature map and the 1/16 scale feature map respectively and then performing inter-scale aggregation to obtain the parallax feature map corresponding to the 1/8 scale feature map comprises:
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
carrying out rearrangement slicing on the 1/16 scale feature map to obtain a 1/16 slice feature map;
aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map;
re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map;
and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8-scale feature map.
3. The binocular stereo matching aggregation method according to claim 1, wherein performing intra-scale aggregation on the parallax feature map corresponding to the 1/4 scale feature map to generate the parallax prediction map corresponding to the 1/4 scale feature map comprises:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map, so as to extract feature information of ill-posed regions at 1/4 scale;
fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
4. The binocular stereo matching aggregation method according to claim 2, wherein performing intra-scale aggregation on the parallax feature map corresponding to the 1/8 scale feature map to generate the parallax prediction map corresponding to the 1/8 scale feature map comprises:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale;
fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
5. The binocular stereo matching aggregation method according to any one of claims 3 to 4, wherein obtaining feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales comprises:
acquiring the feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales by using an hourglass convolution.
6. The binocular stereo matching aggregation method according to claim 1, wherein the performing intra-scale aggregation on the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map comprises:
acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution;
carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale;
fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
7. The binocular stereo matching aggregation method according to claim 1, wherein the generating a parallax map from the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the 1/8 scale feature map, and the parallax prediction map corresponding to the 1/16 scale feature map comprises:
calculating the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the 1/8 scale feature map and the parallax prediction map corresponding to the 1/16 scale feature map by using softmax to generate the parallax map.
8. A computer readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the method of any of claims 1-7.
CN202311013015.7A 2023-08-14 2023-08-14 Binocular stereo matching aggregation method Active CN116740161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311013015.7A CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311013015.7A CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Publications (2)

Publication Number Publication Date
CN116740161A CN116740161A (en) 2023-09-12
CN116740161B true CN116740161B (en) 2023-11-28

Family

ID=87904710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311013015.7A Active CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Country Status (1)

Country Link
CN (1) CN116740161B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475182A * 2023-09-13 2024-01-30 Jiangnan University Stereo matching method based on multi-feature aggregation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160149088A * 2015-06-17 2016-12-27 Electronics and Telecommunications Research Institute Method and apparatus for detecting disparity by using hierarchical stereo matching
CN111508013A * 2020-04-21 2020-08-07 University of Science and Technology of China Stereo matching method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-scale cost aggregation for stereo matching; Zhang K. et al.; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014); pp. 1590-1597 *
Image captioning method based on cross-scale feature fusion self-attention; Wang Mingzhan et al.; Computer Science; Vol. 49, No. 10; pp. 191-197 *

Also Published As

Publication number Publication date
CN116740161A 2023-09-12

Similar Documents

Publication Publication Date Title
CN110110617B (en) Medical image segmentation method and device, electronic equipment and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111311592A (en) Three-dimensional medical image automatic segmentation method based on deep learning
CN111369581B (en) Image processing method, device, equipment and storage medium
CN112017189A (en) Image segmentation method and device, computer equipment and storage medium
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN108764039B (en) Neural network, building extraction method of remote sensing image, medium and computing equipment
CN116740161B (en) Binocular stereo matching aggregation method
US9330336B2 (en) Systems, methods, and media for on-line boosting of a classifier
CN111627024A (en) U-net improved kidney tumor segmentation method
CN111291825A (en) Focus classification model training method and device, computer equipment and storage medium
EP3836083B1 (en) Disparity estimation system and method, electronic device and computer program product
Choe et al. Urban structure classification using the 3D normal distribution transform for practical robot applications
US20230274400A1 (en) Automatically removing moving objects from video streams
CN112348819A (en) Model training method, image processing and registering method, and related device and equipment
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN115375692A (en) Workpiece surface defect segmentation method, device and equipment based on boundary guidance
CN115222954A (en) Weak perception target detection method and related equipment
CN110009641A (en) Crystalline lens dividing method, device and storage medium
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
CN116109822A (en) Organ image segmentation method and system based on multi-scale multi-view network
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN113468931B (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant