CN116740161B - Binocular stereo matching aggregation method - Google Patents

Binocular stereo matching aggregation method

Info

Publication number
CN116740161B
CN116740161B
Authority
CN
China
Prior art keywords
scale
feature map
map
feature
parallax
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311013015.7A
Other languages
Chinese (zh)
Other versions
CN116740161A (en)
Inventor
戴齐飞
曾鹏程
钱刃
杨文帮
赵勇
李福池
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Aipeike Technology Co., Ltd.
Original Assignee
Dongguan Aipeike Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongguan Aipeike Technology Co., Ltd.
Priority to CN202311013015.7A
Publication of CN116740161A
Application granted
Publication of CN116740161B
Legal status: Active

Classifications

    • G06T 7/55 — Depth or shape recovery from multiple images
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06V 10/454 — Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G06V 10/52 — Scale-space analysis, e.g. wavelet analysis
    • G06V 10/757 — Matching configurations of points or features
    • G06V 10/82 — Arrangements for image or video recognition or understanding using neural networks
    • G06T 2207/20228 — Disparity calculation for image-based rendering

Abstract

An aggregation method for binocular stereo matching, relating to the field of stereo matching. The method comprises the following steps: performing feature extraction on the left view and the right view to generate a pyramid cost volume, and correspondingly determining a first scale feature map, a second scale feature map and a third scale feature map from the pyramid cost volume; performing rearrangement slicing on the first scale feature map and the second scale feature map respectively and then performing inter-scale aggregation to obtain a parallax feature map for the first scale feature map; performing rearrangement slicing on the second scale feature map and the third scale feature map respectively and then performing inter-scale aggregation to obtain a parallax feature map for the second scale feature map; performing intra-scale aggregation on the parallax feature maps of the first and second scale feature maps to generate the corresponding parallax prediction maps; performing intra-scale aggregation on the third scale feature map to generate its parallax prediction map; and generating a parallax map from the parallax prediction maps corresponding to the three scale feature maps.

Description

Binocular stereo matching aggregation method
Technical Field
The application relates to the field of stereo matching, in particular to an aggregation algorithm for binocular stereo matching.
Background
Binocular vision recovers depth information of a three-dimensional scene by computing the disparity between left and right views, and the introduction of neural networks has allowed binocular estimation to reach higher accuracy. However, current binocular stereo matching techniques face several limitations. On the one hand, accuracy and speed are difficult to balance: high-accuracy stereo matching algorithms adopt complex network structures with a large amount of redundant computation and cannot meet the real-time requirements of intelligent driving deployment, while real-time stereo matching algorithms are often degraded by ill-posed regions such as weakly textured and occluded areas and therefore lack accuracy. On the other hand, constrained by the limited scale of real intelligent-driving datasets and by the sensitivity of RGB cameras to illumination, stereo matching algorithms struggle with complex extreme scenes, so how to cope with domain shift and improve the generalization capability of the algorithm is an urgent problem.
Disclosure of Invention
The application mainly solves the technical problem of providing a binocular stereo matching aggregation method capable of identifying ill-posed regions more accurately.
According to a first aspect, in one embodiment, there is provided an aggregation method for binocular stereo matching, including:
feature extraction is carried out on the left view and the right view to generate a pyramid cost volume; the pyramid cost volume comprises a first resolution cost volume, a second resolution cost volume and a third resolution cost volume; wherein the resolution of the first resolution cost volume is greater than that of the second resolution cost volume, which in turn is greater than that of the third resolution cost volume;
determining a first scale feature map according to the first resolution cost volume, determining a second scale feature map according to the second resolution cost volume, and determining a third scale feature map according to the third resolution cost volume;
respectively carrying out rearrangement slicing on the first scale feature map and the second scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the first scale feature map;
respectively carrying out rearrangement slicing on the second scale feature map and the third scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the second scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate a parallax prediction map corresponding to the first scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map;
performing intra-scale aggregation on the third-scale feature map to generate a parallax prediction map corresponding to the third-scale feature map;
generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map.
In one embodiment, the first scale feature map comprises a 1/4 scale feature map, the second scale feature map comprises a 1/8 scale feature map, and the third scale feature map comprises a 1/16 scale feature map.
In an embodiment, after the first scale feature map and the second scale feature map are respectively rearranged and sliced, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the first scale feature map, which includes:
carrying out rearrangement slicing on the 1/4 scale feature map to obtain a 1/4 slice feature map;
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map;
re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map;
and aggregating the 1/4 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/4 scale feature map.
In an embodiment, after the second scale feature map and the third scale feature map are respectively rearranged and sliced, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the second scale feature map, which includes:
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
carrying out rearrangement slicing on the 1/16 scale feature map to obtain a 1/16 slice feature map;
aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map;
re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map;
and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8 scale feature map.
In one embodiment, the performing intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate the parallax prediction map corresponding to the first scale feature map includes:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map, so as to extract feature information of ill-posed regions at 1/4 scale;
fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
In an embodiment, the performing intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map includes:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale;
fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
In one embodiment, obtaining feature information of different levels of parallax feature maps corresponding to different scale feature maps includes:
acquiring the feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales by using an hourglass convolution.
In an embodiment, the performing intra-scale aggregation on the third scale feature map to generate a parallax prediction map corresponding to the third scale feature map includes:
acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution;
carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale;
fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
In an embodiment, the generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map, and the parallax prediction map corresponding to the third scale feature map includes:
calculating the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map by using softmax to generate the parallax map.
According to a second aspect, an embodiment provides a computer readable storage medium having stored thereon a program executable by a processor to implement an aggregation method of binocular stereo matching as described above.
According to the binocular stereo matching aggregation method and the computer-readable storage medium of the above embodiments, feature extraction is performed on the left view and the right view of a binocular camera to generate a pyramid cost volume, and the first scale feature map, the second scale feature map and the third scale feature map are determined from the pyramid cost volume. Inter-scale aggregation is performed on the first scale feature map and the second scale feature map to obtain the parallax feature map corresponding to the first scale feature map, and inter-scale aggregation is performed on the second scale feature map and the third scale feature map to obtain the parallax feature map corresponding to the second scale feature map. Then, intra-scale aggregation is performed on the parallax feature map corresponding to the first scale feature map to obtain the parallax prediction map corresponding to the first scale feature map, intra-scale aggregation is performed on the parallax feature map corresponding to the second scale feature map to obtain the parallax prediction map corresponding to the second scale feature map, and intra-scale aggregation is performed on the third scale feature map to obtain the parallax prediction map corresponding to the third scale feature map. Finally, a parallax map is generated from the three parallax prediction maps. By performing inter-scale aggregation on the scale feature maps, the application can integrate the detail information and semantic information of each scale feature map, perceive a global receptive field, and obtain more accurate and complete feature maps. Intra-scale aggregation is then performed at each scale to screen out the important ill-posed regions, thereby improving the performance of the whole binocular stereo matching.
Drawings
FIG. 1 is a general flow chart of an aggregation algorithm for binocular stereo matching of one embodiment;
FIG. 2 is a network structure used in a cost aggregation stage of a binocular stereo matching aggregation algorithm according to an embodiment;
FIG. 3 is a network structure diagram of inter-scale aggregation in a cost aggregation phase of one embodiment;
FIG. 4 is a sub-flowchart of step S200 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 5 is a sub-flowchart of step S210 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 6 is a sub-flowchart of step S220 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 7 is a diagram of a network structure for intra-scale aggregation in a cost aggregation phase of one embodiment;
FIG. 8 is a sub-flowchart of step S300 in the aggregation algorithm of binocular stereo matching of one embodiment;
FIG. 9 is a sub-flowchart of step S310 in the binocular stereo matching aggregation algorithm of one embodiment;
FIG. 10 is a sub-flowchart of step S320 in the binocular stereo matching aggregation algorithm of one embodiment;
fig. 11 is a sub-flowchart of step S330 in the aggregation algorithm of binocular stereo matching according to one embodiment.
Detailed Description
The application will be described in further detail below with reference to the drawings by means of specific embodiments, in which like elements in different embodiments bear like reference numerals. In the following embodiments, numerous specific details are set forth in order to provide a better understanding of the present application. However, one skilled in the art will readily recognize that, in different situations, some of these features may be omitted or replaced by other elements, materials, or methods. In some instances, certain operations related to the application are not shown or described in the specification in order to avoid obscuring its core; a detailed description of these operations is unnecessary for a person skilled in the art, who can fully understand them from the remaining description and from general knowledge in the art.
Furthermore, the described features, operations, or characteristics of the description may be combined in any suitable manner in various embodiments. Also, various steps or acts in the method descriptions may be interchanged or modified in a manner apparent to those of ordinary skill in the art. Thus, the various orders in the description and drawings are for clarity of description of only certain embodiments, and are not meant to be required orders unless otherwise indicated.
The numbering of the components itself, e.g. "first", "second", etc., is used herein merely to distinguish between the described objects and does not have any sequential or technical meaning.
The application provides a binocular stereo matching aggregation algorithm comprising four parts: feature extraction, cost volume construction, cost aggregation, and parallax refinement. Please refer to fig. 1, which is a flowchart of the aggregation algorithm for binocular stereo matching; it specifically includes the following steps.
In the feature extraction and cost volume construction stage, step S100 is adopted: feature extraction is performed on the left view and the right view to generate a pyramid cost volume.
In some embodiments, feature extraction is performed on left and right views acquired by a binocular stereo matching camera to generate a pyramid cost volume comprising a first resolution cost volume, a second resolution cost volume, and a third resolution cost volume. And determining a first scale feature map according to the first resolution cost volume, determining a second scale feature map according to the second resolution cost volume, and determining a third scale feature map according to the third resolution cost volume.
In some embodiments, the pyramid cost volume includes 1/4, 1/8, and 1/16 resolutions, so that the corresponding generated first scale feature map is a 1/4 scale feature map, the second scale feature map is a 1/8 scale feature map, and the third scale feature map is a 1/16 scale feature map.
The pyramid cost volume in the application means that, in a convolutional neural network, multi-layer convolution operations are performed on the input left view and right view using convolution kernels of different scales, so as to obtain feature maps at different depth levels. These features are then combined into a pyramid shape and a cost volume operation is performed, so as to improve the convolutional neural network's ability to recognize changes of the target object in scale, angle, illumination and the like, and its recognition accuracy.
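Purely as an illustration (the patent itself gives no code), the following PyTorch sketch shows one common way such a pyramid cost volume can be assembled from left and right feature maps at 1/4, 1/8 and 1/16 resolution. The correlation-style matching cost, the function names and the per-level disparity ranges are assumptions, not the patent's exact construction.

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp):
    # left_feat, right_feat: [B, C, H, W] features at one pyramid level.
    # Returns a correlation cost volume of shape [B, max_disp, H, W].
    B, C, H, W = left_feat.shape
    cost = left_feat.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            cost[:, d] = (left_feat * right_feat).mean(dim=1)
        else:
            # Shift the right features by d pixels and correlate the overlap.
            cost[:, d, :, d:] = (left_feat[:, :, :, d:]
                                 * right_feat[:, :, :, :-d]).mean(dim=1)
    return cost

def build_pyramid_cost_volume(left_feats, right_feats, full_max_disp=192):
    # One cost volume per pyramid level; the disparity range shrinks with
    # the resolution (1/4, 1/8 and 1/16 of the full-resolution range).
    volumes = []
    for scale, lf, rf in zip((4, 8, 16), left_feats, right_feats):
        volumes.append(build_cost_volume(lf, rf, full_max_disp // scale))
    return volumes
```

The three scale feature maps of the method would then be determined from these three volumes.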
In some embodiments, the feature maps correspond, from shallow to deep, to the 1/4 scale feature map, the 1/8 scale feature map and the 1/16 scale feature map, respectively. The shallow feature map is used to extract shallow features, which contain more pixel-level information, such as the colors, textures, edges and corners of the image. The deep feature map is used to extract deep features, which contain more semantic information.
In the cost aggregation stage (please refer to fig. 2 for the network structure adopted), inter-scale aggregation is first performed on the first scale feature map, the second scale feature map and the third scale feature map, and intra-scale aggregation is then performed on the results. These steps are developed in detail below.
Please refer to fig. 3, which is a network structure diagram of inter-scale aggregation, the following steps are adopted to perform inter-scale aggregation.
Step S200: inter-scale aggregation is performed on the first scale feature map, the second scale feature map and the third scale feature map.
In some embodiments, please refer to fig. 4, which is a specific flowchart of step S200, when performing step S200 to inter-scale aggregate the first scale feature map, the second scale feature map, and the third scale feature map, the method includes the following steps.
Step S210: and respectively carrying out rearrangement slicing on the first scale feature map and the second scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the first scale feature map.
In some embodiments, when the first scale feature map and the second scale feature map are respectively rearranged and sliced, at this time, a 1/4 scale feature map corresponding to the first scale feature map is a high scale feature map, and a 1/8 scale feature map corresponding to the second scale feature map is a low scale feature map.
In some embodiments, please refer to fig. 5, which is a specific flowchart of step S210, after performing step S210 to reorder and slice the first scale feature map and the second scale feature map, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the first scale feature map, which includes the following steps.
Step S211: and respectively carrying out rearrangement slicing on the 1/4 scale characteristic map and the 1/8 scale characteristic map to correspondingly obtain a 1/4 slice characteristic map and a 1/8 slice characteristic map.
The rearrangement slicing operation cuts and rearranges a higher-scale feature map according to a certain rule to obtain lower-scale feature maps; this operation can help the network learn multi-scale features better. In a convolutional neural network, feature maps of different scales carry different scale and semantic information, so the 1/4 scale feature map and the 1/8 scale feature map are each rearranged and sliced according to a set rule to obtain their respective scale and semantic information.
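As a hedged reading, the rearrangement slicing can be interpreted as a space-to-depth operation: pixels are cut into r×r interleaved slices, lowering the spatial scale without discarding information. The sketch below uses PyTorch's pixel_unshuffle for this; the exact "set rule" in the patent may differ.

```python
import torch
import torch.nn.functional as F

# One plausible reading of "rearrangement slicing": a space-to-depth cut
# that interleaves pixels into r*r slices, halving each spatial dimension
# for r=2 while keeping all information in the channel dimension.
def rearrange_slice(x, r=2):
    # [B, C, H, W] -> [B, C*r*r, H/r, W/r]
    return F.pixel_unshuffle(x, downscale_factor=r)

feat_quarter = torch.randn(1, 32, 96, 160)  # hypothetical 1/4-scale feature map
slices = rearrange_slice(feat_quarter)      # now on the 1/8-scale grid
print(slices.shape)                         # torch.Size([1, 128, 48, 80])
```

Under this reading, rearranging the 1/4 scale map with r=2 places it on the same grid as the 1/8 scale map, which is what makes the token-wise cross-scale aggregation of step S212 possible.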
Step S212: and aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map.
The 1/8 slice feature map is aggregated into the 1/4 slice feature map using the cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map. The cross-scale attention mechanism integrates feature map information of different scales, yielding a more complete and accurate high-scale feature map; this improves the recognition and localization of objects at different scales, and in turn the performance and accuracy of binocular stereo matching. Meanwhile, using the cross-scale attention mechanism at a lower scale effectively saves memory. Unlike common multi-scale fusion methods, the cross-scale attention mechanism can perceive a global receptive field, uses the low-scale feature map to guide the learning of the high-scale feature map, reduces computational complexity by exploiting matrix sparsity, and interleaves features to construct both long-range and short-range features.
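A minimal sketch of such a cross-scale attention step is given below, assuming the sliced high-scale map supplies the queries while the low-scale map supplies keys and values; the module name, the shared channel width and the residual connection are illustrative assumptions rather than the patent's exact design.

```python
import torch
import torch.nn as nn

class CrossScaleAttention(nn.Module):
    # Low-scale features guide the high-scale features: queries come from
    # the (sliced) high-scale map, keys/values from the low-scale map.
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Conv2d(dim, dim, 1)  # 1x1 convolutions, per the description
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)

    def forward(self, high, low):
        B, C, H, W = high.shape
        q = self.q(high).flatten(2).transpose(1, 2)   # [B, HW, C]
        k = self.k(low).flatten(2).transpose(1, 2)    # [B, hw, C]
        v = self.v(low).flatten(2).transpose(1, 2)    # [B, hw, C]
        attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
        return out + high                             # residual aggregation
```

The self-attention of step S214 could reuse the same module with high and low set to the same feature map.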
Step S213: and re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map.
Step S214: and aggregating the 1/4 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/4-scale feature map.
The re-slicing operation results in lower scale feature maps, but these feature maps may lose some significant semantic information. Therefore, the feature map is restored through the self-attention mechanism, namely semantic information in the feature map is re-enhanced and integrated through the self-attention mechanism, so that a more accurate and complete feature map is obtained.
Step S220: and respectively carrying out rearrangement slicing on the second scale feature map and the third scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the second scale feature map.
In some embodiments, when the second scale feature map and the third scale feature map are respectively rearranged and sliced, at this time, a 1/8 scale feature map corresponding to the second scale feature map is a high scale feature map, and a 1/16 scale feature map corresponding to the third scale feature map is a low scale feature map.
In some embodiments, please refer to fig. 6, which is a specific flowchart of step S220, after performing step S220 to reorder and slice the second scale feature map and the third scale feature map, inter-scale aggregation is performed to obtain a parallax feature map corresponding to the second scale feature map, which includes the following steps.
Step S221: and respectively carrying out rearrangement slicing on the 1/8 scale characteristic map and the 1/16 scale characteristic map to correspondingly obtain a 1/8 slice characteristic map and a 1/16 slice characteristic map.
The rearrangement slicing operation cuts and rearranges a higher-scale feature map according to a certain rule to obtain lower-scale feature maps; this operation can help the network learn multi-scale features better. In a convolutional neural network, feature maps of different scales carry different scale and semantic information, so the 1/8 scale feature map and the 1/16 scale feature map are each rearranged and sliced according to a set rule to obtain their respective scale and semantic information.
Step S222: and aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map.
The 1/16 slice feature map is aggregated into the 1/8 slice feature map using the cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map. The cross-scale attention mechanism integrates feature map information of different scales, yielding a more complete and accurate high-scale feature map; this improves the recognition and localization of objects at different scales, and in turn the performance and accuracy of binocular stereo matching. Meanwhile, using the cross-scale attention mechanism at a lower scale effectively saves memory. Unlike common multi-scale fusion methods, the cross-scale attention mechanism can perceive a global receptive field, uses the low-scale feature map to guide the learning of the high-scale feature map, reduces computational complexity by exploiting matrix sparsity, and interleaves features to construct both long-range and short-range features.
Step S223: and re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map.
Step S224: and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8-scale feature map.
The re-slicing operation results in lower scale feature maps, but these feature maps may lose some significant semantic information. Therefore, the feature map is restored through the self-attention mechanism, namely semantic information in the feature map is re-enhanced and integrated through the self-attention mechanism, so that a more accurate and complete feature map is obtained.
In some embodiments, the cross-scale attention mechanism and the self-attention mechanism are calculated using the following formula:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V
where Attention denotes the attention operation, Q denotes the query features, K denotes the key features and V denotes the value features (each obtained from the input features by a 1×1 convolution), d_k is the feature dimension, T denotes the matrix transpose, and softmax denotes the normalized exponential function.
Please refer to fig. 7, which is a diagram of an intra-scale aggregation network structure, and the following steps are adopted to perform the intra-scale aggregation.
Step S300: and carrying out intra-scale polymerization on the inter-scale polymerization results of the first scale feature map, the second scale feature map and the third scale feature map.
In some embodiments, please refer to fig. 8, which is a specific flowchart of step S300; when performing step S300 to carry out intra-scale aggregation on the first scale feature map, the second scale feature map and the third scale feature map, the method includes the following steps.
Step S310: and carrying out intra-scale aggregation on the parallax characteristic images corresponding to the first scale characteristic images to generate parallax prediction images corresponding to the first scale characteristic images.
In some embodiments, the parallax feature map corresponding to the first scale feature map is a parallax feature map corresponding to the 1/4 scale feature map.
In some embodiments, please refer to fig. 9, which is a specific flowchart of step S310; when performing step S310 to carry out intra-scale aggregation on the parallax feature map corresponding to the first scale feature map to generate a parallax prediction map corresponding to the first scale feature map, the method includes the following steps.
Step S311: acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map.
In some embodiments, the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
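A minimal 2D sketch of such a gated hourglass block is shown below; real stereo networks often regularize a 4D cost volume with 3D convolutions, and the channel counts, depth and gate design here are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GatedHourglass(nn.Module):
    # Top-down encoder-decoder over a feature map (H and W divisible by 4),
    # with a sigmoid spatial gate that adaptively emphasises error-prone
    # (ill-posed) regions.
    def __init__(self, c):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(c, 2 * c, 3, 2, 1), nn.ReLU(True))
        self.down2 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, 2, 1), nn.ReLU(True))
        self.up1 = nn.ConvTranspose2d(4 * c, 2 * c, 4, 2, 1)
        self.up2 = nn.ConvTranspose2d(2 * c, c, 4, 2, 1)
        self.gate = nn.Sequential(nn.Conv2d(c, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        e1 = self.down1(x)                  # encode: reduce dimensions
        e2 = self.down2(e1)
        d1 = torch.relu(self.up1(e2) + e1)  # decode with skip connections
        d2 = torch.relu(self.up2(d1) + x)   # reconstruct original resolution
        return d2 * self.gate(d2)           # gated spatial attention
```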
Step S312: and carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature images corresponding to the 1/4 scale feature images so as to extract the feature information of the pathological region under the 1/4 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
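This pooling step can be sketched as the familiar CBAM-style spatial attention: pool across channels with both mean and max, then fit a single-channel weight map back onto the features. The 7×7 kernel and the exact arrangement are assumptions, not necessarily the patent's design.

```python
import torch
import torch.nn as nn

class PoolingSpatialAttention(nn.Module):
    # Mean- and max-pool over the channel dimension, then learn a spatial
    # weight map that highlights feature information of ill-posed regions.
    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, feat):
        avg = feat.mean(dim=1, keepdim=True)         # mean pooling
        mx, _ = feat.max(dim=1, keepdim=True)        # maximum pooling
        weight = self.fuse(torch.cat([avg, mx], 1))  # [B, 1, H, W]
        return feat * weight                         # fit back onto the features
```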
Step S313: fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
Step S320: and carrying out intra-scale aggregation on the parallax characteristic images corresponding to the second scale characteristic images to generate parallax prediction images corresponding to the second scale characteristic images.
In some embodiments, the parallax feature map corresponding to the second scale feature map is a parallax feature map corresponding to the 1/8 scale feature map.
In some embodiments, please refer to fig. 10, which is a specific flowchart of step S320; when performing step S320 to carry out intra-scale aggregation on the parallax feature map corresponding to the second scale feature map to generate a parallax prediction map corresponding to the second scale feature map, the method includes the following steps.
Step S321: acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map.
In some embodiments, the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
Step S322: carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
Step S323: fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
Step S330: and carrying out intra-scale aggregation on the third-scale feature map to generate a parallax prediction map corresponding to the third-scale feature map.
In some embodiments, the parallax feature map corresponding to the third scale feature map is a parallax feature map corresponding to the 1/16 scale feature map.
In some embodiments, please refer to fig. 11, which is a specific flowchart of step S330; when performing step S330 to carry out intra-scale aggregation on the third scale feature map to generate a parallax prediction map corresponding to the third scale feature map, the method includes the following steps.
Step S331: acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution.
In some embodiments, the feature information of different levels of the 1/16 scale feature map is obtained by using an hourglass convolution. The hourglass convolution regularizes the cost volume with a top-down encoding-decoding structure, and a gated spatial attention mechanism is added to adaptively attend to important information in different regions, in particular the ill-posed regions that are error-prone during stereo matching. In intra-scale aggregation, the encoding-decoding structure refers to the processes of feature extraction and dimension reduction followed by feature reconstruction and dimension expansion; the top-down encoding-decoding structure encodes from high-level features to low-level features and decodes from low-level features back to high-level features. This structure allows the model to make better use of feature information at different levels, and important feature information is screened out through the gating mechanism, improving the accuracy of target detection and recognition. The gated spatial attention mechanism thus helps the model screen out important feature information and improves model performance.
Step S332: carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale.
In some embodiments, the hourglass convolution can extract high-frequency features at different scales, while maximum pooling and mean pooling screen out which features are valid. Maximum pooling selects the maximum value as representative and mean pooling selects the average value as representative; this reduces the feature dimension while retaining important information, which improves the efficiency and accuracy of the model and also helps avoid problems such as overfitting.
Step S333: fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
In the parallax refinement stage, step S400 is employed: generating a parallax map according to the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map.
In some embodiments, the parallax prediction map corresponding to the first scale feature map, the parallax prediction map corresponding to the second scale feature map and the parallax prediction map corresponding to the third scale feature map are calculated using softmax to generate the parallax map. The parallax prediction map corresponding to the first scale feature map is the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the second scale feature map is the parallax prediction map corresponding to the 1/8 scale feature map, and the parallax prediction map corresponding to the third scale feature map is the parallax prediction map corresponding to the 1/16 scale feature map.
The parallax map is obtained using the following formula:
d̂ = Σ_{d=0}^{d_max−1} d · softmax(c_d)
where d̂ is the parallax map, d is a candidate parallax value in the cost volume, d_max is the upper limit of the parallax range, c_d is the aggregated matching cost at parallax d, the summation runs over the parallax channels, and softmax denotes the normalized exponential function applied along the parallax dimension.
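As a hedged illustration of this regression, the PyTorch sketch below implements the standard soft-argmax: softmax turns the aggregated costs into a per-pixel probability over candidate parallax values, and the expectation yields a sub-pixel parallax map. Whether the costs are negated before the softmax is a convention that varies between networks and is assumed away here.

```python
import torch

def disparity_regression(cost, d_max):
    # cost: [B, d_max, H, W] aggregated matching costs.
    prob = torch.softmax(cost, dim=1)                 # per-pixel distribution
    disp = torch.arange(d_max, device=cost.device,
                        dtype=cost.dtype).view(1, d_max, 1, 1)
    return (prob * disp).sum(dim=1)                   # [B, H, W] parallax map
```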
Those skilled in the art will appreciate that all or part of the functions of the various methods in the above embodiments may be implemented by hardware or by a computer program. When all or part of the functions are implemented by means of a computer program, the program may be stored in a computer-readable storage medium, which may include a read-only memory, a random access memory, a magnetic disk, an optical disk, a hard disk and the like; the above functions are realized when the program is executed by a computer. For example, the program may be stored in the memory of a device, and all or part of the functions described above are realized when the program in the memory is executed by a processor. The program may also be stored in a storage medium such as a server, another computer, a magnetic disk, an optical disk, a flash disk or a removable hard disk, and downloaded or copied into the memory of a local device (or installed as a system version update of the local device); the functions are likewise realized when the program in that memory is executed by a processor.
The foregoing description of the application has been presented for purposes of illustration and description, and is not intended to be limiting. Several simple deductions, modifications or substitutions may also be made by a person skilled in the art to which the application pertains, based on the idea of the application.

Claims (8)

1. A binocular stereo matching aggregation method, comprising:
feature extraction is carried out on the left view and the right view to generate a pyramid cost volume; the pyramid cost volume comprises a first resolution cost volume, a second resolution cost volume and a third resolution cost volume; the resolution of the first resolution cost volume is larger than that of the second resolution cost volume, and the resolution of the second resolution cost volume is larger than that of the third resolution cost volume;
determining a 1/4 scale feature map according to the first resolution cost volume, determining a 1/8 scale feature map according to the second resolution cost volume, and determining a 1/16 scale feature map according to the third resolution cost volume;
respectively carrying out rearrangement slicing on the 1/4 scale feature map and the 1/8 scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the 1/4 scale feature map;
the 1/4 scale feature map is rearranged and sliced to obtain a 1/4 slice feature map;
re-slicing the 1/8-scale feature map to obtain a 1/8-slice feature map;
aggregating the 1/4 slice feature map and the 1/8 slice feature map by using a cross-scale attention mechanism to obtain a 1/4 cross-scale aggregation feature map; the cross-scale attention mechanism is calculated according to key features, value features and query features, and the key features, the value features and the query features are determined according to 1*1 convolution;
re-slicing the 1/4 cross-scale aggregation feature map to obtain a 1/4 cross-scale aggregation slice feature map;
aggregating the 1/4 trans-scale aggregation slice feature images by using a self-attention mechanism to obtain parallax feature images corresponding to the 1/4-scale feature images;
respectively carrying out rearrangement slicing on the 1/8-scale feature map and the 1/16-scale feature map, and then carrying out inter-scale aggregation to obtain a parallax feature map corresponding to the 1/8-scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the 1/4-scale feature map to generate a parallax prediction map corresponding to the 1/4-scale feature map;
performing intra-scale aggregation on the parallax feature map corresponding to the 1/8-scale feature map to generate a parallax prediction map corresponding to the 1/8-scale feature map;
performing intra-scale aggregation on the 1/16-scale feature map to generate a parallax prediction map corresponding to the 1/16-scale feature map;
generating a parallax image according to the parallax prediction image corresponding to the 1/4 scale feature image, the parallax prediction image corresponding to the 1/8 scale feature image and the parallax prediction image corresponding to the 1/16 scale feature image.
2. The binocular stereo matching aggregation method according to claim 1, wherein performing rearrangement slicing on the 1/8 scale feature map and the 1/16 scale feature map respectively and then performing inter-scale aggregation to obtain the parallax feature map corresponding to the 1/8 scale feature map comprises:
carrying out rearrangement slicing on the 1/8 scale feature map to obtain a 1/8 slice feature map;
carrying out rearrangement slicing on the 1/16 scale feature map to obtain a 1/16 slice feature map;
aggregating the 1/8 slice feature map and the 1/16 slice feature map by using a cross-scale attention mechanism to obtain a 1/8 cross-scale aggregation feature map;
re-slicing the 1/8 cross-scale aggregation feature map to obtain a 1/8 cross-scale aggregation slice feature map;
and aggregating the 1/8 cross-scale aggregation slice feature map by using a self-attention mechanism to obtain a parallax feature map corresponding to the 1/8-scale feature map.
3. The binocular stereo matching aggregation method according to claim 1, wherein performing intra-scale aggregation on the parallax feature map corresponding to the 1/4 scale feature map to generate the parallax prediction map corresponding to the 1/4 scale feature map comprises:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/4 scale feature map, so as to extract feature information of ill-posed regions at 1/4 scale;
fitting the feature information of the ill-posed regions at 1/4 scale to the parallax feature map corresponding to the 1/4 scale feature map to generate a parallax prediction map corresponding to the 1/4 scale feature map.
4. The binocular stereo matching aggregation method according to claim 2, wherein performing intra-scale aggregation on the parallax feature map corresponding to the 1/8 scale feature map to generate the parallax prediction map corresponding to the 1/8 scale feature map comprises:
acquiring feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map;
carrying out mean pooling and maximum pooling on the feature information of different levels of the parallax feature map corresponding to the 1/8 scale feature map, so as to extract feature information of ill-posed regions at 1/8 scale;
fitting the feature information of the ill-posed regions at 1/8 scale to the parallax feature map corresponding to the 1/8 scale feature map to generate a parallax prediction map corresponding to the 1/8 scale feature map.
5. The binocular stereo matching aggregation method according to any one of claims 3 to 4, wherein obtaining feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales comprises:
acquiring the feature information of different levels of the parallax feature maps corresponding to the feature maps of different scales by using an hourglass convolution.
6. The binocular stereo matching aggregation method according to claim 1, wherein the performing intra-scale aggregation on the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map comprises:
acquiring feature information of different levels of the 1/16 scale feature map by using an hourglass convolution;
carrying out mean pooling and maximum pooling on the feature information of different levels of the 1/16 scale feature map, so as to extract feature information of ill-posed regions at 1/16 scale;
fitting the feature information of the ill-posed regions at 1/16 scale to the parallax feature map corresponding to the 1/16 scale feature map to generate a parallax prediction map corresponding to the 1/16 scale feature map.
7. The binocular stereo matching aggregation method according to claim 1, wherein the generating a parallax map from the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the 1/8 scale feature map, and the parallax prediction map corresponding to the 1/16 scale feature map comprises:
calculating the parallax prediction map corresponding to the 1/4 scale feature map, the parallax prediction map corresponding to the 1/8 scale feature map and the parallax prediction map corresponding to the 1/16 scale feature map by using softmax to generate the parallax map.
8. A computer readable storage medium, characterized in that the medium has stored thereon a program executable by a processor to implement the method of any of claims 1-7.
CN202311013015.7A 2023-08-14 2023-08-14 Binocular stereo matching aggregation method Active CN116740161B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311013015.7A CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311013015.7A CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Publications (2)

Publication Number Publication Date
CN116740161A CN116740161A (en) 2023-09-12
CN116740161B true CN116740161B (en) 2023-11-28

Family

ID=87904710

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311013015.7A Active CN116740161B (en) 2023-08-14 2023-08-14 Binocular stereo matching aggregation method

Country Status (1)

Country Link
CN (1) CN116740161B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117475182A * 2023-09-13 2024-01-30 Jiangnan University Stereo matching method based on multi-feature aggregation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20160149088A * 2015-06-17 2016-12-27 Electronics and Telecommunications Research Institute Method and apparatus for detecting disparity by using hierarchical stereo matching
CN111508013A * 2020-04-21 2020-08-07 University of Science and Technology of China Stereo matching method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cross-scale cost aggregation for stereo matching; Zhang K. et al.; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014); pp. 1590-1597 *
Image captioning method based on cross-scale feature fusion self-attention; Wang Mingzhan et al.; Computer Science; Vol. 49, No. 10; pp. 191-197 *

Also Published As

Publication number Publication date
CN116740161A 2023-09-12

Similar Documents

Publication Publication Date Title
CN110110617B (en) Medical image segmentation method and device, electronic equipment and storage medium
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111311592A (en) Three-dimensional medical image automatic segmentation method based on deep learning
CN111369581B (en) Image processing method, device, equipment and storage medium
CN112017189A (en) Image segmentation method and device, computer equipment and storage medium
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN108764039B (en) Neural network, building extraction method of remote sensing image, medium and computing equipment
CN116740161B (en) Binocular stereo matching aggregation method
US9330336B2 (en) Systems, methods, and media for on-line boosting of a classifier
CN111627024A (en) U-net improved kidney tumor segmentation method
CN111291825A (en) Focus classification model training method and device, computer equipment and storage medium
EP3836083B1 (en) Disparity estimation system and method, electronic device and computer program product
Choe et al. Urban structure classification using the 3D normal distribution transform for practical robot applications
US20230274400A1 (en) Automatically removing moving objects from video streams
CN112348819A (en) Model training method, image processing and registering method, and related device and equipment
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN115375692A (en) Workpiece surface defect segmentation method, device and equipment based on boundary guidance
CN115222954A (en) Weak perception target detection method and related equipment
CN110009641A (en) Crystalline lens dividing method, device and storage medium
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN113780389A (en) Deep learning semi-supervised dense matching method and system based on consistency constraint
CN116109822A (en) Organ image segmentation method and system based on multi-scale multi-view network
CN116071557A (en) Long tail target detection method, computer readable storage medium and driving device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN113468931B (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant