WO2023159757A1

WO2023159757A1 - Disparity map generation method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023159757A1
Application number: PCT/CN2022/090665
Authority: WO
Inventors: 唐小初; 张祎頔; 舒畅; 陈又新
Original assignee: 平安科技（深圳）有限公司
Priority date: 2022-02-22
Filing date: 2022-04-29
Publication date: 2023-08-31
Also published as: CN114519710A

Abstract

Embodiments of the present application relate to the technical field of artificial intelligence, and provide a disparity map generation method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target image, the target image comprising a left view and a right view; performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features; performing image segmentation processing on the left view features to obtain a first image feature; performing combination processing on the left view features, the first image feature, and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume by means of a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and performing semantic refinement processing on the estimated disparity map by means of a preset semantic refinement network and the first image feature to obtain a target disparity map. According to the embodiments of the present application, the accuracy of disparity estimation can be improved, and the error of the target disparity map is reduced.

Description

Disparity map generation method and device, electronic device and storage medium

This application claims priority to Chinese patent application No. 202210162805.0, filed on February 22, 2022 and entitled "Disparity map generation method and device, electronic device and storage medium", the entire content of which is incorporated into this application by reference.

Technical field

The present application relates to the technical field of artificial intelligence, and in particular to a method and device for generating a disparity map, an electronic device, and a storage medium.

Background art

Disparity estimation is a fundamental computer vision problem that aims to predict distance measurements for each point in a target scene.

Technical problem

The following are the technical problems of the prior art that the inventors are aware of:

Current stereo matching algorithms usually encounter difficulties in ill-posed regions such as weak textures, repeated textures, and occlusions when performing disparity estimation, and often cannot accurately estimate the disparity of the target object, resulting in a large error in the generated disparity map. Therefore, how to improve the accuracy of disparity estimation and reduce the error of the disparity map has become an urgent technical problem to be solved.

Technical solution

In a first aspect, an embodiment of the present application provides a method for generating a disparity map, the method including:

acquiring a target image, wherein the target image includes a left view and a right view;

performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features;

performing image segmentation processing on the left view features to obtain a first image feature;

combining the left view features, the first image feature and the right view features to obtain a target cost volume;

performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and

performing semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.

In a second aspect, an embodiment of the present application provides a device for generating a disparity map, the device including:

An image acquisition module, configured to acquire a target image, wherein the target image includes a left view and a right view of the target object;

A feature extraction module, configured to perform feature extraction on the left view to obtain multiple left view features, and perform feature extraction on the right view to obtain multiple right view features;

An image segmentation module, configured to perform image segmentation processing on the left view feature to obtain the first image feature;

A fusion module, configured to perform fusion processing on the left view feature, the first image feature, and the right view feature to obtain a target cost volume;

A disparity estimation module, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

A semantic refinement module, configured to perform semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.

In a third aspect, an embodiment of the present application provides an electronic device, the electronic device including a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory. When the program is executed by the processor, a method for generating a disparity map is implemented, wherein the method includes: acquiring a target image, wherein the target image includes a left view and a right view; performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features; performing image segmentation processing on the left view features to obtain a first image feature; combining the left view features, the first image feature and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and performing semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.

In a fourth aspect, an embodiment of the present application provides a storage medium, the storage medium being a computer-readable storage medium that stores one or more programs, and the one or more programs can be executed by one or more processors to implement a method for generating a disparity map, wherein the method includes: acquiring a target image, wherein the target image includes a left view and a right view; performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features; performing image segmentation processing on the left view features to obtain a first image feature; combining the left view features, the first image feature and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and performing semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.

Beneficial effect

The disparity map generation method and device, electronic device and storage medium proposed in this application make the obtained left view features and right view features better meet the requirements of disparity estimation. Performing disparity estimation on the target cost volume with the preset three-dimensional convolutional hourglass model allows semantic information to assist disparity estimation and improves its reliability. Performing semantic refinement on the estimated disparity map through the preset semantic refinement network and the first image feature enhances scene understanding for the stereo matching task, improves the accuracy of disparity estimation, and reduces the error of the disparity map.

Description of drawings

The accompanying drawings are used to provide a further understanding of the technical solution of the present application, and constitute a part of the specification, and are used together with the embodiments of the present application to explain the technical solution of the present application, and do not constitute a limitation to the technical solution of the present application.

FIG. 1 is a flowchart of a method for generating a disparity map provided by an embodiment of the present application;

FIG. 2 is a flowchart of step S102 in FIG. 1;

FIG. 3 is a flowchart of step S103 in FIG. 1;

FIG. 4 is a flowchart of step S104 in FIG. 1;

FIG. 5 is a flowchart of step S402 in FIG. 4;

FIG. 6 is a flowchart of step S105 in FIG. 1;

FIG. 7 is a flowchart of step S106 in FIG. 1;

FIG. 8 is a schematic structural diagram of a disparity map generation device provided by an embodiment of the present application;

FIG. 9 is a schematic diagram of a hardware structure of an electronic device provided by an embodiment of the present application.

Embodiments of the present invention

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, not to limit the present application.

It should be noted that although functional modules are divided in the schematic diagram of the device and a logical sequence is shown in the flowchart, in some cases the steps shown or described may be executed with a module division different from that of the device, or in an order different from that of the flowchart. The terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field to which this application belongs. The terms used herein are only for the purpose of describing the embodiment of the application, and are not intended to limit the application.

At present, stereo matching algorithms usually encounter difficulties in ill-posed regions such as weak textures, repeated textures, and occlusions when performing disparity estimation, and often cannot accurately perform disparity estimation on target objects. Therefore, how to improve the accuracy of disparity estimation has become an urgent technical problem to be solved.

Based on this, embodiments of the present application provide a method and device for generating a disparity map, an electronic device, and a storage medium, aiming at improving the accuracy of disparity estimation and reducing errors of the disparity map.

The disparity map generation method and device, electronic device, and storage medium provided in the embodiments of the present application are specifically described through the following embodiments. First, the method for generating a disparity map in the embodiment of the present application is described.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.

Artificial intelligence basic technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, robotics technology, biometrics technology, speech processing technology, natural language processing technology, and machine learning/deep learning.

The disparity map generation method provided in the embodiment of the present application relates to the technical field of artificial intelligence. The method may be applied to a terminal, may be applied to a server, or may be software running on the terminal or the server. In some embodiments, the terminal can be a smart phone, a tablet computer, a notebook computer, a desktop computer, etc.; the server can be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software may be an application that implements the disparity map generation method, etc., but is not limited to the above forms.

The application can be used in numerous general purpose or special purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, etc. This application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including storage devices.

Fig. 1 is an optional flow chart of a method for generating a disparity map provided by an embodiment of the present application. The method in Fig. 1 may include, but is not limited to, step S101 to step S106.

Step S101, acquiring a target image, wherein the target image includes a left view and a right view;

Step S102, performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features;

Step S103, performing image segmentation processing on the left view features to obtain the first image feature;

Step S104, combining the left view features, the first image feature and the right view features to obtain the target cost volume;

Step S105, performing disparity estimation on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

Step S106, performing semantic refinement processing on the estimated disparity map through the preset semantic refinement network and the first image feature to obtain the target disparity map.

Through steps S101 to S106 of the embodiment of the present application, feature extraction is performed on the left view to obtain multiple left view features and on the right view to obtain multiple right view features, so that the obtained left view features and right view features better meet the needs of disparity estimation. Image segmentation processing is performed on the left view features to obtain the first image feature; the left view features, the first image feature and the right view features are combined to obtain the target cost volume; and disparity estimation is performed on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain an estimated disparity map, so that semantic information assists disparity estimation and improves its reliability. Finally, the estimated disparity map is semantically refined through the preset semantic refinement network and the first image feature to obtain the target disparity map, which enhances scene understanding for the stereo matching task, improves the accuracy of disparity estimation, and reduces the error of the generated target disparity map.

In step S101 of some embodiments, the target image may be a two-dimensional image or a three-dimensional image. In some embodiments, the target image may be obtained by computed tomography (CT); in other embodiments, it may be obtained by magnetic resonance imaging (MRI). In still other embodiments, the target image may be obtained by shooting with a binocular camera or the like, but is not limited thereto. The left view and the right view are the left and right views captured by a binocular camera.

Before step S102 in some embodiments, the disparity map generation method further includes pre-constructing a stereo matching network. The stereo matching network mainly includes a feature extraction module, an image segmentation module, a disparity estimation module and a semantic refinement module. The feature extraction module is mainly composed of a residual network and performs feature extraction on the input target image. The image segmentation module is mainly composed of a PSPNet decoding network and performs sampling processing on the target image after feature extraction. The disparity estimation module is mainly composed of a three-dimensional convolutional network and performs disparity estimation on the target image after sampling processing to generate an estimated disparity map. The semantic refinement module is mainly composed of a semantic refinement network, which mainly includes a convolutional layer and a fully connected layer, and performs semantic refinement on the estimated disparity map to generate the target disparity map.

Referring to FIG. 2, in some embodiments, the feature extraction module includes a residual network and a pooling layer, and step S102 may include, but is not limited to, step S201 to step S202:

Step S201, performing convolution processing on the left view to obtain the convolution feature of the left view, and performing convolution processing on the right view to obtain the convolution feature of the right view;

Step S202, performing pyramid pooling processing on the left view convolution features according to the preset multi-scale feature resolution parameters to obtain multiple left view features, and performing pyramid pooling processing on the right view convolution features according to the multi-scale feature resolution parameters to obtain multiple right view features.

In step S201 of some embodiments, feature extraction is performed on the left view and the right view through the preset residual network of the feature extraction module in the stereo matching network. Specifically, the residual network is composed of multiple residual dense blocks, and the left view and the right view are each convolved by the convolutional layers of the different residual dense blocks to obtain the left view convolution features and the right view convolution features.

In step S202 of some embodiments, the left view convolution features and the right view convolution features are input to the pooling layer of the feature extraction module, and pyramid pooling is performed on each of them according to the multi-scale feature resolution parameters of the pooling layer, yielding the multi-scale features of the left view and the multi-scale features of the right view.

For example, pyramid pooling is performed on the left view convolution features according to the preset multi-scale feature resolution parameters, so that the resolutions of the obtained left view features are 1/4, 1/8 and 1/16 of the original left view resolution, respectively; likewise, pyramid pooling is performed on the right view convolution features according to the preset multi-scale feature resolution parameters, so that the resolutions of the obtained right view features are 1/4, 1/8 and 1/16 of the original right view resolution, respectively. This approach fully combines view feature information at different scales. Across multiple scales, low-level features retain high resolution while high-level features contain richer semantic information, thereby improving the accuracy of disparity estimation.
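As a non-limiting illustration of the 1/4, 1/8 and 1/16 resolutions mentioned above, the following minimal sketch downsamples a single-channel feature map by plain average pooling. The function names `average_pool` and `multi_scale_features`, the use of non-overlapping averaging windows, and the 16 x 16 sample map are assumptions made for illustration only; the embodiment does not specify the pooling operator, and a full pyramid pooling layer would typically contain further operations.

```python
def average_pool(feature_map, factor):
    """Downsample a 2D feature map by averaging non-overlapping factor x factor windows."""
    height, width = len(feature_map), len(feature_map[0])
    if height % factor or width % factor:
        raise ValueError("height and width must be divisible by factor")
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            window = [feature_map[top + dy][left + dx]
                      for dy in range(factor) for dx in range(factor)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def multi_scale_features(feature_map, factors=(4, 8, 16)):
    """Return one pooled map per factor, i.e. 1/4, 1/8 and 1/16 of the input resolution."""
    return {factor: average_pool(feature_map, factor) for factor in factors}


# Hypothetical 16 x 16 single-channel convolution feature (values are arbitrary).
conv_feature = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
scales = multi_scale_features(conv_feature)
sizes = {s: (len(m), len(m[0])) for s, m in scales.items()}
```

Here `sizes` maps each factor to the spatial size of the pooled map, so a 16 x 16 input yields 4 x 4, 2 x 2 and 1 x 1 maps for the factors 4, 8 and 16.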

Referring to FIG. 3, in some embodiments, the image segmentation module includes a decoding layer and a convolutional layer, and step S103 may include, but is not limited to, steps S301 to S303:

Step S301, performing up-sampling processing on the left view features through the preset bilinear interpolation method to obtain the first view feature latent variables;

Step S302, performing feature sorting on the first view feature latent variables through the preset first function to obtain the first view feature sequence;

Step S303, performing convolution processing on the first view feature sequence to obtain the first image feature.

In step S301 of some embodiments, the preset bilinear interpolation method uses the pixel values of four adjacent points and assigns them different weights according to their distances from the interpolation point to perform linear interpolation. Through bilinear interpolation, the left view features of different scales are up-sampled to a quarter of the original resolution. This method applies averaging low-pass filtering to the left view and smooths its edges, thereby producing a relatively coherent output image, and can also improve the accuracy of the first view feature latent variables.

For example, when performing bilinear interpolation, the 4 adjacent points around the point (x, y) on the left view are taken, interpolation is performed twice in the y direction (or x direction) and then once in the x direction (or y direction) to obtain the value f(x, y) at the point (x, y). Let the 4 adjacent points be (i, j), (i, j+1), (i+1, j) and (i+1, j+1), where i is the row index with the origin at the upper left corner and j is the column index. Let α = x - i and β = y - j. A straight line drawn through (x, y) parallel to the x-axis intersects the sides of the square formed by the 4 adjacent points at the points (i, y) and (i+1, y). Interpolation is first performed in the y direction to calculate the values f(i, y) and f(i+1, y) at these intersection points; f(i, y) is obtained by interpolating between f(i, j) and f(i, j+1).
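As a non-limiting illustration, the interpolation described above can be written with the standard bilinear formulas f(i, y) = f(i, j) + β[f(i, j+1) - f(i, j)], f(i+1, y) = f(i+1, j) + β[f(i+1, j+1) - f(i+1, j)] and f(x, y) = f(i, y) + α[f(i+1, y) - f(i, y)]. The sketch below follows the convention of this passage, in which x runs along rows and y along columns; the function name `bilinear_interpolate`, the clamping at the image border and the 2 x 2 sample image are assumptions made for illustration.

```python
import math


def bilinear_interpolate(image, x, y):
    """Interpolate a 2D grid at fractional position (x, y); x is the row, y the column."""
    rows, cols = len(image), len(image[0])
    if rows < 2 or cols < 2:
        raise ValueError("image must be at least 2 x 2")
    if not (0 <= x <= rows - 1 and 0 <= y <= cols - 1):
        raise ValueError("(x, y) lies outside the image")
    # Upper-left neighbour (i, j); clamp so that (i + 1, j + 1) stays inside the grid.
    i = min(int(math.floor(x)), rows - 2)
    j = min(int(math.floor(y)), cols - 2)
    alpha, beta = x - i, y - j
    # Two interpolations along the y direction (columns).
    f_i_y = image[i][j] + beta * (image[i][j + 1] - image[i][j])
    f_i1_y = image[i + 1][j] + beta * (image[i + 1][j + 1] - image[i + 1][j])
    # One interpolation along the x direction (rows).
    return f_i_y + alpha * (f_i1_y - f_i_y)


# Hypothetical 2 x 2 image; the centre point is the mean of the four neighbours.
sample = [[0.0, 10.0], [20.0, 30.0]]
centre = bilinear_interpolate(sample, 0.5, 0.5)
```

Evaluating the function on a denser grid of (x, y) positions than the input grid gives the up-sampled feature map.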

In step S302 of some embodiments, the first function is a concat function, and the first view feature sequence is obtained by sequentially connecting the first view feature latent variables through the concat function.

In step S303 of some embodiments, convolution processing is performed on the feature sequence of the first view through a convolution layer to obtain multiple first image features of different scales.

Referring to FIG. 4, in some embodiments, step S104 may include, but is not limited to, steps S401 to S402:

Step S401, grouping and combining the left view features and the right view features according to the preset multi-scale feature resolution parameters to obtain an initial cost volume;

Step S402, concatenating the initial cost volume and the first image feature through the preset three-dimensional convolutional network to obtain the target cost volume.

In step S401 of some embodiments, the preset multi-scale feature resolution parameters may be 4, 8, 16, etc., and may be set according to actual conditions, but are not limited thereto. The multiple left view features and multiple right view features are grouped by multi-scale feature resolution parameter and combined. For example, the left view feature and the right view feature whose multi-scale feature resolution parameter is 4 are added as vectors to obtain the view feature whose multi-scale feature resolution parameter is 4. The size of the initial cost volume can be expressed as [expression not reproduced], where H is the height of the target image, W is the width of the target image, D is the disparity search range, C is the number of feature channels, and s is the downsampling rate, s = 4, 8, 16.

It should be noted that the cost volume is a low-resolution cost volume constructed at different scales, and refers to the intermediate result obtained in the process of image stitching. Specifically, since most stereo matching is binocular stereo matching, the input of the stereo matching network is usually two images, namely the left view and the right view. When stitching the left view and the right view, the stereo matching network initializes a maximum disparity. For example, if the maximum disparity is 5, five stitching operations at different offsets are performed on the left view and the right view, corresponding to disparity values of 0, 1, 2, 3 and 4. When the disparity value is 0, the left view and the right view are stitched directly; when the disparity value is 1, they are stitched with an offset of 1 pixel; when the disparity value is 2, with an offset of 2 pixels; when the disparity value is 3, with an offset of 3 pixels; and when the disparity value is 4, with an offset of 4 pixels.

The tensor size of the original left view and right view is W*H*3, where W is the image width, H is the image height, and 3 is the number of channels, so each view tensor is three-dimensional. The tensor size of the target view obtained by stitching is W*H*3*5; this target view is the cost volume, and its tensor is four-dimensional. In short, the input images are stitched at the offsets determined by the preset maximum disparity, and the intermediate product obtained is the cost volume. Further, the cost volume is input to the stereo matching network for pixel-by-pixel matching to obtain the fused cost volume. At the same time, the stereo matching network can remove the maximum disparity dimension from the tensor size of the cost volume (for example, the maximum disparity value 5), so that the tensor of the output image is still three-dimensional, that is, W*H*3.
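As a non-limiting illustration of this stitching, the sketch below builds a four-dimensional cost volume from two H x W x C views by shifting the right view by each candidate disparity and adding it to the left view element by element, so that the channel count is preserved. The function name `build_cost_volume`, the shift direction (left pixel at column c paired with right pixel at column c - d), the zero filling of positions with no counterpart and the sample values are assumptions made for illustration; the embodiment does not specify these details.

```python
def build_cost_volume(left, right, max_disparity):
    """Stack one fused view per disparity 0 .. max_disparity - 1.

    left, right: H x W x C nested lists. Returns a D x H x W x C nested list.
    """
    height, width, channels = len(left), len(left[0]), len(left[0][0])
    volume = []
    for d in range(max_disparity):
        plane = []
        for r in range(height):
            row = []
            for c in range(width):
                if c - d >= 0:
                    # Offset the right view by d pixels and add it to the left view.
                    fused = [a + b for a, b in zip(left[r][c], right[r][c - d])]
                else:
                    # No right-view pixel exists at this offset (assumed zero fill).
                    fused = [0.0] * channels
                row.append(fused)
            plane.append(row)
        volume.append(plane)
    return volume


# Hypothetical 1 x 3 views with a single channel.
left_view = [[[1.0], [2.0], [3.0]]]
right_view = [[[10.0], [20.0], [30.0]]]
cost_volume = build_cost_volume(left_view, right_view, max_disparity=2)
```

With a maximum disparity of 5 and three-channel views, the same function returns a 5 x H x W x 3 structure, which corresponds to the W*H*3*5 tensor size described above.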

Referring to FIG. 5, in some embodiments, step S402 may also include, but is not limited to, steps S501 to S503:

Step S501, regularizing the initial cost volume through the three-dimensional convolutional network to obtain the first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolutional network to obtain the first intermediate image feature;

Step S502, performing down-sampling processing on the first intermediate cost volume through the three-dimensional convolutional network to obtain a second intermediate cost volume, and performing up-sampling processing on the first intermediate image feature to obtain a second intermediate image feature;

Step S503, concatenating the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolutional network to obtain the target cost volume.

In step S501 of some embodiments, the initial cost volume is regularized through the three-dimensional convolutions of the three-dimensional convolutional network to obtain the first intermediate cost volumes A, which include a first intermediate cost volume A1 with a feature resolution of [expression not reproduced], a first intermediate cost volume A2 with a feature resolution of [expression not reproduced], and a first intermediate cost volume A3 with a feature resolution of [expression not reproduced]. The number of channels of the first image feature whose multi-scale feature resolution parameter is 16 is adjusted to [expression not reproduced], and this first image feature is raised from two dimensions to four dimensions. The four-dimensional first image feature is then regularized through the three-dimensional convolutions of the three-dimensional convolutional network to obtain a first intermediate image feature B3 with a feature resolution of [expression not reproduced]. The same operation yields a first intermediate image feature B2 with a feature resolution of [expression not reproduced] and a first intermediate image feature B1 with a feature resolution of [expression not reproduced].

In step S502 and step S503 of some embodiments, the first intermediate cost volume A1 with a feature resolution of [expression not reproduced] is down-sampled, and the down-sampled first intermediate cost volume A1 is concatenated with the first intermediate cost volume A2 with a feature resolution of [expression not reproduced] to obtain the second intermediate cost volume, whose number of channels is adjusted through three-dimensional convolution. The second intermediate cost volume is then down-sampled using a three-dimensional convolution with a stride of 2 in the three-dimensional convolutional network, and the down-sampled second intermediate cost volume is concatenated with the first intermediate cost volume A3 with a feature resolution of [expression not reproduced] to obtain the target view cost volume.

Similarly, the first intermediate image feature B3 with a feature resolution of [expression not reproduced] is concatenated with the first intermediate cost volume A3 with a feature resolution of [expression not reproduced] to obtain the second intermediate image feature with a feature resolution of [expression not reproduced]. Through three-dimensional deconvolution, the second intermediate image feature C3 with a feature resolution of [expression not reproduced] is up-sampled to obtain the second intermediate image feature C2 with a feature resolution of [expression not reproduced], and the number of channels of the second intermediate image feature C2 with a feature resolution of [expression not reproduced] is adjusted by three-dimensional convolution.

Through the above operations, the first intermediate image feature B2 with a feature resolution of [expression not reproduced] is concatenated with the first intermediate cost volume A2 with a feature resolution of [expression not reproduced] to obtain the second intermediate image feature C1 with a feature resolution of [expression not reproduced], and the first intermediate image feature B1 with a feature resolution of [expression not reproduced] is concatenated with the first intermediate cost volume A1 with a feature resolution of [expression not reproduced] to obtain the second intermediate image feature C1 with a feature resolution of [expression not reproduced].

Finally, the second intermediate image feature C1 with a feature resolution of [expression not reproduced] and the target view cost volume are concatenated to obtain the target cost volume.

Please refer to FIG. 6. In some embodiments, the three-dimensional convolutional hourglass model includes an aggregation layer and a prediction layer. Step S105 includes but is not limited to steps S601 to S602:

Step S601, performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;

Step S602, performing disparity estimation on the fused cost volume through the second function of the prediction layer to obtain an estimated disparity map.

In step S601 of some embodiments, the three-dimensional convolutional hourglass model includes two stacked aggregation layers, and the structure of each aggregation layer is the same as that of the above-mentioned three-dimensional convolutional network. The target cost volume is input into each of the two aggregation layers and aggregated by each of them, and the outputs of the two aggregation layers are then fused to obtain the final fused cost volume.

In step S602 of some embodiments, the second function is a soft argmin function. Applying the soft argmin function to the fused cost volume obtained through aggregation yields a more accurate disparity estimate, from which the estimated disparity map is obtained.
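Soft argmin is commonly computed as a probability-weighted sum of the candidate disparities, where the probabilities come from a softmax over the negated costs. The application does not spell out the formula, so the NumPy sketch below assumes that common form; the function name `soft_argmin`, the disparity-first axis layout, and the toy cost values are illustrative only.

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """Expected disparity under softmax(-cost) along the disparity axis."""
    # Negate so that a low cost maps to a high probability,
    # and subtract the maximum for numerical stability.
    logits = -cost
    logits = logits - logits.max(axis=axis, keepdims=True)
    prob = np.exp(logits)
    prob /= prob.sum(axis=axis, keepdims=True)
    # Candidate disparities 0..D-1, reshaped to broadcast along `axis`.
    d = np.arange(cost.shape[axis], dtype=cost.dtype)
    shape = [1] * cost.ndim
    shape[axis] = -1
    return (prob * d.reshape(shape)).sum(axis=axis)

# Toy fused cost volume: D=4 candidate disparities for a 1x2 image.
# The first pixel has its lowest cost at d=1, the second at d=0.
cost = np.array([[[9.0, 0.0]], [[0.0, 9.0]], [[9.0, 9.0]], [[9.0, 9.0]]])
disparity = soft_argmin(cost, axis=0)
```

Unlike a hard argmin, the result is a continuous value and the operation is differentiable, which is why it is often used as the regression step in learned stereo matching.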

Please refer to FIG. 7. In some embodiments, the semantic refinement network includes a convolutional layer and a fully connected layer, and step S106 may include, but is not limited to, steps S701 to S704:

Step S701, performing probability calculation on the first image feature through the third function of the semantic refinement network to generate a semantic probability map;

Step S702, performing convolution processing on the estimated disparity map through the semantic refinement network to obtain the estimated disparity feature;

Step S703, performing fusion processing on the semantic probability map and the estimated disparity feature through the semantic refinement network to obtain the preliminary disparity feature;

Step S704, decoding the preliminary disparity feature through the semantic refinement network to obtain the target disparity map.

In step S701 of some embodiments, a third function is preset on the fully connected layer of the semantic refinement network, and the third function is a softmax function. Probability calculation is performed on the first image feature through the softmax function, which, according to the calculation result, creates a probability distribution over the preset semantic category labels. The semantic probability map reflects the semantic likelihood of the first image feature for the different semantic category labels.

In step S702 of some embodiments, two-dimensional convolution processing is performed on the estimated disparity map through the convolution layer of the semantic refinement network to capture image features of the estimated disparity map to obtain estimated disparity features.

In step S703 of some embodiments, through the convolutional layer of the semantic refinement network, the semantic probability map and the estimated disparity feature are multiplied as vectors according to a preset weight ratio, so as to fuse the semantic feature with the estimated disparity feature and obtain semantically weighted preliminary disparity features.

In step S704 of some embodiments, convolutional decoding and deconvolution upsampling are performed on the preliminary disparity feature through the convolutional layer of the semantic refinement network to obtain a target disparity map, which reflects the disparity at the resolution of the target image.

Through the above steps S701 to S704, the disparity map generation method uses the image segmentation results to weight the disparity estimation results by semantic category, and then encodes and decodes them. This improves the scene-semantic reliability of the estimated disparity and enhances the understanding of the scene for the stereo matching task. Using the semantic information of the scene can improve disparity estimation in ill-posed regions, thereby improving the accuracy of disparity estimation and reducing the error of the disparity map.
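Steps S701 and S703 can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the refinement network itself: the per-class scores are taken as given, the estimated disparity feature is assumed to have one channel per semantic class so that it can be multiplied element-wise with the probability map, and the scalar `weight` is a hypothetical stand-in for the preset weight ratio. The convolutional encoding and decoding of steps S702 and S704 are omitted.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_weighting(seg_logits, disp_feat, weight=1.0):
    """Weight disparity features by per-pixel class probabilities.

    seg_logits: (K, H, W) per-class scores derived from the first image feature.
    disp_feat:  (K, H, W) estimated disparity features (K channels assumed).
    weight:     hypothetical scalar standing in for the preset weight ratio.
    """
    prob_map = softmax(seg_logits, axis=0)   # S701: semantic probability map
    fused = weight * prob_map * disp_feat    # S703: element-wise fusion
    return prob_map, fused

# Random stand-in data: 3 semantic classes on a 4x5 grid.
rng = np.random.default_rng(0)
seg_logits = rng.normal(size=(3, 4, 5))
disp_feat = rng.normal(size=(3, 4, 5))
prob_map, fused = semantic_weighting(seg_logits, disp_feat, weight=0.5)
```

At every pixel the class probabilities sum to one, so the fused features are scaled according to how likely each semantic category is at that location.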

In this embodiment of the present application, a target image is acquired, where the target image includes a left view and a right view. Feature extraction is performed on the left view to obtain multiple left view features, and on the right view to obtain multiple right view features, so that the obtained left view features and right view features better meet the requirements of disparity estimation. Image segmentation is then performed on the left view features to obtain the first image feature; the left view features, the first image feature and the right view features are combined to obtain the target cost volume; and disparity estimation is performed on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain an estimated disparity map. In this way, semantic information can be used to assist disparity estimation and improve its reliability. Finally, the estimated disparity map is semantically refined through the preset semantic refinement network and the first image feature to obtain the target disparity map, which can enhance the understanding of the scene for the stereo matching task, improve the accuracy of disparity estimation, and reduce the error of the disparity map.

Please refer to FIG. 8. The embodiment of the present application also provides a disparity map generation device that can implement the above disparity map generation method. The device includes:

An image acquisition module 801, configured to acquire a target image, wherein the target image includes a left view and a right view of the target object;

The feature extraction module 802 is used to perform feature extraction on the left view to obtain multiple left view features, and perform feature extraction on the right view to obtain multiple right view features;

The image segmentation module 803 is configured to perform image segmentation processing on the left view feature to obtain the first image feature;

The fusion module 804 is used to perform fusion processing on the left view features, the first image feature and the right view features to obtain the target cost volume;

A disparity estimation module 805, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

The semantic refinement module 806 is configured to perform semantic refinement processing on the estimated disparity map through the preset semantic refinement network and the first image features, to obtain the target disparity map.

The specific implementation manner of the disparity map generation device is basically the same as the specific embodiment of the above-mentioned disparity map generation method, and will not be repeated here.

The embodiment of the present application also provides an electronic device. The electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory. When the program is executed by the processor, the above disparity map generation method is implemented. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.

Please refer to FIG. 9. FIG. 9 illustrates a hardware structure of an electronic device in another embodiment. The electronic device includes:

The processor 901 may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs so as to realize the technical solutions provided by the embodiments of the present application;

The memory 902 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute a disparity map generation method, wherein the disparity map generation method includes: acquiring a target image, wherein the target image includes a left view and a right view; performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features; performing image segmentation processing on the left view features to obtain a first image feature; combining the left view features, the first image feature and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and performing semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map;

The input/output interface 903 is used to realize information input and output;

The communication interface 904 is used to realize communication between the device and other devices, and the communication can be realized in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth);

The bus 905 is used to transfer information between the components of the device (such as the processor 901, the memory 902, the input/output interface 903 and the communication interface 904);

The processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are connected to each other within the device through the bus 905.

An embodiment of the present application also provides a storage medium, which is a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement a method for generating a disparity map, wherein the method for generating a disparity map includes: acquiring a target image, wherein the target image includes a left view and a right view; performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features; performing image segmentation processing on the left view features to obtain a first image feature; combining the left view features, the first image feature and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map; and performing semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.

As a non-transitory computer-readable storage medium, memory can be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage devices. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor via a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The disparity map generation method, disparity map generation device, electronic device, and storage medium provided in the embodiments of the present application acquire a target image, wherein the target image includes a left view and a right view. Feature extraction is performed on the left view to obtain multiple left view features, and on the right view to obtain multiple right view features, so that the obtained left view features and right view features better meet the requirements of disparity estimation. Image segmentation is then performed on the left view features to obtain the first image feature; the left view features, the first image feature and the right view features are combined to obtain the target cost volume; and disparity estimation is performed on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain an estimated disparity map. In this way, semantic information can be used to assist disparity estimation and improve its reliability. Finally, the estimated disparity map is semantically refined through the preset semantic refinement network and the first image feature to obtain the target disparity map, which can improve the scene-semantic reliability of the estimated disparity and enhance the understanding of the scene for the stereo matching task. Using the semantic information of the scene can improve disparity estimation in ill-posed regions, thereby improving the accuracy of disparity estimation and reducing the error of the disparity map.

The embodiments described in the present application are intended to illustrate the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application. Those skilled in the art know that, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are also applicable to similar technical problems.

Those skilled in the art can understand that the technical solutions shown in FIGS. 1-7 do not constitute a limitation on the embodiments of the present application, and may include more or fewer steps than those shown in the figures, combine certain steps, or include different steps.

The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art can understand that all or some of the steps in the methods disclosed above, the functional modules/units in the system, and the device can be implemented as software, firmware, hardware, and an appropriate combination thereof.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the method in each embodiment of the present application. The aforementioned storage media include various media that can store programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The preferred embodiments of the embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of rights of the embodiments of the present application. Any modifications, equivalent replacements and improvements made by those skilled in the art without departing from the scope and essence of the embodiments of the present application shall fall within the scope of rights of the embodiments of the present application.

Claims

1. A method for generating a disparity map, wherein the method includes:

Acquiring a target image, wherein the target image includes a left view and a right view;

performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features;

performing image segmentation processing on the left view feature to obtain a first image feature;

combining the left view feature, the first image feature and the right view feature to obtain a target cost volume;

Performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

Semantic refinement is performed on the estimated disparity map by using a preset semantic refinement network and the first image feature to obtain a target disparity map.
2. The method for generating a disparity map according to claim 1, wherein the step of performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features, includes:

performing convolution processing on the left view to obtain convolution features of the left view, and performing convolution processing on the right view to obtain convolution features of the right view;

performing pyramid pooling processing on the convolution features of the left view according to preset multi-scale feature resolution parameters to obtain the multiple left view features, and performing pyramid pooling processing on the convolution features of the right view according to the multi-scale feature resolution parameters to obtain the multiple right view features.
3. The method for generating a disparity map according to claim 1, wherein the step of performing image segmentation processing on the left view feature to obtain the first image feature comprises:

performing upsampling processing on the left view feature by a preset bilinear peak interpolation method to obtain a first view feature latent variable;

performing feature sorting on the first view feature latent variable by a preset first function to obtain a first view feature sequence;

Perform convolution processing on the first view feature sequence to obtain the first image feature.
4. The method for generating a disparity map according to claim 1, wherein the step of combining the left view feature, the first image feature and the right view feature to obtain a target cost volume includes:

classifying and combining the left view feature and the right view feature according to preset multi-scale feature resolution parameters to obtain an initial cost volume;

splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume.
5. The method for generating a disparity map according to claim 4, wherein the step of splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume includes:

regularizing the initial cost volume through the three-dimensional convolutional network to obtain a first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolutional network to obtain a first intermediate image feature;

performing downsampling processing on the first intermediate cost volume through the three-dimensional convolutional network to obtain a second intermediate cost volume, and performing upsampling processing on the first intermediate image feature to obtain a second intermediate image feature;

splicing the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolutional network to obtain the target cost volume.
6. The method for generating a disparity map according to claim 1, wherein the three-dimensional convolutional hourglass model includes an aggregation layer and a prediction layer, and the step of performing disparity estimation on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain the estimated disparity map includes:

performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;

Performing disparity estimation on the fused cost volume through the second function of the prediction layer to obtain the estimated disparity map.
7. The method for generating a disparity map according to any one of claims 1 to 6, wherein the step of performing semantic refinement processing on the estimated disparity map through the preset semantic refinement network and the first image feature to obtain the target disparity map includes:

Performing probability calculation on the first image feature through the third function of the semantic refinement network to generate a semantic probability map;

performing convolution processing on the estimated disparity map through the semantic refinement network to obtain estimated disparity features;

performing fusion processing on the semantic probability map and the estimated disparity feature through the semantic refinement network to obtain preliminary disparity features;

Decoding the preliminary disparity feature through the semantic refinement network to obtain the target disparity map.
8. A device for generating a disparity map, wherein the device includes:

An image acquisition module, configured to acquire a target image, wherein the target image includes a left view and a right view;

A feature extraction module, configured to perform feature extraction on the left view to obtain multiple left view features, and perform feature extraction on the right view to obtain multiple right view features;

An image segmentation module, configured to perform image segmentation processing on the left view feature to obtain the first image feature;

A fusion module, configured to combine the left view features, the first image features, and the right view features to obtain a target cost volume;

A disparity estimation module, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

The semantic refinement module is configured to perform semantic refinement processing on the estimated disparity map through a preset semantic refinement network and the first image feature, to obtain a target disparity map.
9. An electronic device, wherein the electronic device includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, and when the program is executed by the processor, a method for generating a disparity map is implemented, wherein the method for generating a disparity map includes:

Acquiring a target image, wherein the target image includes a left view and a right view;

performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features;

performing image segmentation processing on the left view feature to obtain a first image feature;

combining the left view feature, the first image feature and the right view feature to obtain a target cost volume;

Performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

Semantic refinement is performed on the estimated disparity map by using a preset semantic refinement network and the first image feature to obtain a target disparity map.
10. The electronic device according to claim 9, wherein the step of performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features, includes:

performing convolution processing on the left view to obtain convolution features of the left view, and performing convolution processing on the right view to obtain convolution features of the right view;

performing pyramid pooling processing on the convolution features of the left view according to preset multi-scale feature resolution parameters to obtain the multiple left view features, and performing pyramid pooling processing on the convolution features of the right view according to the multi-scale feature resolution parameters to obtain the multiple right view features.
11. The electronic device according to claim 9, wherein the step of performing image segmentation processing on the left view feature to obtain the first image feature comprises:

performing upsampling processing on the left view feature by a preset bilinear peak interpolation method to obtain a first view feature latent variable;

performing feature sorting on the first view feature latent variable by a preset first function to obtain a first view feature sequence;

Perform convolution processing on the first view feature sequence to obtain the first image feature.
12. The electronic device according to claim 9, wherein the step of combining the left view feature, the first image feature and the right view feature to obtain a target cost volume includes:

classifying and combining the left view feature and the right view feature according to preset multi-scale feature resolution parameters to obtain an initial cost volume;

splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume.
13. The electronic device according to claim 12, wherein the step of splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume includes:

regularizing the initial cost volume through the three-dimensional convolutional network to obtain a first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolutional network to obtain a first intermediate image feature;

performing downsampling processing on the first intermediate cost volume through the three-dimensional convolutional network to obtain a second intermediate cost volume, and performing upsampling processing on the first intermediate image feature to obtain a second intermediate image feature;

splicing the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolutional network to obtain the target cost volume.
14. The electronic device according to claim 9, wherein the three-dimensional convolutional hourglass model includes an aggregation layer and a prediction layer, and the step of performing disparity estimation on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain the estimated disparity map includes:

performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;

Performing disparity estimation on the fused cost volume through the second function of the prediction layer to obtain the estimated disparity map.
15. A storage medium, wherein the storage medium is a computer-readable storage medium, the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement a method for generating a disparity map, wherein the method for generating a disparity map includes:

Acquiring a target image, wherein the target image includes a left view and a right view;

performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features;

performing image segmentation processing on the left view feature to obtain a first image feature;

combining the left view feature, the first image feature and the right view feature to obtain a target cost volume;

Performing disparity estimation on the target cost volume through a preset three-dimensional convolutional hourglass model to obtain an estimated disparity map;

Semantic refinement is performed on the estimated disparity map by using a preset semantic refinement network and the first image feature to obtain a target disparity map.
16. The storage medium according to claim 15, wherein the step of performing feature extraction on the left view to obtain multiple left view features, and performing feature extraction on the right view to obtain multiple right view features, includes:

performing convolution processing on the left view to obtain convolution features of the left view, and performing convolution processing on the right view to obtain convolution features of the right view;

performing pyramid pooling processing on the convolution features of the left view according to preset multi-scale feature resolution parameters to obtain the multiple left view features, and performing pyramid pooling processing on the convolution features of the right view according to the multi-scale feature resolution parameters to obtain the multiple right view features.
17. The storage medium according to claim 15, wherein the step of performing image segmentation processing on the left view feature to obtain the first image feature comprises:

performing upsampling processing on the left view feature by a preset bilinear peak interpolation method to obtain a first view feature latent variable;

performing feature sorting on the first view feature latent variable by a preset first function to obtain a first view feature sequence;

Perform convolution processing on the first view feature sequence to obtain the first image feature.
18. The storage medium according to claim 15, wherein the step of combining the left view feature, the first image feature and the right view feature to obtain a target cost volume includes:

classifying and combining the left view feature and the right view feature according to preset multi-scale feature resolution parameters to obtain an initial cost volume;

splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume.
19. The storage medium according to claim 18, wherein the step of splicing the initial cost volume and the first image feature through a preset three-dimensional convolutional network to obtain the target cost volume includes:

regularizing the initial cost volume through the three-dimensional convolutional network to obtain a first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolutional network to obtain a first intermediate image feature;

performing downsampling processing on the first intermediate cost volume through the three-dimensional convolutional network to obtain a second intermediate cost volume, and performing upsampling processing on the first intermediate image feature to obtain a second intermediate image feature;

splicing the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolutional network to obtain the target cost volume.
20. The storage medium according to claim 15, wherein the three-dimensional convolutional hourglass model includes an aggregation layer and a prediction layer, and the step of performing disparity estimation on the target cost volume through the preset three-dimensional convolutional hourglass model to obtain the estimated disparity map includes:

performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;

Performing disparity estimation on the fused cost volume through the second function of the prediction layer to obtain the estimated disparity map.