CN114519710A - Disparity map generation method and device, electronic equipment and storage medium


Info

Publication number: CN114519710A
Application number: CN202210162805.0A
Authority: CN (China)
Prior art keywords: image, feature, disparity map, target, features
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 唐小初, 张祎頔, 舒畅, 陈又新
Current assignee: Ping An Technology Shenzhen Co Ltd
Original assignee: Ping An Technology Shenzhen Co Ltd
Application filed by: Ping An Technology Shenzhen Co Ltd
Priority: CN202210162805.0A; PCT/CN2022/090665 (published as WO2023159757A1)
Publication: CN114519710A


Classifications

    • G06T 7/0002 (Image analysis): Inspection of images, e.g. flaw detection
    • G06F 18/253 (Pattern recognition): Fusion techniques of extracted features
    • G06N 3/045 (Neural networks): Combinations of networks
    • G06N 3/08 (Neural networks): Learning methods
    • G06T 3/4038 (Geometric image transformations): Image mosaicing, e.g. composing plane images from plane sub-images
    • G06T 7/10 (Image analysis): Segmentation; edge detection


Abstract

The embodiments of the present application provide a disparity map generation method and apparatus, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target image, wherein the target image comprises a left view and a right view; performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features; performing image segmentation processing on the left view features to obtain a first image feature; combining the left view features, the first image feature and the right view features to obtain a target cost volume; performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map; and performing semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map. The embodiments of the present application can improve the accuracy of disparity estimation and reduce the error of the target disparity map.

Description

Disparity map generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a disparity map generation method and apparatus, an electronic device, and a storage medium.
Background
Disparity estimation is a fundamental computer vision problem that aims to predict a distance measure for every point in a target scene. When disparity estimation is performed with conventional stereo matching algorithms, difficulties usually arise in uncertain regions such as weak texture, repeated texture and occlusion, and the disparity of the target object cannot be estimated accurately, so the generated disparity map has a large error. How to improve the accuracy of disparity estimation and reduce the error of the disparity map has therefore become a technical problem to be solved urgently.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a disparity map generation method and apparatus, an electronic device, and a storage medium, with the goal of improving the accuracy of disparity estimation and reducing the error of disparity maps.
To achieve the above object, a first aspect of the embodiments of the present application provides a disparity map generation method, including:
acquiring a target image, wherein the target image comprises a left view and a right view;
performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features;
performing image segmentation processing on the left view features to obtain a first image feature;
combining the left view features, the first image feature and the right view features to obtain a target cost volume;
performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
and performing semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map.
In some embodiments, the step of performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features, includes:
performing convolution processing on the left view to obtain left view convolution features, and performing convolution processing on the right view to obtain right view convolution features;
and performing pyramid pooling on the left view convolution features according to a preset multi-scale feature resolution parameter to obtain a plurality of left view features, and performing pyramid pooling on the right view convolution features according to the multi-scale feature resolution parameter to obtain a plurality of right view features.
In some embodiments, the step of performing image segmentation processing on the left view features to obtain a first image feature includes:
performing up-sampling processing on the left view features through a preset bilinear interpolation method to obtain first view feature hidden variables;
performing feature concatenation on the first view feature hidden variables through a preset first function to obtain a first view feature sequence;
and performing convolution processing on the first view feature sequence to obtain the first image feature.
In some embodiments, the step of combining the left view features, the first image feature, and the right view features to obtain a target cost volume includes:
performing classification and combination processing on the left view features and the right view features according to preset multi-scale feature resolution parameters to obtain an initial cost volume;
and splicing the initial cost volume and the first image feature through a preset three-dimensional convolution network to obtain the target cost volume.
In some embodiments, the step of splicing the initial cost volume and the first image feature through a preset three-dimensional convolution network to obtain the target cost volume includes:
regularizing the initial cost volume through the three-dimensional convolution network to obtain a first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolution network to obtain a first intermediate image feature;
performing down-sampling processing on the first intermediate cost volume through the three-dimensional convolution network to obtain a second intermediate cost volume, and performing up-sampling processing on the first intermediate image feature to obtain a second intermediate image feature;
and splicing the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolution network to obtain the target cost volume.
In some embodiments, the three-dimensional convolution hourglass model includes an aggregation layer and a prediction layer, and the step of performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map includes:
performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;
and performing disparity estimation on the fused cost volume through a second function of the prediction layer to obtain the estimated disparity map.
In some embodiments, the step of performing semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map includes:
performing probability calculation on the first image feature through a third function of the semantic refining network to generate a semantic probability map;
performing convolution processing on the estimated disparity map through the semantic refining network to obtain estimated disparity features;
fusing the semantic probability map and the estimated disparity features through the semantic refining network to obtain preliminary disparity features;
and decoding the preliminary disparity features through the semantic refining network to obtain the target disparity map.
To achieve the above object, a second aspect of the embodiments of the present application provides a disparity map generating apparatus, including:
an image acquisition module, configured to acquire a target image, wherein the target image comprises a left view and a right view of a target object;
a feature extraction module, configured to perform feature extraction on the left view to obtain a plurality of left view features and perform feature extraction on the right view to obtain a plurality of right view features;
an image segmentation module, configured to perform image segmentation processing on the left view features to obtain a first image feature;
a fusion module, configured to fuse the left view features, the first image feature and the right view features to obtain a target cost volume;
a disparity estimation module, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
and a semantic refining module, configured to perform semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map.
To achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application provides a storage medium, which is a computer-readable storage medium storing one or more programs executable by one or more processors to implement the method of the first aspect.
According to the disparity map generation method and apparatus, the electronic device and the storage medium, a target image is acquired, wherein the target image comprises a left view and a right view. Feature extraction is then performed on the left view to obtain a plurality of left view features and on the right view to obtain a plurality of right view features, so that the obtained left view features and right view features better meet the requirements of disparity estimation. Next, image segmentation processing is performed on the left view features to obtain a first image feature; the left view features, the first image feature and the right view features are combined to obtain a target cost volume; and disparity estimation is performed on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map. Finally, semantic refinement is performed on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map, which enhances the stereo matching task's understanding of the scene, improves the accuracy of disparity estimation, and reduces the error of the disparity map.
Drawings
Fig. 1 is a flowchart of a disparity map generation method provided in an embodiment of the present application;
Fig. 2 is a flowchart of step S102 in Fig. 1;
Fig. 3 is a flowchart of step S103 in Fig. 1;
Fig. 4 is a flowchart of step S104 in Fig. 1;
Fig. 5 is a flowchart of step S402 in Fig. 4;
Fig. 6 is a flowchart of step S105 in Fig. 1;
Fig. 7 is a flowchart of step S106 in Fig. 1;
Fig. 8 is a schematic structural diagram of a disparity map generating apparatus according to an embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in the present application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand and apply human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, often called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it draws on data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Information Extraction: a text processing technique that extracts fact information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units, such as sentences, paragraphs and chapters, and text information is composed of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and the like from text data is text information extraction; of course, the information extracted by text information extraction techniques can be of various types.
Parallax: parallax is the difference in apparent direction that results from viewing the same object from two points separated by some distance. The angle subtended at the object by the two viewpoints is called the parallax angle, and the line connecting the two points is called the baseline. In general, if the parallax angle and the baseline length are known, the distance between the object and the observer can be calculated.
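As an illustration (not part of the original patent text), for a rectified binocular camera pair the relationship between parallax (disparity) and distance takes the familiar triangulation form, where f is the focal length, B the baseline length, and d the disparity:

```latex
% Depth from disparity for a rectified stereo pair (standard triangulation).
% Z: distance to the point, f: focal length, B: baseline, d = x_l - x_r: disparity.
Z = \frac{f \cdot B}{d}
```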
Disparity estimation: disparity estimation is a fundamental computer vision problem that aims to predict a distance measure for every point in a target scene.
Residual network (ResNet): all the hierarchical features of the original LR (low-resolution) image are exploited by residual dense blocks (RDBs). For a very deep network, it is difficult, arguably impractical, to extract the output of each convolutional layer in the LR space directly. Residual dense blocks are used as the building blocks of RDNs (residual dense networks). An RDB contains densely connected layers and local feature fusion (LFF) with local residual learning (LRL). Residual dense blocks also support contiguous memory between RDBs: the output of one RDB has direct access to the next RDB, so state is passed on continuously. Each convolutional layer of an RDB has access to all subsequent layers and passes on the information that needs to be preserved. Concatenating the output of the previous RDB with the states of all preceding layers of the current RDB, LFF extracts locally dense features by adaptively preserving information; moreover, LFF enables very high growth rates by stabilizing the training of larger networks. After multi-level local dense features are extracted, global feature fusion (GFF) is applied to adaptively retain the hierarchical features in a global manner. Each layer has direct access to the original LR input, which leads to implicit deep supervision. Residual networks are easy to optimize and can improve accuracy through considerably increased depth. The internal residual blocks use skip connections, which alleviate the vanishing-gradient problem caused by increasing depth in deep neural networks. A residual network is composed of a series of residual blocks, each divided into two parts: a direct (identity) mapping and a residual part. The residual part typically consists of two or three convolution operations.
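A minimal sketch of the residual block just described (a direct mapping plus a residual part of two convolutions), written in PyTorch for illustration; the channel count, kernel size and normalization are assumptions, not values taken from the patent:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: a direct (identity) mapping plus a two-convolution residual part."""
    def __init__(self, channels: int = 32):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Skip connection: output = input + residual(input).
        return torch.relu(x + self.residual(x))
```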
Feature Pyramid Network (FPN): the feature pyramid was first proposed for small-object detection tasks; it fully combines feature information at different scales, where low-level features have high resolution at large scales while high-level features contain richer semantic information. FPN performs excellently on the ImageNet and COCO detection tasks, and on that basis it is applied here to the field of stereo matching. In a convolutional neural network, different depths correspond to different levels of semantic features: shallow layers have high resolution and learn more detail features, while deep layers have low resolution and learn more semantic features. In an FPN, the feature map at each resolution is combined, by element-wise addition, with the feature map of the next lower resolution upscaled by a factor of two. Through these connections, feature maps at different levels fuse features of different resolutions and different semantic strengths, which ensures that every level has an appropriate resolution and strong semantic features. Moreover, the method only adds extra cross-layer connections to the original network and adds almost no extra time or computation in practical applications. Experimental results show that using a feature pyramid network has a notable effect on the stereo matching task.
Upsampling: upsampling refers to enlarging an image, and is also known as image interpolation; its main purpose is to enlarge the original image so that it can be displayed on a higher-resolution display device. Image enlargement almost always uses interpolation, i.e., a suitable interpolation algorithm inserts new elements between the pixels of the original image. Interpolation algorithms mainly include edge-based and region-based image interpolation algorithms.
Downsampling (subsampling): downsampling refers to reducing an image; its main purposes are to make the image fit the size of the display area and to generate a thumbnail of the image. For an image I of size M × N, s-fold downsampling yields an image of resolution (M/s) × (N/s), where s should be a common divisor of M and N. In matrix terms, each s × s window of the original image becomes one pixel whose value is the average of all pixels in the window.
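The s-fold window-averaging rule above is exactly average pooling; a short illustrative example with toy shapes (not from the patent):

```python
import torch
import torch.nn.functional as F

# A 1 x 1 x 4 x 4 image; 2-fold downsampling averages each 2 x 2 window.
img = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
small = F.avg_pool2d(img, kernel_size=2)   # shape (1, 1, 2, 2)
print(small)  # each output pixel is the mean of one 2 x 2 window
```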
Stereoscopic vision: stereoscopic vision is a computer vision technique that simulates human binocular vision; it estimates disparity from calibrated left and right views and acquires three-dimensional scene information by triangle similarity using parameters such as focal length and baseline distance. Stereo matching uses two horizontally rectified left and right views and matches corresponding pixels along conjugate epipolar lines to compute the disparity between the left and right images for each pixel, thereby realizing depth estimation. Stereo matching is widely applied in fields such as autonomous driving, three-dimensional reconstruction, robot control and medical diagnosis.
At present, when disparity estimation is performed with stereo matching algorithms, difficulties usually arise in uncertain regions such as weak texture, repeated texture and occlusion, and the disparity of the target object cannot be estimated accurately. How to improve the accuracy of disparity estimation has therefore become a technical problem to be solved urgently.
Based on this, embodiments of the present application provide a disparity map generation method and apparatus, an electronic device, and a storage medium, which aim to improve the accuracy of disparity estimation and reduce the error of the disparity map.
The following embodiments describe the disparity map generation method and apparatus, the electronic device, and the storage medium; the disparity map generation method of the embodiments of the present application is described first.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiments of the present application provide a disparity map generation method, relating to the technical field of artificial intelligence. The disparity map generation method provided by the embodiments of the present application can be applied to a terminal, to a server, or to software running in a terminal or server. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content delivery network) services, and big data and artificial intelligence platforms; the software may be an application that implements the disparity map generation method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a disparity map generation method provided in an embodiment of the present application; the method in Fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, acquiring a target image, wherein the target image comprises a left view and a right view;
Step S102, performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features;
Step S103, performing image segmentation processing on the left view features to obtain a first image feature;
Step S104, combining the left view features, the first image feature and the right view features to obtain a target cost volume;
Step S105, performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
Step S106, performing semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map.
In steps S101 to S106 of the embodiments of the present application, a plurality of left view features are obtained by performing feature extraction on the left view, and a plurality of right view features are obtained by performing feature extraction on the right view, so that the obtained left view features and right view features better meet the requirements of disparity estimation. Image segmentation processing is performed on the left view features to obtain a first image feature; the left view features, the first image feature and the right view features are combined to obtain a target cost volume; and disparity estimation is performed on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map, so that semantic information can assist disparity estimation and improve its reliability. Finally, semantic refinement is performed on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map, which enhances the stereo matching task's understanding of the scene, improves the accuracy of disparity estimation, and reduces the error of the generated target disparity map.

In step S101 of some embodiments, the target image may be a two-dimensional or three-dimensional image. In some embodiments the target image may be obtained by computed tomography (CT); in others it may be obtained by magnetic resonance imaging (MRI); in still others it may be captured by a binocular camera, and so on, without limitation. The left view and the right view are the left and right views shot by the binocular camera.
Before step S102 in some embodiments, the disparity map generation method further includes pre-constructing a stereo matching network. The stereo matching network mainly comprises a feature extraction module, an image segmentation module, a disparity estimation module and a semantic refining module. The feature extraction module mainly comprises a residual network and is used to perform feature extraction on the input target image; the image segmentation module mainly comprises a PSPNet decoding network and is used to sample the feature-extracted target image; the disparity estimation module is mainly composed of a three-dimensional convolution network and is used to perform disparity estimation on the sampled target image to generate an estimated disparity map; and the semantic refining module is mainly composed of a semantic refining network, which mainly comprises convolution layers and a fully connected layer and is used to perform semantic refinement on the estimated disparity map to generate the target disparity map.
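The module decomposition just described can be summarized in a skeleton like the following. This is a hedged sketch of the data flow only; every class and attribute name here is an illustrative placeholder, not an identifier from the patent:

```python
import torch.nn as nn

class StereoMatchingNetwork(nn.Module):
    """Skeleton of the four-module pipeline described above (names are assumed)."""
    def __init__(self, feature_net, segmentation_net, cost_net, hourglass, refine_net):
        super().__init__()
        self.feature_net = feature_net            # residual network (feature extraction module)
        self.segmentation_net = segmentation_net  # PSPNet-style decoder (image segmentation module)
        self.cost_net = cost_net                  # 3D conv network building the target cost volume
        self.hourglass = hourglass                # 3D convolution hourglass model (disparity estimation)
        self.refine_net = refine_net              # semantic refining network

    def forward(self, left, right):
        left_feats = self.feature_net(left)                  # step S102
        right_feats = self.feature_net(right)
        seg_feats = self.segmentation_net(left_feats)        # step S103: first image feature
        cost_volume = self.cost_net(left_feats, right_feats, seg_feats)  # step S104
        coarse_disp = self.hourglass(cost_volume)            # step S105: estimated disparity map
        return self.refine_net(coarse_disp, seg_feats)       # step S106: target disparity map
```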
Referring to Fig. 2, in some embodiments, the feature extraction module includes a residual network and a pooling layer, and step S102 may include, but is not limited to, steps S201 to S202:
Step S201, performing convolution processing on the left view to obtain left view convolution features, and performing convolution processing on the right view to obtain right view convolution features;
Step S202, performing pyramid pooling on the left view convolution features according to a preset multi-scale feature resolution parameter to obtain a plurality of left view features, and performing pyramid pooling on the right view convolution features according to the multi-scale feature resolution parameter to obtain a plurality of right view features.
In step S201 of some embodiments, feature extraction is performed on the left view and the right view respectively through the residual network of the feature extraction module in the stereo matching network. Specifically, the residual network is composed of a plurality of residual dense blocks, and the left view and the right view are convolved by the convolution layers of different residual dense blocks to obtain the left view convolution features and the right view convolution features.
In step S202 of some embodiments, the left view convolution features and the right view convolution features are input to the pooling layer of the feature extraction module, and pyramid pooling is applied to each of them according to the multi-scale feature resolution parameter of the pooling layer, yielding multi-scale features of the left view and of the right view.
For example, pyramid pooling is performed on the left view convolution features according to the preset multi-scale feature resolution parameters so that the resolutions of the obtained left view features are 1/4, 1/8 and 1/16 of the original left view resolution, respectively; likewise, pyramid pooling is performed on the right view convolution features so that the resolutions of the obtained right view features are 1/4, 1/8 and 1/16 of the original right view resolution, respectively. This fully combines view feature information at different scales: at multiple scales, low-level features retain high resolution while high-level features contain richer semantic information, which improves the accuracy of disparity estimation.
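A hedged sketch of this pyramid pooling step: adaptive average pooling reduces a convolution feature map to 1/4, 1/8 and 1/16 of the original resolution. The tensor shapes, and the assumption that the input feature map is at full image resolution, are illustrative only:

```python
import torch
import torch.nn.functional as F

def pyramid_pool(conv_feat: torch.Tensor, scales=(4, 8, 16)):
    """Pool a convolution feature map to 1/s of the original resolution for each s."""
    n, c, h, w = conv_feat.shape
    return [F.adaptive_avg_pool2d(conv_feat, (h // s, w // s)) for s in scales]

left_conv_feat = torch.randn(1, 32, 256, 512)    # toy left-view convolution feature
right_conv_feat = torch.randn(1, 32, 256, 512)   # toy right-view convolution feature
left_feats = pyramid_pool(left_conv_feat)        # three left view features
right_feats = pyramid_pool(right_conv_feat)      # three right view features
```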
Referring to Fig. 3, in some embodiments, the image segmentation module includes a decoding layer and a convolution layer, and step S103 may include, but is not limited to, steps S301 to S303:
Step S301, performing up-sampling processing on the left view features through a preset bilinear interpolation method to obtain first view feature hidden variables;
Step S302, performing feature concatenation on the first view feature hidden variables through a preset first function to obtain a first view feature sequence;
Step S303, performing convolution processing on the first view feature sequence to obtain the first image feature.
In step S301 of some embodiments, the preset bilinear interpolation method mainly uses the pixel values of 4 neighboring points, assigning different weights according to their distances from the interpolation point, and interpolates linearly. The left view features can be upsampled by bilinear interpolation, and the left view features at different scales are all upsampled to one quarter of the original resolution. This realizes an averaging low-pass filtering of the left view, smooths its edges and produces a relatively coherent output image, and it also improves the accuracy of the first view feature hidden variables.
For example, when performing the bilinear interpolation calculation, the 4 neighboring points around the point (x, y) on the left view may be taken; interpolation is performed twice in the y direction (or x direction) and then once in the x direction (or y direction) to obtain the value f(x, y) at (x, y). Let the 4 neighbors be (i, j), (i, j+1), (i+1, j) and (i+1, j+1), where i denotes the row number with the top-left corner as the origin and j denotes the column number. Let α = x - i and β = y - j. The straight line through (x, y) parallel to the x-axis intersects the edges formed by the 4 neighboring points at the points (i, y) and (i+1, y). First, interpolation is performed in the y direction to calculate the intersection values f(i, y) and f(i+1, y); f(i, y) is interpolated from f(i, j) and f(i, j+1).
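Written out with the same symbols, the two interpolations in the y direction followed by one in the x direction are:

```latex
% Bilinear interpolation at (x, y) from the four neighbours,
% with \alpha = x - i and \beta = y - j.
f(i, y)   &= (1 - \beta)\, f(i, j)   + \beta\, f(i, j+1) \\
f(i+1, y) &= (1 - \beta)\, f(i+1, j) + \beta\, f(i+1, j+1) \\
f(x, y)   &= (1 - \alpha)\, f(i, y)  + \alpha\, f(i+1, y)
```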
In step S302 of some embodiments, the first function is a concat function, and the first view feature hidden variables are concatenated in sequence by the concat function to obtain the first view feature sequence.
In step S303 of some embodiments, the first view feature sequence is convolved by the convolution layer to obtain a plurality of first image features at different scales.
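Steps S301 to S303 can be sketched as follows, assuming the left view features arrive ordered from finest (1/4) to coarsest (1/16); the function name and channel counts are illustrative, and the convolution is created inline purely for demonstration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def segment_left_features(left_feats, out_channels: int = 32):
    """S301: bilinear upsampling to 1/4 resolution; S302: concatenation
    (the 'first function'); S303: convolution. Shapes are assumed."""
    n, c, h4, w4 = left_feats[0].shape             # the 1/4-resolution feature
    upsampled = [F.interpolate(f, size=(h4, w4), mode="bilinear", align_corners=False)
                 for f in left_feats]              # S301: hidden variables at 1/4 resolution
    sequence = torch.cat(upsampled, dim=1)         # S302: first view feature sequence
    conv = nn.Conv2d(sequence.shape[1], out_channels, kernel_size=3, padding=1)
    return conv(sequence)                          # S303: first image feature
```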
Referring to Fig. 4, in some embodiments, step S104 may include, but is not limited to, steps S401 to S402:
Step S401, performing classification and combination processing on the left view features and the right view features according to preset multi-scale feature resolution parameters to obtain an initial cost volume;
Step S402, splicing the initial cost volume and the first image feature through a preset three-dimensional convolution network to obtain the target cost volume.
In step S401 of some embodiments, the preset multi-scale feature resolution parameter may be 4, 8 or 16, and may be set according to the actual situation, without limitation. The left view features and the right view features are classified and combined according to the different multi-scale feature resolution parameters; for example, vector addition is performed on the left view features and right view features whose multi-scale feature resolution parameter is 4 to obtain the view features with multi-scale feature resolution parameter 4. The size of the initial cost volume can be expressed as (H/s) × (W/s) × (D/s) × C, where H is the height of the target image, W is its width, D is the disparity search range, C is the number of feature channels, and s is the down-sampling rate, with s taking the values 4, 8 and 16.
It should be noted that a cost volume is a low-resolution cost volume constructed at different scales, and refers to an intermediate result obtained during image splicing. Specifically, since most stereo matching is binocular, the input to the stereo matching network is usually two images, a left view and a right view. When the left view and the right view are spliced, the stereo matching network initially sets a maximum disparity; for example, if the maximum disparity is 5, five splicing operations at different offsets are performed on the left and right views, with disparity values of 0, 1, 2, 3 and 4 respectively. When the disparity value is 0, the left and right views are spliced directly; when it is 1, they are spliced with an offset of 1 pixel; when it is 2, with an offset of 2 pixels; when it is 3, with an offset of 3 pixels; and when it is 4, with an offset of 4 pixels. The tensor size of the original left and right views is W × H × 3, where W denotes the image width, H the image height and 3 the number of channels, so the left and right view tensors are three-dimensional; the tensor size of the spliced target view is W × H × 3 × 5, and this target view is the cost volume, whose tensor is four-dimensional. In short, the input images are spliced at different offsets according to a preset maximum disparity, and the intermediate product obtained is the cost volume. The cost volume is then fed into the stereo matching network to match each pixel, producing a fused cost volume; at the same time, the stereo matching network removes the maximum-disparity dimension (for example, the maximum disparity value 5) from the tensor, so the tensor size of the output image is again three-dimensional, i.e., W × H × 3.
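A minimal sketch of this shift-and-concatenate construction; the toy maximum disparity of 5 follows the example above, and the feature shapes are assumptions:

```python
import torch

def build_cost_volume(left_feat, right_feat, max_disp: int = 5):
    """Concatenate left and right features at every disparity shift d in [0, max_disp),
    yielding a 5D tensor N x 2C x D x H x W (a cost volume)."""
    n, c, h, w = left_feat.shape
    volume = left_feat.new_zeros(n, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = left_feat
            volume[:, c:, d] = right_feat
        else:
            # Shift the right view by d pixels before concatenating with the left view.
            volume[:, :c, d, :, d:] = left_feat[:, :, :, d:]
            volume[:, c:, d, :, d:] = right_feat[:, :, :, :-d]
    return volume
```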
Referring to Fig. 5, in some embodiments, step S402 may further include, but is not limited to, steps S501 to S503:
Step S501, regularizing the initial cost volume through the three-dimensional convolution network to obtain first intermediate cost volumes, and regularizing the first image features through the three-dimensional convolution network to obtain first intermediate image features;
Step S502, performing down-sampling processing on the first intermediate cost volumes through the three-dimensional convolution network to obtain a second intermediate cost volume, and performing up-sampling processing on the first intermediate image features to obtain second intermediate image features;
Step S503, splicing the second intermediate cost volume and the second intermediate image features through the three-dimensional convolution network to obtain the target cost volume.
In step S501 of some embodiments, the initial cost volume is regularized by the three-dimensional convolutions of the three-dimensional convolution network to obtain first intermediate cost volumes: a first intermediate cost volume A1 with a feature resolution of 1/4, a first intermediate cost volume A2 with a feature resolution of 1/8, and, by adjusting the number of channels of the first image feature whose multi-scale feature resolution parameter is 16 with a two-dimensional convolution, a first intermediate cost volume A3 with a feature resolution of 1/16. The first semantic features are then promoted from two dimensions to four dimensions, and the four-dimensional first image features are regularized by the three-dimensional convolutions of the three-dimensional convolution network to obtain a first intermediate image feature B3 with a feature resolution of 1/16; by the same operation, a first intermediate image feature B2 with a feature resolution of 1/8 and a first intermediate image feature B1 with a feature resolution of 1/4 are obtained.
In steps S502 and S503 of some embodiments, the first intermediate cost volume A1 with feature resolution 1/4 is down-sampled using a three-dimensional convolution with a step size of 2 in the three-dimensional convolution network, and the down-sampled A1 is connected with the first intermediate cost volume A2 with feature resolution 1/8 to obtain a second intermediate cost volume, whose number of channels is adjusted by three-dimensional convolution. The second intermediate cost volume is then down-sampled using a three-dimensional convolution with a step size of 2, and the down-sampled result is connected with the first intermediate cost volume A3 with feature resolution 1/16 to obtain the target view cost volume.
Likewise, the first intermediate image feature B3 with feature resolution 1/16 is connected with the feature of the same resolution to obtain a second intermediate image feature C3 with feature resolution 1/16; C3 is up-sampled by three-dimensional deconvolution to obtain a second intermediate image feature C2 with feature resolution 1/8, and the number of channels of C2 is adjusted by three-dimensional convolution.
In the same way, the first intermediate image feature B2 with feature resolution 1/8 is connected with the up-sampled C2 of the same resolution, and then the first intermediate image feature B1 with feature resolution 1/4 is connected with the up-sampled result of the same resolution, to obtain the second intermediate image feature C1 with feature resolution 1/4.
Finally, the second intermediate image feature C1 with feature resolution 1/4 and the target view cost volume are spliced to obtain the target cost volume.
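A compressed sketch of the multi-scale splicing in steps S501 to S503: stride-2 three-dimensional convolutions down-sample the cost branch while three-dimensional deconvolutions up-sample the image-feature branch, with channel-reducing convolutions after each concatenation. All layer choices, channel counts, the shared reduction layer, and the final resolution bookkeeping are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class CostImageSplice(nn.Module):
    """Sketch of S501-S503: down-sample and concatenate the cost volumes A1..A3,
    up-sample and concatenate the image features B1..B3, then splice the branches."""
    def __init__(self, c: int = 32):
        super().__init__()
        self.down = nn.Conv3d(c, c, kernel_size=3, stride=2, padding=1)          # 1/4 -> 1/8 -> 1/16
        self.reduce = nn.Conv3d(2 * c, c, kernel_size=1)                         # channel adjustment (shared for brevity)
        self.up = nn.ConvTranspose3d(c, c, kernel_size=4, stride=2, padding=1)   # 1/16 -> 1/8 -> 1/4

    def forward(self, a1, a2, a3, b1, b2, b3):
        # Cost branch: A1 (1/4) -> concat A2 (1/8) -> concat A3 (1/16).
        x = self.reduce(torch.cat([self.down(a1), a2], dim=1))
        target_view = self.reduce(torch.cat([self.down(x), a3], dim=1))
        # Image branch: B3 (1/16) up-sampled and connected with B2, then B1, giving C1 at 1/4.
        y = self.reduce(torch.cat([self.up(b3), b2], dim=1))
        c1 = self.reduce(torch.cat([self.up(y), b1], dim=1))
        # Final splice of C1 with the cost branch output (brought back to 1/4 resolution here).
        up_target = self.up(self.up(target_view))
        return torch.cat([c1, up_target], dim=1)  # target cost volume
```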
Referring to Fig. 6, in some embodiments, the three-dimensional convolution hourglass model includes an aggregation layer and a prediction layer, and step S105 includes, but is not limited to, steps S601 to S602:
Step S601, performing cost aggregation processing on the target cost volume through the aggregation layer to obtain a fused cost volume;
Step S602, performing disparity estimation on the fused cost volume through a second function of the prediction layer to obtain the estimated disparity map.
In step S601 of some embodiments, the three-dimensional convolution hourglass model includes two stacked aggregation layers with the same structure as the three-dimensional convolution network. The target cost volume is input into both aggregation layers, cost aggregation is performed on it by each aggregation layer, and the outputs of the two aggregation layers are then fused to obtain the final fused cost volume.
In step S602 of some embodiments, the second function is a soft argmin function, through which disparity can be estimated relatively accurately from the aggregated fused cost volume, yielding the estimated disparity map.
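The soft argmin named above is commonly formulated as a softmax over the negated costs along the disparity axis followed by a probability-weighted sum of candidate disparities; a sketch with assumed shapes:

```python
import torch
import torch.nn.functional as F

def soft_argmin(cost: torch.Tensor) -> torch.Tensor:
    """Differentiable disparity regression over a fused cost volume of shape N x D x H x W."""
    prob = F.softmax(-cost, dim=1)                       # low cost -> high probability
    disparities = torch.arange(cost.shape[1], dtype=cost.dtype,
                               device=cost.device).view(1, -1, 1, 1)
    return (prob * disparities).sum(dim=1)               # N x H x W estimated disparity map
```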
Referring to Fig. 7, in some embodiments, the semantic refining network includes convolution layers and a fully connected layer, and step S106 may further include, but is not limited to, steps S701 to S704:
Step S701, performing probability calculation on the first image feature through a third function of the semantic refining network to generate a semantic probability map;
Step S702, performing convolution processing on the estimated disparity map through the semantic refining network to obtain estimated disparity features;
Step S703, fusing the semantic probability map and the estimated disparity features through the semantic refining network to obtain preliminary disparity features;
Step S704, decoding the preliminary disparity features through the semantic refining network to obtain the target disparity map.
In step S701 of some embodiments, a third function, a softmax function, is preset on the fully connected layer of the semantic refining network. The softmax function performs probability calculation on the first image feature and, according to the result, creates a probability distribution over the preset semantic category labels; the semantic probability map reflects the semantic likelihood of the first image feature for the different semantic category labels.
In step S702 of some embodiments, two-dimensional convolution processing is performed on the estimated disparity map by the convolution layers of the semantic refining network, capturing the image features of the estimated disparity map to obtain the estimated disparity features.
In step S703 of some embodiments, vector multiplication is performed on the semantic probability map and the estimated disparity features according to a preset weight ratio through the convolution layers of the semantic refining network, realizing feature fusion of the semantic features and the estimated disparity features and yielding semantically weighted preliminary disparity features.
In step S704 of some embodiments, the preliminary disparity features are convolutionally decoded and up-sampled by deconvolution through the convolution layers of the semantic refining network to obtain the target disparity map, which reflects the disparity of the target image.
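Steps S701 to S704 can be sketched as follows. The class count, channel sizes, the outer-product form of the semantic weighting, the residual connection on the output, and the assumption that the estimated disparity map and the first image feature share a spatial resolution are all illustrative; the patent only specifies softmax, convolution, weighted fusion and decoding:

```python
import torch
import torch.nn as nn

class SemanticRefine(nn.Module):
    """Sketch of S701-S704: weight disparity features by semantic probabilities,
    then decode back to a refined disparity map."""
    def __init__(self, seg_channels: int = 32, n_classes: int = 19, c: int = 32):
        super().__init__()
        self.classifier = nn.Conv2d(seg_channels, n_classes, kernel_size=1)
        self.disp_conv = nn.Conv2d(1, c, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(n_classes * c, c, kernel_size=1)
        self.decode = nn.Conv2d(c, 1, kernel_size=3, padding=1)

    def forward(self, est_disp, first_image_feat):
        prob = torch.softmax(self.classifier(first_image_feat), dim=1)  # S701: semantic probability map
        disp_feat = self.disp_conv(est_disp.unsqueeze(1))               # S702: estimated disparity features
        # S703: semantic weighting as an outer product of class probabilities and disparity features.
        n, k, h, w = prob.shape
        weighted = (prob.unsqueeze(2) * disp_feat.unsqueeze(1)).reshape(n, -1, h, w)
        fused = self.fuse(weighted)                                     # preliminary disparity features
        return est_disp + self.decode(fused).squeeze(1)                 # S704: target disparity map
```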
Through steps S701 to S704, the disparity map generation method weights the disparity estimation result by semantic category using the image segmentation result and then encodes and decodes it, which improves the scene-semantic reliability of disparity estimation, enhances the stereo matching task's understanding of the scene, and uses the semantic information of the scene to improve disparity estimation in ill-posed regions, thereby improving the accuracy of disparity estimation and reducing the error of the disparity map.
In the embodiments of the present application, a target image comprising a left view and a right view is acquired. Feature extraction is performed on the left view to obtain a plurality of left view features and on the right view to obtain a plurality of right view features, so that the obtained features better meet the requirements of disparity estimation. Image segmentation processing is then performed on the left view features to obtain a first image feature; the left view features, the first image feature and the right view features are combined to obtain a target cost volume; and disparity estimation is performed on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map. Finally, semantic refinement is performed on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map, which enhances the stereo matching task's understanding of the scene, improves the accuracy of disparity estimation, and reduces the error of the disparity map.
Referring to Fig. 8, an embodiment of the present application further provides a disparity map generating apparatus that can implement the above disparity map generation method, the apparatus including:
an image acquisition module 801, configured to acquire a target image, wherein the target image comprises a left view and a right view of a target object;
a feature extraction module 802, configured to perform feature extraction on the left view to obtain a plurality of left view features and perform feature extraction on the right view to obtain a plurality of right view features;
an image segmentation module 803, configured to perform image segmentation processing on the left view features to obtain a first image feature;
a fusion module 804, configured to fuse the left view features, the first image feature and the right view features to obtain a target cost volume;
a disparity estimation module 805, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
and a semantic refining module 806, configured to perform semantic refinement on the estimated disparity map through a preset semantic refining network and the first image feature to obtain a target disparity map.
The specific implementation of the disparity map generating apparatus is substantially the same as the specific implementation of the disparity map generating method described above, and is not described herein again.
An embodiment of the present application further provides an electronic device, including: a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the above disparity map generation method. The electronic device can be any intelligent terminal, including a tablet computer, a vehicle-mounted computer and the like.
Referring to Fig. 9, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902, which may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present application are implemented in software or firmware, the relevant program code is stored in the memory 902 and called by the processor 901 to execute the disparity map generation method of the embodiments of the present application;
An input/output interface 903 for inputting and outputting information;
a communication interface 904, configured to implement communication and interaction between this device and other devices, either wired (e.g., USB or network cable) or wireless (e.g., mobile network, Wi-Fi or Bluetooth);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, memory 902, input/output interface 903, and communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively connected to each other within the device via a bus 905.
An embodiment of the present application further provides a storage medium, namely a computer-readable storage medium, storing one or more programs that are executable by one or more processors to implement the disparity map generation method.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the disparity map generation method and apparatus, the electronic device and the storage medium provided by the embodiments of the present application, a target image is acquired, where the target image includes a left view and a right view. Feature extraction is then performed on the left view to obtain a plurality of left view features and on the right view to obtain a plurality of right view features, so that the extracted features better meet the needs of disparity estimation. Next, image segmentation is performed on the left view features to obtain a first image feature; the left view features, the first image feature and the right view features are combined to obtain a target cost volume; and disparity estimation is performed on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map. Finally, semantic refinement is performed on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map. This improves the semantic plausibility of the estimated disparity, strengthens the stereo matching task's understanding of the scene, and uses scene semantics to improve disparity estimation in ill-posed regions, thereby increasing the accuracy of disparity estimation and reducing errors in the disparity map.
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present application more clearly and do not limit those solutions; it will be apparent to those skilled in the art that, as technology evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present application are equally applicable to similar technical problems.
Those skilled in the art will appreciate that the solutions shown in figs. 1-7 do not limit the embodiments of the present application, which may include more or fewer steps than those shown, combine certain steps, or include different steps.
The apparatus embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a portable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereby. Any modifications, equivalents, and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A disparity map generation method, comprising:
acquiring a target image, wherein the target image comprises a left view and a right view;
performing feature extraction on the left view to obtain a plurality of left view features, and performing feature extraction on the right view to obtain a plurality of right view features;
performing image segmentation processing on the left view feature to obtain a first image feature;
combining the left view feature, the first image feature and the right view feature to obtain a target cost volume;
performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
and performing semantic refinement on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.
2. The disparity map generation method according to claim 1, wherein the step of performing feature extraction on the left view to obtain a plurality of left view features and performing feature extraction on the right view to obtain a plurality of right view features comprises:
performing convolution processing on the left view to obtain left view convolution characteristics, and performing convolution processing on the right view to obtain right view convolution characteristics;
performing pyramid pooling on the left view convolution features according to preset multi-scale feature resolution parameters to obtain the plurality of left view features, and performing pyramid pooling on the right view convolution features according to the multi-scale feature resolution parameters to obtain the plurality of right view features.
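By way of illustration, pyramid pooling of the kind recited in claim 2 is commonly realized as in the sketch below; the pooling scales (1, 2, 4, 8) stand in for the unspecified multi-scale feature resolution parameters and are an assumption, not values taken from this application.

```python
import torch
import torch.nn.functional as F

def pyramid_pool(feat: torch.Tensor, scales=(1, 2, 4, 8)) -> list:
    """Pool a convolution feature map at several resolutions (one output per scale)."""
    h, w = feat.shape[-2:]
    outputs = []
    for s in scales:
        pooled = F.adaptive_avg_pool2d(feat, output_size=(s, s))
        # upsample back so the scales can later be fused at a common resolution
        outputs.append(F.interpolate(pooled, size=(h, w),
                                     mode="bilinear", align_corners=False))
    return outputs

conv_feat = torch.randn(1, 32, 32, 64)   # a left- or right-view convolution feature
print([t.shape for t in pyramid_pool(conv_feat)])  # four (1, 32, 32, 64) tensors
```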
3. The disparity map generation method according to claim 1, wherein the step of performing image segmentation processing on the left-view feature to obtain a first image feature includes:
performing up-sampling processing on the left view feature through a preset bilinear interpolation method to obtain a first view feature hidden variable;
performing feature sorting on the first view feature hidden variables through a preset first function to obtain a first view feature sequence;
and performing convolution processing on the first view feature sequence to obtain the first image feature.
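A rough sketch of the three steps of claim 3 follows. The claim does not name the "first function"; a channel-wise softmax is assumed here purely for illustration, as are the tensor sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def first_image_feature(left_feat, out_size=(64, 128), conv=None):
    """Claim 3, illustratively: bilinear up-sampling -> feature ordering -> convolution."""
    if conv is None:
        conv = nn.Conv2d(left_feat.shape[1], left_feat.shape[1], 3, padding=1)
    hidden = F.interpolate(left_feat, size=out_size, mode="bilinear",
                           align_corners=False)   # first view feature hidden variable
    ordered = F.softmax(hidden, dim=1)            # assumed stand-in for the 'first function'
    return conv(ordered)                          # the first image feature

print(first_image_feature(torch.randn(1, 32, 32, 64)).shape)  # (1, 32, 64, 128)
```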
4. The disparity map generation method according to claim 1, wherein the step of combining the left view feature, the first image feature and the right view feature to obtain a target cost volume includes:
grouping and combining the left view features and the right view features according to preset multi-scale feature resolution parameters to obtain an initial cost volume;
and splicing the initial cost volume and the first image feature through a preset three-dimensional convolution network to obtain the target cost volume.
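For claim 4's initial cost volume, a concatenation-based construction is the usual approach in 3D-convolution stereo networks and is sketched below; whether this application builds its cost volume in exactly this way is an assumption.

```python
import torch

def concat_cost_volume(f_left: torch.Tensor, f_right: torch.Tensor,
                       max_disp: int) -> torch.Tensor:
    """Combine left/right features into a B x 2C x D x H x W cost volume."""
    b, c, h, w = f_left.shape
    cost = f_left.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        # at candidate disparity d, left pixel x is paired with right pixel x - d
        cost[:, :c, d, :, d:] = f_left[:, :, :, d:]
        cost[:, c:, d, :, d:] = f_right[:, :, :, :w - d]
    return cost

f_l, f_r = torch.randn(1, 32, 32, 64), torch.randn(1, 32, 32, 64)
print(concat_cost_volume(f_l, f_r, max_disp=24).shape)  # (1, 64, 24, 32, 64)
```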
5. The disparity map generation method according to claim 4, wherein the step of splicing the initial cost volume and the first image feature through a preset three-dimensional convolution network to obtain the target cost volume includes:
regularizing the initial cost volume through the three-dimensional convolution network to obtain a first intermediate cost volume, and regularizing the first image feature through the three-dimensional convolution network to obtain a first intermediate image feature;
performing down-sampling on the first intermediate cost volume through the three-dimensional convolution network to obtain a second intermediate cost volume, and performing up-sampling on the first intermediate image feature to obtain a second intermediate image feature;
and splicing the second intermediate cost volume and the second intermediate image feature through the three-dimensional convolution network to obtain the target cost volume.
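The stitching of claim 5 might look roughly like the snippet below; the channel sizes, the resampling factors, and the treatment of the first image feature as a single-channel 5D volume are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

reg_cost = nn.Conv3d(64, 32, kernel_size=3, padding=1)  # regularize the initial cost volume
reg_feat = nn.Conv3d(1, 32, kernel_size=3, padding=1)   # regularize the first image feature

cost = torch.randn(1, 64, 24, 32, 64)   # initial cost volume (B x 2C x D x H x W)
feat = torch.randn(1, 1, 12, 8, 16)     # first image feature lifted to a small volume

mid_cost = F.interpolate(reg_cost(cost), scale_factor=0.5)          # second intermediate cost volume
mid_feat = F.interpolate(reg_feat(feat), size=mid_cost.shape[-3:])  # up-sampled to match
target = torch.cat([mid_cost, mid_feat], dim=1)                     # spliced target cost volume
print(target.shape)  # torch.Size([1, 64, 12, 16, 32])
```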
6. The disparity map generation method according to claim 1, wherein the three-dimensional convolution hourglass model comprises an aggregation layer and a prediction layer, and the step of performing disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain the estimated disparity map comprises:
performing cost aggregation on the target cost volume through the aggregation layer to obtain a fused cost volume;
and performing disparity estimation on the fused cost volume through a second function of the prediction layer to obtain the estimated disparity map.
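In published stereo networks of this kind, the prediction layer's "second function" is typically a softmax-based soft-argmin regression over the aggregated cost volume; the sketch below assumes that reading of claim 6 rather than quoting the application.

```python
import torch
import torch.nn.functional as F

def soft_argmin_disparity(cost: torch.Tensor) -> torch.Tensor:
    """Regress a disparity map from an aggregated cost volume of shape B x D x H x W."""
    prob = F.softmax(-cost, dim=1)   # low matching cost -> high probability
    disps = torch.arange(cost.shape[1], device=cost.device,
                         dtype=cost.dtype).view(1, -1, 1, 1)
    return (prob * disps).sum(dim=1)  # per-pixel expected disparity

fused = torch.randn(1, 24, 32, 64)   # fused cost volume after aggregation
print(soft_argmin_disparity(fused).shape)  # torch.Size([1, 32, 64])
```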
7. The disparity map generation method according to any one of claims 1 to 6, wherein the step of performing semantic refinement on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map comprises:
performing probability calculation on the first image feature through a third function of the semantic refinement network to generate a semantic probability map;
performing convolution on the estimated disparity map through the semantic refinement network to obtain estimated disparity features;
fusing the semantic probability map and the estimated disparity features through the semantic refinement network to obtain preliminary disparity features;
and decoding the preliminary disparity features through the semantic refinement network to obtain the target disparity map.
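Claim 7's refinement head could be sketched as follows; the "third function" is assumed to be a softmax over segmentation channels, and the layer shapes are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticRefine(nn.Module):
    """Illustrative semantic refinement network for claim 7 (shapes assumed)."""

    def __init__(self, seg_ch: int = 32, mid_ch: int = 16):
        super().__init__()
        self.disp_conv = nn.Conv2d(1, mid_ch, 3, padding=1)           # disparity features
        self.fuse = nn.Conv2d(seg_ch + mid_ch, mid_ch, 3, padding=1)  # fusion
        self.decode = nn.Conv2d(mid_ch, 1, 3, padding=1)              # decoding

    def forward(self, disp: torch.Tensor, seg_feat: torch.Tensor) -> torch.Tensor:
        prob = F.softmax(seg_feat, dim=1)          # semantic probability map ('third function')
        disp_feat = self.disp_conv(disp)           # estimated disparity features
        prelim = self.fuse(torch.cat([prob, disp_feat], dim=1))  # preliminary disparity features
        return self.decode(F.relu(prelim))         # target disparity map

net = SemanticRefine()
out = net(torch.randn(1, 1, 64, 128), torch.randn(1, 32, 64, 128))
print(out.shape)  # torch.Size([1, 1, 64, 128])
```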
8. A disparity map generation apparatus, comprising:
an image acquisition module, configured to acquire a target image, where the target image includes a left view and a right view;
a feature extraction module, configured to perform feature extraction on the left view to obtain a plurality of left view features, and perform feature extraction on the right view to obtain a plurality of right view features;
an image segmentation module, configured to perform image segmentation processing on the left view feature to obtain a first image feature;
a fusion module, configured to combine the left view feature, the first image feature and the right view feature to obtain a target cost volume;
a disparity estimation module, configured to perform disparity estimation on the target cost volume through a preset three-dimensional convolution hourglass model to obtain an estimated disparity map;
and a semantic refinement module, configured to perform semantic refinement on the estimated disparity map through a preset semantic refinement network and the first image feature to obtain a target disparity map.
9. An electronic device, characterized in that the electronic device comprises a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for connection and communication between the processor and the memory, wherein the program, when executed by the processor, implements the steps of the disparity map generation method according to any one of claims 1 to 7.
10. A storage medium, namely a computer-readable storage medium, characterized in that the storage medium stores one or more programs which are executable by one or more processors to implement the steps of the disparity map generation method according to any one of claims 1 to 7.
CN202210162805.0A 2022-02-22 2022-02-22 Disparity map generation method and device, electronic equipment and storage medium Pending CN114519710A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210162805.0A CN114519710A (en) 2022-02-22 2022-02-22 Disparity map generation method and device, electronic equipment and storage medium
PCT/CN2022/090665 WO2023159757A1 (en) 2022-02-22 2022-04-29 Disparity map generation method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210162805.0A CN114519710A (en) 2022-02-22 2022-02-22 Disparity map generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114519710A (en) 2022-05-20

Family

ID=81599939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210162805.0A Pending CN114519710A (en) 2022-02-22 2022-02-22 Disparity map generation method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114519710A (en)
WO (1) WO2023159757A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117747056A (en) * 2024-02-19 2024-03-22 遂宁市中心医院 Preoperative image estimation method, device and equipment for minimally invasive surgery and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839543B2 (en) * 2019-02-26 2020-11-17 Baidu Usa Llc Systems and methods for depth estimation using convolutional spatial propagation networks
CN110287964B (en) * 2019-06-13 2021-08-03 浙江大华技术股份有限公司 Stereo matching method and device
CN111696148A (en) * 2020-06-17 2020-09-22 中国科学技术大学 End-to-end stereo matching method based on convolutional neural network
CN112581517B (en) * 2020-12-16 2022-02-18 电子科技大学中山学院 Binocular stereo matching device and method
CN113763446B (en) * 2021-08-17 2024-03-29 沈阳工业大学 Three-dimensional matching method based on guide information
CN113762267B (en) * 2021-09-02 2024-03-12 北京易航远智科技有限公司 Semantic association-based multi-scale binocular stereo matching method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661429A (en) * 2022-11-11 2023-01-31 四川川锅环保工程有限公司 System and method for identifying defects of water wall tube of boiler and storage medium
CN115661429B (en) * 2022-11-11 2023-03-10 四川川锅环保工程有限公司 System and method for identifying defects of boiler water wall pipe and storage medium
CN117078984A (en) * 2023-10-17 2023-11-17 腾讯科技(深圳)有限公司 Binocular image processing method and device, electronic equipment and storage medium
CN117078984B (en) * 2023-10-17 2024-02-02 腾讯科技(深圳)有限公司 Binocular image processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023159757A1 (en) 2023-08-31

Similar Documents

Publication Publication Date Title
EP3510561B1 (en) Predicting depth from image data using a statistical model
CN114519710A (en) Disparity map generation method and device, electronic equipment and storage medium
CN112052839B (en) Image data processing method, apparatus, device and medium
EP3937124A1 (en) Image processing method, device and apparatus, and storage medium
EP3992846A1 (en) Action recognition method and apparatus, computer storage medium, and computer device
CN112396645B (en) Monocular image depth estimation method and system based on convolution residual learning
EP3872760B1 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
EP3836083B1 (en) Disparity estimation system and method, electronic device and computer program product
GB2585396A (en) Utilizing a critical edge detection neural network and a geometric model to determine camera parameters from a single digital image
JP2023533907A (en) Image processing using self-attention-based neural networks
CN113642585B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN110738241A (en) binocular stereo vision matching method based on neural network and operation frame thereof
CN113674146A (en) Image super-resolution
CN112101344B (en) Video text tracking method and device
CN114926734A (en) Solid waste detection device and method based on feature aggregation and attention fusion
Lu et al. Pyramid frequency network with spatial attention residual refinement module for monocular depth estimation
CN114781393B (en) Image description generation method and device, electronic equipment and storage medium
CN115619903A (en) Training and synthesizing method, device, equipment and medium for text image synthesis model
CN115345917A (en) Multi-stage dense reconstruction method and device for low video memory occupation
CN111369564B (en) Image processing method, model training method and model training device
CN115272608A (en) Human hand reconstruction method and equipment
CN114494021A (en) Image reconstruction method and device, electronic equipment and storage medium
KR20180136707A (en) Apparatus and method for volume rendering
CN110245553B (en) Road surface distance measuring method and device
Haris et al. An efficient super resolution based on image dimensionality reduction using accumulative intensity gradient

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination