CN117078984A - Binocular image processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN117078984A
Authority
CN
China
Prior art keywords
convolution
feature
sample
feature map
image
Prior art date
Legal status
Granted
Application number
CN202311343611.1A
Other languages
Chinese (zh)
Other versions
CN117078984B (en)
Inventor
林愉欢
汪铖杰
刘永
李嘉麟
陈颖
聂强
付威福
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311343611.1A
Publication of CN117078984A
Application granted
Publication of CN117078984B
Status: Active

Classifications

    • G06V10/75: Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; using context analysis; selection of dictionaries
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06V10/40: Extraction of image or video features
    • G06V10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82: Image or video recognition or understanding using neural networks
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a binocular image processing method and device, an electronic device and a storage medium. The method comprises the following steps: acquiring a binocular image to be processed, and performing feature extraction on the left-eye image and the right-eye image respectively to obtain a left-eye feature map and a right-eye feature map; stitching the left-eye feature map and the right-eye feature map to obtain a first feature map; performing grouping convolution processing on the first feature map based on a preset number of disparity levels to obtain a three-dimensional matching cost feature, which characterizes the matching cost of the pixels of the binocular image to be processed at each disparity level; performing cost aggregation on the three-dimensional matching cost feature based on a two-dimensional convolution network to obtain a target matching cost feature; and predicting a disparity map corresponding to the binocular image to be processed based on the target matching cost feature. The application avoids traversing disparity levels and performing dense memory access operations, which improves processing efficiency; moreover, the whole process can be realized with 2D convolution operations, so the method is easier to deploy on various NPU chips while precision is ensured.

Description

Binocular image processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a binocular image processing method, apparatus, electronic device, and storage medium.
Background
Binocular stereo matching is one of the most fundamental problems in computer vision. Its task is to construct the disparity map corresponding to a scene from binocular images of that scene, and it is widely applied in binocular vision tasks such as robot vision and autonomous driving.
In the related art, the deep-learning-based binocular stereo matching process needs to traverse every disparity level and shift the right-eye features to extract corresponding features, which then participate in correlation calculation or are concatenated. This requires intensive memory access as well as a large number of 3D convolution operations, so the related art cannot be deployed on an NPU (Neural-network Processing Unit) chip, which limits its application in embedded deployment scenarios with real-time inference requirements.
Disclosure of Invention
In order to solve the problems in the prior art, the embodiment of the invention provides a binocular image processing method, a binocular image processing device, electronic equipment and a storage medium. The technical scheme is as follows:
In one aspect, there is provided a binocular image processing method, the method comprising:
obtaining a binocular image to be processed; the binocular image to be processed comprises a left-eye image and a right-eye image;
respectively extracting features of the left eye image and the right eye image to obtain a left eye feature image and a right eye feature image;
performing splicing treatment on the left eye feature map and the right eye feature map to obtain a first feature map;
performing grouping convolution processing on the first feature map based on a preset parallax layer level number to obtain a three-dimensional matching cost feature; the three-dimensional matching cost characteristics represent the matching cost of the pixel points of the binocular image to be processed at each parallax layer level;
performing cost aggregation processing on the three-dimensional matching cost features based on a two-dimensional convolution network to obtain target matching cost features;
and predicting a parallax image corresponding to the binocular image to be processed based on the target matching cost characteristics.
In another aspect, there is provided a binocular image processing apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a binocular image to be processed; the binocular image to be processed comprises a left-eye image and a right-eye image;
the feature extraction module is used for extracting features of the left-eye image and the right-eye image respectively to obtain a left-eye feature image and a right-eye feature image;
The characteristic splicing module is used for carrying out splicing treatment on the left-eye characteristic diagram and the right-eye characteristic diagram to obtain a first characteristic diagram;
the three-dimensional matching cost determining module is used for carrying out grouping convolution processing on the first feature map based on the preset parallax layer level number to obtain three-dimensional matching cost features; the three-dimensional matching cost characteristics represent the matching cost of the pixel points of the binocular image to be processed at each parallax layer level;
the cost aggregation module is used for conducting cost aggregation processing on the three-dimensional matching cost features based on a two-dimensional convolution network to obtain target matching cost features;
and the disparity map predicting module is used for predicting the disparity map corresponding to the binocular image to be processed based on the target matching cost characteristics.
In an exemplary embodiment, the three-dimensional matching cost determining module includes:
the first convolution processing module is used for carrying out convolution processing on the feature images of the corresponding channels based on first convolution cores corresponding to the channels of the first feature images respectively to obtain a plurality of channel convolution feature images;
the second convolution processing module is used for carrying out convolution processing of space dimensions on the plurality of channel convolution feature graphs based on a first number of second convolution kernels to obtain a first number of first convolution feature graphs; the first number is twice the number of preset parallax layer stages;
The feature stacking module is used for stacking the first convolution feature images of the first number to obtain a second feature image;
the third convolution processing module is used for respectively carrying out convolution processing on the second feature images based on a second number of third convolution kernels to obtain the second number of second convolution feature images; the second number is the preset parallax layer number;
wherein the second number of second convolution feature maps constitute the three-dimensional matching cost feature.
In an exemplary embodiment, the third convolution processing module includes:
a fourth convolution module, configured to take the second feature map as a current feature map, and sequentially perform depth convolution and point-by-point convolution on the current feature map to obtain a third feature map;
the updating module is used for updating the current feature map into the third feature map, and executing the steps of sequentially carrying out deep convolution and point-by-point convolution on the current feature map until the updating times reach the preset times and stopping the updating;
and a fifth convolution module, configured to perform convolution processing on the third feature graphs when the update is stopped based on the second number of third convolution kernels, so as to obtain the second number of second convolution feature graphs.
In an exemplary embodiment, the fifth convolution module is specifically configured to: dividing the channels of the third feature map when the updating is stopped into the channel feature map groups of the second number; and respectively carrying out convolution processing on the second number of channel characteristic image groups based on the second number of third convolution kernels, and obtaining the second number of second convolution characteristic images based on the result of the convolution processing.
In an exemplary embodiment, the feature extraction module includes:
the first feature extraction submodule is used for carrying out feature extraction on the left-eye image to obtain a left-eye feature image of a first scale;
the second feature extraction sub-module is used for extracting features of the right eye image to obtain a right eye feature image of the first scale;
the downsampling module is used for downsampling the left eye feature map of the first scale to obtain a downsampled left eye feature map of at least one scale;
correspondingly, the characteristic splicing module is specifically configured to: and performing splicing treatment on the left eye feature map of the first scale and the right eye feature map of the first scale to obtain a first feature map.
In an exemplary embodiment, the cost aggregation module is specifically configured to: respectively inputting the left eye feature map of the first scale, the downsampled left eye feature map of each scale and the three-dimensional matching cost feature as input features into a two-dimensional hourglass network, and carrying out convolution processing on each input feature by the two-dimensional hourglass network to obtain a convolution result corresponding to each input feature; and fusing convolution results corresponding to the input features to obtain target matching cost features.
In an exemplary embodiment, the apparatus further comprises a training module comprising:
the sample image acquisition module is used for acquiring a sample binocular image and a standard parallax image corresponding to the sample binocular image; the sample binocular image comprises a sample left-eye image and a sample right-eye image;
the sample feature extraction module is used for respectively carrying out feature extraction on the sample left-eye image and the sample right-eye image based on a feature extraction network of the binocular image processing model to obtain a sample left-eye feature image and a sample right-eye feature image;
the sample characteristic splicing module is used for carrying out splicing treatment on the sample left-eye characteristic diagram and the sample right-eye characteristic diagram to obtain a first sample characteristic diagram;
The sample three-dimensional matching cost determining module is used for inputting the first sample feature map into a grouping convolution network of the binocular image processing model, and the grouping convolution network performs grouping convolution processing on the first sample feature map based on a preset parallax layer level number to obtain sample three-dimensional matching cost features; the three-dimensional matching cost characteristics of the sample represent the matching cost of the pixel points of the binocular image of the sample at each parallax layer level;
the sample cost aggregation module is used for carrying out cost aggregation processing on the sample three-dimensional matching cost characteristics based on the two-dimensional convolution network of the binocular image processing model to obtain sample target matching cost characteristics;
and the training sub-module is used for predicting a sample parallax image corresponding to the sample binocular image based on the sample target matching cost characteristic, and adjusting model parameters of the binocular image processing model based on the difference between the sample parallax image and the standard parallax image until a preset training ending condition is reached.
In one exemplary embodiment, the grouping convolution network includes a first grouping convolution network and a second grouping convolution network; the first grouping convolution network comprises a first convolution network and a second convolution network; the number of first convolution kernels in the first convolution network is consistent with the channel number of the first sample feature map; the number of second convolution kernels in the second convolution network is twice the preset number of parallax layer stages; and the number of third convolution kernels in the second grouping convolution network is the preset number of parallax layer stages;
The sample three-dimensional matching cost determining module is specifically configured to, when executing the grouping convolution processing on the first sample feature map by the grouping convolution network based on a preset parallax layer level number to obtain a sample three-dimensional matching cost feature: performing convolution processing on the feature images of the corresponding channels based on first convolution cores in the first convolution network, which correspond to the channels of the first sample feature images respectively, so as to obtain a plurality of sample channel convolution feature images; performing space dimension convolution processing on the plurality of sample channel convolution feature graphs based on each second convolution kernel in the second convolution network to obtain a first sample convolution feature graph corresponding to each second convolution kernel; stacking the first sample convolution feature graphs corresponding to the second convolution kernels to obtain second sample feature graphs; respectively carrying out convolution processing on the second sample feature images based on each third convolution kernel in the second grouping convolution network to obtain second sample convolution feature images of the preset parallax layer series; and the second sample convolution feature map of the preset parallax layer series forms the three-dimensional matching cost feature.
In an exemplary embodiment, the grouping convolution network further includes a plurality of cascaded third grouping convolution networks, and the sample three-dimensional matching cost determining module is specifically configured to, when performing the convolution processing on the second sample feature map based on each third convolution kernel in the second grouping convolution network to obtain the second sample convolution features of the preset parallax layer number: sequentially convolve the second sample feature map based on the plurality of cascaded third grouping convolution networks, each third grouping convolution network sequentially performing depth convolution and point-by-point convolution on its current input; take the convolution processing result of the last third grouping convolution network as the input of the second grouping convolution network; and convolve that result with each third convolution kernel in the second grouping convolution network respectively, to obtain the second sample convolution feature maps of the preset parallax layer series.
In an exemplary embodiment, the sample three-dimensional matching cost determining module is specifically configured to, when performing the convolution processing of each third convolution kernel in the second grouping convolution network on the convolution processing result to obtain the second sample convolution feature maps of the preset parallax layer progression: divide the channels of the convolution processing result based on the number of third convolution kernels in the second grouping convolution network to obtain a plurality of sample channel feature map groups; input the sample channel feature map groups to the corresponding third convolution kernels respectively, each third convolution kernel convolving the sample channel feature map group it receives; and obtain the second sample convolution feature maps of the preset parallax layer series based on the results of the convolution processing of each third convolution kernel.
In another aspect, there is provided an electronic device comprising a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or the at least one program being loaded and executed by the processor to implement the binocular image processing method of any of the above aspects.
In another aspect, a computer readable storage medium having stored therein at least one instruction or at least one program loaded and executed by a processor to implement a binocular image processing method as any of the above aspects is provided.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the electronic device performs the binocular image processing method of any of the above aspects.
According to the embodiment of the application, the first characteristic diagram is obtained by splicing the left-eye characteristic diagram and the right-eye characteristic diagram, the first characteristic diagram is subjected to grouping convolution processing based on the preset parallax layer progression to obtain the three-dimensional matching cost characteristic, the three-dimensional matching cost characteristic is subjected to cost aggregation processing based on the two-dimensional convolution network to obtain the target matching cost characteristic, and the parallax diagram corresponding to the binocular image to be processed is predicted based on the target matching cost characteristic, so that the parallax layer traversing and the dense memory access operation are avoided, the binocular image processing efficiency is improved, in addition, the whole processing can be realized based on the 2D convolution operation, the matching precision is ensured, and the deployment on various NPU chips is easier, and the real-time reasoning requirement under the embedded deployment scene is met.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a binocular image processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another binocular image processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart of another binocular image processing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a binocular image processing model according to an embodiment of the present invention;
fig. 5 is a block diagram of a binocular image processing apparatus according to an embodiment of the present invention;
fig. 6 is a block diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is involved, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
The deep learning-based binocular stereo matching process in the related art can be classified into the following three types for cost volume (cost volume) construction: a correlation volume (correlation volume), a connection volume (concatenation volume) and a combined volume (combined volume).
The correlation volume is a cost volume obtained by traversing each disparity level and performing a correlation calculation between the left-eye and right-eye feature maps; the correlation can be any similarity measure, such as Euclidean distance, vector dot product, or cosine similarity. Thus, one correlation map is obtained for each disparity level. Assuming the disparity range is $D$, and the left-eye feature map $f_L$ and right-eye feature map $f_R$ both have size $C \times H \times W$, the correlation volume has size $D \times H \times W$, as shown in the following formula (1), where the correlation is computed by dot product:

$$C_{\mathrm{corr}}(d, x, y) = f_L(x, y) \cdot f_R(x - d, y) \tag{1}$$
The concatenation volume is a cost volume obtained by traversing each disparity value and concatenating the left-eye and right-eye feature maps. Assuming the disparity range is $D$, and the left-eye feature map $f_L$ and right-eye feature map $f_R$ both have size $C \times H \times W$, the resulting concatenation volume has size $2C \times D \times H \times W$, as shown in the following formula (2):

$$C_{\mathrm{concat}}(d, x, y) = \mathrm{concat}\big(f_L(x, y),\, f_R(x - d, y)\big) \tag{2}$$
In these two construction modes, the correlation volume directly adopts correlation calculation and thus yields direct similarity information, but information loss is severe because there is only one channel at each disparity level. In contrast, the concatenation volume has $2C$ channels at each disparity level, so information is retained sufficiently; however, because no correlation calculation is performed, more convolution layers are subsequently needed to learn that part of the content.
The combined volume is a cost volume constructed by combining the two construction modes above, thereby exploiting, to a certain extent, the advantages of both the correlation volume and the concatenation volume.
However, all three construction methods above need to traverse every disparity level; during construction, the right-eye feature map must be shifted to extract the corresponding visual features, which then participate in correlation calculation or are concatenated. Because the corresponding features must be extracted with shifts, dense memory access operations are inevitable in the implementation, which is exactly what NPU chips are worst at and greatly increases inference time. Second, while the concatenation volume and combined volume can avoid information loss, the cost volume they form is a 4-dimensional tensor of size $2C \times D \times H \times W$, as described above, so 3D convolution operations are required in the subsequent cost aggregation, and most NPU chips cannot support 3D convolution. For these two reasons, the application of the related art is limited in embedded deployment scenarios with real-time inference requirements.
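For reference, the following is a minimal PyTorch sketch of the two related-art constructions described above; the function names and shapes are illustrative assumptions rather than the patent's code, and the explicit per-level loop makes the disparity traversal and shifted reads visible:

```python
import torch

def correlation_volume(f_left: torch.Tensor, f_right: torch.Tensor,
                       max_disp: int) -> torch.Tensor:
    """(C, H, W) features -> (D, H, W) correlation volume, as in formula (1)."""
    C, H, W = f_left.shape
    volume = f_left.new_zeros(max_disp, H, W)
    for d in range(max_disp):                  # per-level traversal
        if d == 0:
            volume[0] = (f_left * f_right).sum(dim=0)
        else:                                  # shifted read of right-eye features
            volume[d, :, d:] = (f_left[:, :, d:] * f_right[:, :, :-d]).sum(dim=0)
    return volume

def concatenation_volume(f_left: torch.Tensor, f_right: torch.Tensor,
                         max_disp: int) -> torch.Tensor:
    """(C, H, W) features -> (2C, D, H, W) 4D tensor, as in formula (2); this
    shape is what forces 3D convolutions in the subsequent aggregation."""
    C, H, W = f_left.shape
    volume = f_left.new_zeros(2 * C, max_disp, H, W)
    for d in range(max_disp):
        volume[:C, d, :, d:] = f_left[:, :, d:]
        volume[C:, d, :, d:] = f_right[:, :, :-d] if d > 0 else f_right
    return volume
```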
In view of this, the embodiment of the application provides a binocular image processing method, which avoids traversing parallax hierarchy and performing dense memory access operation, thereby improving reasoning efficiency; in addition, the obtained cost volume is a three-dimensional tensor, the aggregation processing can be performed by using a 2D convolution operation, and the limitation of the 3D convolution operation is avoided, so that the binocular image processing method provided by the embodiment of the application can be applied to binocular vision tasks in embedded equipment, such as robot vision, automatic driving and the like.
The embedded device may be an electronic device, and the functions implemented by the method may be implemented by a processor in the electronic device invoking program code; of course, the program code may be stored in a computer storage medium and may be implemented using programming languages such as C++ and Python, and training and inference of the binocular image processing model in the implementation of the present application may rely on a deep learning framework such as TensorFlow or PyTorch.
It should be noted that the embodiments of the present application may be applied to various scenarios, including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, driving assistance, and the like.
Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer Vision (CV) is a science of studying how to "look" a machine, and more specifically, to replace a human eye with a camera and a Computer to perform machine Vision such as recognition and measurement on a target, and further perform graphic processing to make the Computer process an image more suitable for human eye observation or transmission to an instrument for detection. As a scientific discipline, computer vision research-related theory and technology has attempted to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Fig. 1 is a schematic flow chart of a binocular image processing method according to an embodiment of the invention. It is noted that the present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. In actual system or product execution, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment). As shown in fig. 1, the method may include:
s101, acquiring a binocular image to be processed, wherein the binocular image to be processed comprises a left-eye image and a right-eye image.
Specifically, the binocular image to be processed may be left and right view images of the same scene shot by the binocular camera, where the left view image may be a left view image, and the right view image may be a right view image.
In a specific implementation, the binocular image to be processed may be obtained by directly shooting the binocular image under the same scene through a binocular camera, or may be obtained by obtaining a pre-stored binocular image from a binocular image library as the binocular image to be processed, or may be obtained by receiving the binocular image sent by other electronic devices as the binocular image to be processed, or may be obtained by downloading the binocular image from a network as the binocular image to be processed.
And S103, respectively extracting the characteristics of the left eye image and the right eye image to obtain a left eye characteristic diagram and a right eye characteristic diagram.
The left-eye feature map may be a shallow feature representation obtained by learning the left-eye image with two-dimensional convolution operations, and the right-eye feature map may be a shallow feature representation obtained by learning the right-eye image with two-dimensional convolution operations. In a specific implementation, the left-eye image and the right-eye image may be convolved by a fully convolutional neural network with shared parameters (including weight sharing) to obtain a left-eye feature map and a right-eye feature map of the same scale, for example, a left-eye feature map $f_L \in \mathbb{R}^{C \times H \times W}$ and a right-eye feature map $f_R \in \mathbb{R}^{C \times H \times W}$, where $C$ denotes the number of channels of the feature map, $W$ its width, and $H$ its height.
In practical applications, the imaging planes of the cameras corresponding to the left-eye image and the right-eye image in the binocular image to be processed may not be aligned in parallel, so before the feature extraction is performed on the left-eye image and the right-eye image, the left-eye image and the right-eye image need to be subjected to stereo correction to transform the left-eye image and the right-eye image into a space with parallel optical axes, for example, the left-eye image and the right-eye image may be subjected to rotation operation through a rotation matrix so as to transform the left-eye image and the right-eye image into a space with parallel optical axes.
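A minimal sketch of such a shared-parameter (siamese) extractor is shown below; the layer depth, channel counts, and image sizes are illustrative assumptions, not the patent's configuration:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Shallow 2D-convolutional feature extractor; widths are illustrative."""
    def __init__(self, in_ch: int = 3, out_ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

extractor = FeatureExtractor()            # one module => weights are shared
left = torch.randn(1, 3, 256, 512)        # rectified left-eye image
right = torch.randn(1, 3, 256, 512)       # rectified right-eye image
f_left, f_right = extractor(left), extractor(right)   # both (1, C, H, W)
```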
And S105, performing splicing processing on the left eye feature map and the right eye feature map to obtain a first feature map.
The stitching process may be stacking the left-eye feature map and the right-eye feature map in the channel dimension; for example, if the left-eye feature map is $f_L \in \mathbb{R}^{C \times H \times W}$ and the right-eye feature map is $f_R \in \mathbb{R}^{C \times H \times W}$, the first feature map obtained by the stitching process is $F \in \mathbb{R}^{2C \times H \times W}$.
And S107, carrying out grouping convolution processing on the first feature map based on the preset parallax layer level number to obtain a three-dimensional matching cost feature.
The preset number of parallax layers refers to the number of all parallax layers in a preset parallax range, and generally the preset number of parallax layers can be represented by the preset parallax range.
The three-dimensional matching cost feature characterizes the matching cost of the pixels of the binocular image to be processed at each disparity level, and it is a 3-dimensional tensor, which can be expressed as $\mathcal{C} \in \mathbb{R}^{D \times H \times W}$. In practical applications, the three-dimensional matching cost feature may also be called a cost volume; in the embodiment of the present application, it is a three-dimensional matrix of size $D \times H \times W$ that stores, for each pixel, the matching cost value at each disparity level within the preset disparity range.
The grouping number of the grouping convolution processing is determined based on the preset parallax layer number.
In some exemplary embodiments, as shown in fig. 2, the step S107 may include, when implemented:
s201, carrying out convolution processing on the feature graphs of the corresponding channels based on first convolution cores corresponding to the channels of the first feature graph respectively, and obtaining a plurality of channel convolution feature graphs.
It should be noted that, in the convolution processing process of one convolution kernel in the embodiment of the present application, a scanning window with the same size as the convolution kernel may slide on the feature map according to a target step length, and sequentially scan each area of the feature map, where the target step length may be set based on needs. Specifically, in the process of convolution operation, when the scanning window of the convolution kernel slides to any region of the feature map, the electronic device reads the values corresponding to each feature point in the region, performs dot multiplication operation on the convolution kernel and the values corresponding to each feature point, and then accumulates each product, and takes the accumulated result as a feature point. And then, sliding the scanning window of the convolution kernel to the next region of the feature map according to the target step length, performing convolution operation again, outputting a feature point until all regions of the feature map are scanned, and forming all the output feature points into a new feature map for output.
Therefore, the first convolution check performs convolution processing on the feature images of the corresponding channels in the first feature images, so that the transverse relative offset between the left-eye feature image and the right-eye feature image can be implicitly achieved.
S203, carrying out space dimension convolution processing on the plurality of channel convolution feature graphs based on the first number of second convolution kernels to obtain the first number of first convolution feature graphs.
Wherein the first number is twice the number of preset parallax layer stages.
Specifically, each second convolution kernel simultaneously carries out convolution processing of space dimensions on a plurality of channel convolution feature graphs, and the convolution processing result is a first convolution feature graph, so that information interaction among channels is realized.
And S205, stacking the first convolution feature images of the first number to obtain a second feature image.
S207, carrying out convolution processing on the second feature graphs based on a second number of third convolution kernels to obtain the second number of second convolution feature graphs.
The second number is the number of the preset parallax layer series, and the second number of the second convolution feature diagrams form the three-dimensional matching cost feature.
Specifically, the first convolution kernels may be 3×3 convolution kernels. For example, if the first feature map is $F \in \mathbb{R}^{2C \times H \times W}$, i.e. the number of channels of the first feature map is $2C$, then each channel corresponds to one 3×3 convolution kernel, so there are $2C$ first convolution kernels, each of which convolves the feature map of its corresponding channel, thereby obtaining $2C$ channel convolution feature maps.
Assume the preset number of disparity levels is $D$. Then $2D$ second convolution kernels are used to perform spatial convolution on the $2C$ channel convolution feature maps, i.e. each second convolution kernel simultaneously convolves all $2C$ channel convolution feature maps, thereby obtaining $2D$ first convolution feature maps and realizing information interaction among channels. The second convolution kernels may be 1×1 convolution kernels.
Next, the $2D$ first convolution feature maps are stacked into a second feature map $F' \in \mathbb{R}^{2D \times H \times W}$ with channel number $2D$.
Next, $D$ third convolution kernels are used to convolve the second feature map $F'$ respectively, obtaining $D$ second convolution feature maps; each second convolution feature map is the output of one third convolution kernel convolving the second feature map $F'$. The third convolution kernels may be 1×1 convolution kernels.
Further, the three-dimensional matching cost feature formed by the $D$ second convolution feature maps can be expressed as $\mathcal{C} \in \mathbb{R}^{D \times H \times W}$.
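The construction above maps naturally onto standard 2D convolution primitives. The following PyTorch sketch assumes the kernel sizes named in the text (3×3 first kernels, 1×1 second and third kernels); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class GroupConvCostVolume(nn.Module):
    def __init__(self, feat_ch: int, max_disp: int):
        super().__init__()
        in_ch = 2 * feat_ch       # channels of the concatenated first feature map
        mid_ch = 2 * max_disp     # the "first number": twice the disparity levels
        # one 3x3 first kernel per channel (depthwise); this implicitly realizes
        # the lateral relative offset between left- and right-eye features
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        # 2D second kernels of size 1x1, each mixing all channels spatially
        self.pointwise = nn.Conv2d(in_ch, mid_ch, 1)
        # D third kernels, each convolving its own group of channels
        self.grouped = nn.Conv2d(mid_ch, max_disp, 1, groups=max_disp)

    def forward(self, first_feature_map: torch.Tensor) -> torch.Tensor:
        x = self.depthwise(first_feature_map)  # 2C channel convolution maps
        x = self.pointwise(x)                  # 2D first convolution maps (stacked)
        return self.grouped(x)                 # (B, D, H, W) 3D cost feature

first = torch.randn(1, 64, 64, 128)            # concatenated f_L and f_R, 2C = 64
cost = GroupConvCostVolume(feat_ch=32, max_disp=48)(first)
print(cost.shape)                              # torch.Size([1, 48, 64, 128])
```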
In some exemplary embodiments, the step S207 may include, when implemented:
taking the second feature map as a current feature map, and sequentially carrying out depth convolution and point-by-point convolution on the current feature map to obtain a third feature map;
updating the current feature map to the third feature map, and executing the steps of sequentially carrying out depth convolution and point-by-point convolution on the current feature map until the updating times reach preset times to stop updating;
and respectively carrying out convolution processing on the third feature images when the updating is stopped based on the second number of third convolution kernels to obtain the second number of second convolution feature images.
Specifically, the preset number of times may be set based on actual needs, for example, the preset number of times may be 3 times.
In the embodiment, the depth convolution and the point-by-point convolution are performed on the second feature map for a plurality of times, so that the interaction degree of information in the channel dimension and the space dimension can be improved, the precision of three-dimensional matching cost features is improved, and the prediction accuracy of the parallax map is improved.
In some exemplary embodiments, the convolving the third feature map when the update is stopped based on a second number of third convolution kernels, respectively, and the obtaining the second number of second convolution feature maps may be: dividing the channels of the third feature map when the updating is stopped into the channel feature map groups of the second number; and respectively carrying out convolution processing on the second number of channel characteristic image groups based on the second number of third convolution kernels, and obtaining the second number of second convolution characteristic images based on the result of the convolution processing.
Exemplarily, after the preset number of depth convolutions and point-by-point convolutions, the second feature map $F' \in \mathbb{R}^{2D \times H \times W}$ described above yields a third feature map $F'' \in \mathbb{R}^{2D \times H \times W}$ when updating stops. According to the second number (i.e. the preset number of disparity levels $D$), the $2D$ channels of the third feature map $F''$ are divided into $D$ channel feature map groups. Next, $D$ third convolution kernels convolve the $D$ channel feature map groups respectively, i.e. each third convolution kernel corresponds to one channel feature map group and convolves it, producing $D$ convolution results, from which the three-dimensional matching cost feature is obtained. For example, the $D$ convolution results may directly form the three-dimensional matching cost feature; alternatively, a further $D$ third convolution kernels, in one-to-one correspondence with the $D$ results, may each convolve the corresponding result, and the final $D$ convolution results then form the three-dimensional matching cost feature.
In the embodiment, the plurality of channels of the third feature map are divided into the channel feature map sets with the preset parallax layer series when updating is stopped, and then the convolution check of the preset parallax layer series is adopted to independently carry out convolution processing on each channel feature map set so as to obtain the three-dimensional matching cost feature, so that the precision of the three-dimensional matching cost feature is further improved.
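The repeated depth/point-by-point refinement and the final group-wise convolution described above can be expressed as a small stack of 2D layers. A sketch, assuming a preset count of three repetitions and 1×1 point-by-point kernels (both assumptions of this sketch):

```python
import torch.nn as nn

def refined_grouped_head(mid_ch: int, max_disp: int, repeats: int = 3) -> nn.Sequential:
    """mid_ch is assumed to be 2 * max_disp, matching the second feature map."""
    layers = []
    for _ in range(repeats):  # repeat until the preset update count is reached
        layers += [
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch),  # depth conv
            nn.Conv2d(mid_ch, mid_ch, 1),                            # point-by-point conv
        ]
    # split the 2D channels into D groups and convolve each group independently
    layers.append(nn.Conv2d(mid_ch, max_disp, 1, groups=max_disp))
    return nn.Sequential(*layers)
```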
And S109, performing cost aggregation processing on the three-dimensional matching cost features based on a two-dimensional convolution network to obtain target matching cost features.
The two-dimensional convolution network may be a two-dimensional hourglass network (Hourglass Network) that may be composed of a stack of a plurality of hourglass modules, each of the hourglass modules being composed of a downsampling module, an upsampling module, and a residual module. The downsampling module reduces input features through convolution and pooling operations, so that a low-resolution feature map with stronger semantic information is obtained. The up-sampling module amplifies the low-resolution feature map to the original resolution and fuses the low-resolution feature map with the corresponding high-resolution input features, so that richer feature information is obtained. The residual error module is used for improving the depth of the network and further improving the characteristic representation capability.
Specifically, in order to make the matching cost values reflect the correlation among pixels more accurately, cost aggregation processing of the three-dimensional matching cost feature based on a two-dimensional convolution network can establish relationships between adjacent pixels; these relationships are then used to improve the accuracy of the matching cost values, yielding a more accurate and optimized target matching cost feature. Assuming the three-dimensional matching cost feature is $\mathcal{C} \in \mathbb{R}^{D \times H \times W}$, the target matching cost feature may be expressed as $\tilde{\mathcal{C}} \in \mathbb{R}^{D \times H \times W}$.
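A minimal single-module sketch of such a two-dimensional hourglass is given below; real aggregation networks stack several such modules, and all layer choices here are illustrative assumptions. Note that only 2D convolutions are involved:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass2D(nn.Module):
    """One hourglass module: downsample, process, upsample, residual fusion."""
    def __init__(self, ch: int):
        super().__init__()
        self.down = nn.Conv2d(ch, ch, 3, stride=2, padding=1)         # encoder
        self.mid = nn.Conv2d(ch, ch, 3, padding=1)
        self.up = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)  # decoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.relu(self.down(x))
        y = F.relu(self.mid(y))
        return x + self.up(y)  # fuse with the high-resolution input (residual)

refined = Hourglass2D(ch=48)(torch.randn(1, 48, 64, 128))  # 2D convolutions only
```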
S111, predicting a parallax map corresponding to the binocular image to be processed based on the target matching cost characteristics.
The size of the parallax image is consistent with the size of the binocular image to be processed, and the pixel value of the pixel point in the parallax image records the parallax value corresponding to the pixel point in the binocular image to be processed.
Specifically, a weighted average over the disparity levels, weighted by the probability of each level, can be determined for each pixel based on the target matching cost feature, and this weighted average is then taken as the disparity value of the pixel, yielding the disparity map. The weighted average for each pixel can be calculated by the following formula (3):

$$\hat{d} = \sum_{d=0}^{D_{\max}-1} d \cdot p(d) \tag{3}$$

where $d$ is a natural number greater than or equal to 0 and less than $D_{\max}$; $D_{\max}$ is the maximum disparity level corresponding to the preset disparity range; and $p(d)$ denotes the probability corresponding to disparity level $d$, which can be obtained by processing the target matching cost feature with a softmax function.
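Formula (3) corresponds to a soft-argmax over the $D$ cost channels. A sketch, assuming larger values of the target matching cost feature indicate better matches (the sign convention inside the softmax is not fixed by the text):

```python
import torch

def soft_argmax_disparity(cost: torch.Tensor) -> torch.Tensor:
    """cost: (B, D, H, W) target matching cost feature -> (B, H, W) disparities."""
    B, D, H, W = cost.shape
    prob = torch.softmax(cost, dim=1)  # p(d) for every pixel, via softmax
    levels = torch.arange(D, dtype=cost.dtype, device=cost.device).view(1, D, 1, 1)
    return (prob * levels).sum(dim=1)  # the weighted average of formula (3)
```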
According to the embodiment of the application, a full convolution mode is adopted, the first feature map is subjected to grouping convolution processing based on the preset parallax layer progression to obtain the three-dimensional matching cost feature, then the three-dimensional matching cost feature is subjected to cost aggregation processing based on the two-dimensional convolution network to obtain the target matching cost feature, and the parallax map corresponding to the binocular image to be processed is predicted based on the target matching cost feature, so that the parallax layer traversing operation and the dense memory access operation are avoided, the binocular image processing efficiency is improved, in addition, the whole processing can be realized based on the 2D convolution operation, the matching precision is ensured, and the deployment on various NPU chips is easier, and the real-time reasoning requirement under an embedded deployment scene is met.
In some exemplary embodiments, in order to improve the prediction accuracy of the disparity map, as shown in fig. 3, the step S103 may include:
and S301, extracting the characteristics of the left-eye image to obtain a left-eye characteristic diagram of a first scale.
And S303, extracting the characteristics of the right eye image to obtain a right eye characteristic image of the first scale.
And S305, performing downsampling processing on the left eye feature map of the first scale to obtain a downsampled left eye feature map of at least one scale.
The first scale may be set based on actual needs, and may be the same size as or different from the binocular image to be processed.
In a specific implementation, the left eye feature map of the first scale can be downsampled into a feature map of 1/2 resolution and a feature map of 1/4 resolution, so that two downsampled left eye feature maps with smaller scales can be obtained.
Based on this, the step S105 may be implemented as follows: and performing splicing treatment on the left eye feature map of the first scale and the right eye feature map of the first scale to obtain a first feature map.
Based on this, with continued reference to fig. 3, the step S109 may include, when performing cost aggregation processing on the three-dimensional matching cost feature based on the two-dimensional convolution network, to obtain a target matching cost feature:
S307, the left eye feature map of the first scale, the downsampled left eye feature map of each scale and the three-dimensional matching cost feature are respectively input into a two-dimensional hourglass network as input features, and the two-dimensional hourglass network carries out convolution processing on each input feature to obtain a convolution result corresponding to each input feature.
S309, fusing convolution results corresponding to the input features to obtain target matching cost features.
The fusing of the convolution results corresponding to the input features may be directly adding the convolution results corresponding to the input features, so as to obtain the target matching cost feature.
In the embodiment, when the cost aggregation processing is carried out on the three-dimensional matching cost features based on the two-dimensional convolution network, the information of different depths of the left-eye image is fused, so that wider receptive fields and semantic features can be provided, the cost aggregation is better facilitated, and the accuracy of the target matching cost features is improved.
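A sketch of the additive fusion in S307-S309; resizing all branch outputs to a common resolution and assuming they share a channel count before the direct addition are assumptions of this sketch, not requirements stated in the text:

```python
import torch
import torch.nn.functional as F

def fuse_branch_outputs(branch_outputs, size):
    """branch_outputs: convolution results for the first-scale left-eye feature
    map, each downsampled left-eye feature map, and the 3D matching cost
    feature; size: (H, W) of the target matching cost feature."""
    resized = [F.interpolate(o, size=size, mode="bilinear", align_corners=False)
               for o in branch_outputs]
    return torch.stack(resized).sum(dim=0)  # direct addition of the results
```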
The binocular image processing method of the embodiment of the application can be realized based on a binocular image processing model, and the training of the binocular image processing model comprises the following steps:
obtaining a sample binocular image and a standard parallax image corresponding to the sample binocular image, wherein the sample binocular image comprises a sample left-eye image and a sample right-eye image;
Respectively carrying out feature extraction on the sample left-eye image and the sample right-eye image based on a feature extraction network of the binocular image processing model to obtain a sample left-eye feature map and a sample right-eye feature map;
splicing the sample left-eye feature map and the sample right-eye feature map to obtain a first sample feature map;
inputting the first sample feature map into a grouping convolution network of the binocular image processing model, and performing grouping convolution processing on the first sample feature map by the grouping convolution network based on a preset parallax layer level number to obtain sample three-dimensional matching cost features; the three-dimensional matching cost characteristic of the sample characterizes the matching cost of the pixel points of the binocular image of the sample at each parallax layer level.
And carrying out cost aggregation treatment on the three-dimensional matching cost features of the sample based on the two-dimensional convolution network of the binocular image processing model to obtain target matching cost features of the sample.
And predicting a sample parallax image corresponding to the sample binocular image based on the sample target matching cost characteristic, and adjusting model parameters of the binocular image processing model based on the difference between the sample parallax image and the standard parallax image until a preset training ending condition is reached.
The standard parallax image corresponding to the sample binocular image is used for recording the real parallax value corresponding to the pixel point in the sample binocular image.
The feature extraction network in the binocular image processing model may be a convolutional neural network that shares parameters.
Specifically, when the model parameters of the binocular image processing model are adjusted based on the difference between the sample parallax map and the standard parallax map, a loss value may be determined from that difference using a preset loss function; the model parameters of the binocular image processing model are then reversely adjusted based on the loss value, and iterative training continues with the binocular image processing model after the parameter adjustment. The preset loss function may be a smooth L1 loss function, and the loss value may be expressed as the following formula (4):
$$L=\frac{1}{N}\sum_{i=1}^{N}\operatorname{smooth}_{L_1}\left(d_i-\hat{d}_i\right)\qquad(4)$$

where $N$ represents the number of pixel points; $d_i$ represents the standard disparity value corresponding to pixel point $i$ in the standard parallax map; $\hat{d}_i$ represents the predicted disparity value corresponding to pixel point $i$ in the sample parallax map; and $\operatorname{smooth}_{L_1}(\cdot)$ denotes the smooth L1 function:

$$\operatorname{smooth}_{L_1}(x)=\begin{cases}0.5\,x^{2}, & \text{if } |x|<1\\ |x|-0.5, & \text{otherwise.}\end{cases}$$
The preset training ending condition may be that the iteration number reaches a preset iteration number, the loss value reaches a preset loss threshold, or the difference between the loss values of two adjacent iterations reaches a preset loss difference threshold.
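As a hedged illustration of this training procedure, the following PyTorch-style sketch combines formula (4) (via F.smooth_l1_loss, which implements the smooth L1 loss) with the three training-ending conditions just described; the function name, the data-loader interface, and the threshold variables are assumptions for illustration only:

```python
import torch.nn.functional as F

def train(model, optimizer, data_loader, max_iters, loss_eps, diff_eps):
    """Illustrative loop: formula (4) plus the three training-ending conditions."""
    prev_loss, it = float("inf"), 0
    for left, right, gt_disp in data_loader:
        pred = model(left, right)                 # sample disparity map
        loss = F.smooth_l1_loss(pred, gt_disp)    # formula (4), smooth L1 loss
        optimizer.zero_grad()
        loss.backward()                           # reverse adjustment via backprop
        optimizer.step()
        it += 1
        if it >= max_iters:                       # iteration count reached
            break
        if loss.item() <= loss_eps:               # loss threshold reached
            break
        if abs(prev_loss - loss.item()) <= diff_eps:  # adjacent-iteration loss difference
            break
        prev_loss = loss.item()
    return model
```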
In some exemplary embodiments, the packet convolutional network includes a first packet convolutional network and a second packet convolutional network; the first packet convolutional network includes a first convolutional network and a second convolutional network, the number of first convolution kernels in the first convolutional network is consistent with the number of channels of the first sample feature map, the number of second convolution kernels in the second convolutional network is twice the preset number of parallax layer levels, and the number of third convolution kernels in the second packet convolutional network is the preset number of parallax layer levels. Furthermore, when the packet convolutional network performs packet convolution processing on the first sample feature map based on the preset number of parallax layer levels, the method may include:
performing convolution processing on the feature images of the corresponding channels based on first convolution cores in the first convolution network, which correspond to the channels of the first sample feature images respectively, so as to obtain a plurality of sample channel convolution feature images;
performing space dimension convolution processing on the plurality of sample channel convolution feature graphs based on each second convolution kernel in the second convolution network to obtain a first sample convolution feature graph corresponding to each second convolution kernel;
Stacking the first sample convolution feature graphs corresponding to the second convolution kernels to obtain second sample feature graphs;
respectively carrying out convolution processing on the second sample feature images based on each third convolution kernel in the second grouping convolution network to obtain second sample convolution feature images of the preset parallax layer series;
and the second sample convolution feature map of the preset parallax layer series forms the sample three-dimensional matching cost feature.
Specifically, when the convolution processing of the spatial dimension is performed on the plurality of sample channel convolution feature maps based on each second convolution kernel in the second convolution network, each second convolution kernel performs the convolution processing of the spatial dimension on all of the sample channel convolution feature maps simultaneously, and the result of that convolution processing is one first sample convolution feature map, thereby realizing information interaction among channels.
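The two stages just described map naturally onto a per-channel (grouped) convolution followed by a 1x1 convolution. A minimal PyTorch-style sketch follows; PyTorch, the class name, and the BN/LeakyReLU placement (taken from the fig. 4 description later in this document) are assumptions:

```python
import torch.nn as nn

class FirstPacketConv(nn.Module):
    """Sketch of the first packet convolution network."""
    def __init__(self, in_channels: int, disparity_levels: int):
        super().__init__()
        # First convolution network: one 3x3 kernel per channel of the input;
        # groups=in_channels makes each kernel convolve only its own channel.
        self.per_channel = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                     padding=1, groups=in_channels)
        self.bn = nn.BatchNorm2d(in_channels)
        self.act = nn.LeakyReLU(inplace=True)
        # Second convolution network: 2*D kernels of size 1x1; each kernel sees
        # all channel maps at once, realizing cross-channel information interaction.
        self.spatial_1x1 = nn.Conv2d(in_channels, 2 * disparity_levels,
                                     kernel_size=1)

    def forward(self, x):           # x: the stitched first sample feature map
        x = self.act(self.bn(self.per_channel(x)))
        return self.spatial_1x1(x)  # 2*D first sample convolution feature maps
```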
In some exemplary embodiments, the packet convolutional network further includes a plurality of cascaded third packet convolutional networks, and further performing, based on each third convolution kernel in the second packet convolutional network, convolution processing on the second sample feature map, to obtain a second sample convolution feature map of the preset parallax layer level may include:
sequentially convolving the second sample feature map based on the plurality of cascaded third packet convolutional networks; each third packet convolution network sequentially performs depth convolution and point-by-point convolution on its respective current input;
and taking the convolution processing result of the last third packet convolution network as the input of the second packet convolution network, and respectively carrying out convolution processing on the convolution processing result by each third convolution kernel in the second packet convolution network to obtain a second sample convolution characteristic diagram of the preset parallax layer series.
In the embodiment, the convolution processing including the depth convolution and the point-by-point convolution is sequentially performed on the second sample feature map through the plurality of cascaded third grouping convolution networks, so that the interaction degree of information in the channel dimension and the space dimension can be improved, the precision of three-dimensional matching cost features of samples is further improved, the training effect of the binocular image processing model is improved, and the prediction precision of the binocular image processing model on the parallax map is improved.
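For concreteness, one such cascaded block might look as follows in PyTorch (a sketch under the assumption of 2*D channels throughout, with the BN and LeakyReLU placement taken from the fig. 4 description; the helper name and the value of D are illustrative):

```python
import torch.nn as nn

D = 24  # hypothetical preset number of parallax layer levels

def third_packet_block(channels: int = 2 * D) -> nn.Sequential:
    """One cascaded third packet convolution block: depth convolution
    followed by point-by-point convolution."""
    return nn.Sequential(
        # Depth convolution: 3x3 kernels, one per channel (groups=channels).
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(inplace=True),
        # Point-by-point convolution: 1x1 kernels spanning all channels.
        nn.Conv2d(channels, channels, kernel_size=1),
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(inplace=True),
    )

# Three cascaded blocks, as suggested for fig. 4:
cascade = nn.Sequential(third_packet_block(), third_packet_block(),
                        third_packet_block())
```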
In some exemplary embodiments, the performing, by each third convolution kernel in the second packet convolution network, convolution processing on the convolution processing result to obtain a second sample convolution feature map of the preset parallax layer number may include:
Dividing a plurality of channels of the convolution processing result based on the number of third convolution kernels in the second grouping convolution network to obtain a plurality of sample channel characteristic diagram groups;
and respectively inputting the plurality of sample channel feature map groups to the corresponding third convolution kernels, performing convolution processing by each third convolution kernel on the sample channel feature map group input to it, and obtaining the second sample convolution feature maps of the preset parallax layer levels based on the results of the convolution processing of the third convolution kernels.
The embodiment further improves the precision of the three-dimensional matching cost characteristics of the sample, is beneficial to improving the training effect of the binocular image processing model and improves the prediction precision of the binocular image processing model on the parallax image.
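This channel-grouping step corresponds directly to a grouped 1x1 convolution. A minimal sketch (PyTorch assumed; the value of D is a placeholder):

```python
import torch.nn as nn

D = 24  # hypothetical preset number of parallax layer levels

# The 2*D input channels are divided into D groups (two channels each), in
# one-to-one correspondence with the D third convolution kernels; groups=D
# makes each 1x1 kernel convolve only the feature maps of its own group,
# yielding the D second convolution feature maps (the initial cost volume).
second_packet = nn.Conv2d(in_channels=2 * D, out_channels=D,
                          kernel_size=1, groups=D)
```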
In some exemplary embodiments, in order to further improve the training effect of the binocular image processing model, when feature extraction is performed on the basis of the feature extraction network of the binocular image processing model, feature extraction may be performed on the sample left-eye image and the sample right-eye image based on the feature extraction network to obtain a sample left-eye feature map of a first scale and a sample right-eye feature map of a first scale, and the first sample feature map may be obtained by performing stitching processing on the sample left-eye feature map of the first scale and the sample right-eye feature map of the first scale.
For the sample left-eye feature map of the first scale, downsampling processing can further be performed to obtain a downsampled sample left-eye feature map of at least one scale. For example, the sample left-eye feature map of the first scale can be downsampled to 1/2 resolution and 1/4 resolution, yielding two downsampled left-eye feature maps of smaller scales. The downsampled sample left-eye feature maps can then be applied in the subsequent cost aggregation processing to provide wider receptive fields and semantic features, which facilitates cost aggregation by the binocular image processing model and improves the training effect of the model.
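A minimal sketch of this multi-scale branch (strided 3x3 convolutions are an assumption — the text states only that the downsampled maps are produced by convolution processing; the channel count and tensor sizes are placeholders):

```python
import torch
import torch.nn as nn

C = 32  # hypothetical channel count of the first-scale feature map
# A single strided conv is reused here for brevity; in practice each scale
# would likely have its own convolution with separately learned weights.
down = nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1)

feat_full = torch.randn(1, C, 64, 128)   # first-scale sample left-eye feature map
feat_half = down(feat_full)              # 1/2-resolution map, shape (1, C, 32, 64)
feat_quarter = down(feat_half)           # 1/4-resolution map, shape (1, C, 16, 32)
```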
In order to facilitate understanding of the technical solution of the embodiment of the present application, the following description is made with reference to the structure of the binocular image processing model shown in fig. 4.
As shown in fig. 4, the binocular image processing model includes a feature extraction network for feature extraction, a packet convolution network for constructing cost volumes, and a two-dimensional convolution network for cost aggregation.
Wherein, the feature extraction network outputs a left-eye feature map of a first scale, denoted here F_L, and a right-eye feature map of the first scale, denoted F_R. In addition, two downsampled left-eye feature maps of smaller scales, denoted F_L^(1/2) and F_L^(1/4), are output through respective convolution processing. In the subsequent cost aggregation stage, F_L, F_L^(1/2), F_L^(1/4) and the constructed three-dimensional cost volume are used as input features of the two-dimensional convolution network for cost aggregation processing.
Wherein the packet convolutional network for constructing the cost volume comprises a first packet convolutional network, a plurality of cascaded third packet convolutional networks, and a second packet convolutional network.
The first grouping convolution network includes a first convolution network and a second convolution network connected in sequence. The input of the first convolution network is the first feature map, denoted F_0, obtained by splicing F_L and F_R. The first convolution network includes C convolution kernels of size 3x3, where C is the number of channels of the first feature map; each 3x3 convolution kernel corresponds to one channel of F_0 and is used for performing convolution processing on the feature map of that channel, whereby C convolution results are obtained. The C convolution results are stacked as the input feature map of the second convolution network. Specifically, before being input to the second convolution network, the stack may also be sequentially processed using a normalization layer (Batch Normalization, BN) and an activation function layer (Rectified Linear Unit, ReLU), where the activation function may be a LeakyReLU activation function.
The second convolution network includes 2D convolution kernels of size 1x1, where D represents the preset number of parallax layer levels. Each 1x1 convolution kernel is used for performing convolution processing of the spatial dimension on the input feature map, so that 2D convolution results are obtained. The 2D convolution results are stacked as the input feature map of the first third packet convolution network. Specifically, before being input to the third packet convolution network, the stack may also be sequentially processed using the normalization layer and the activation function layer.
Each third packet convolution network includes a depth convolution layer and a point-by-point convolution layer. The depth convolution layer is used for depth convolution and includes 2D convolution kernels of size 3x3; the point-by-point convolution layer is used for point-by-point convolution and includes 2D convolution kernels of size 1x1. Specifically, the depth convolution layer may further include a normalization layer and an activation function layer, and the point-by-point convolution layer may likewise further include a normalization layer and an activation function layer. After the plurality of cascaded third packet convolution networks perform convolution processing in turn, the output of the last third packet convolution network serves as the input feature map of the second packet convolution network. Illustratively, the number of cascaded third packet convolution networks may be 3.
The second packet convolutional network includes D convolution kernels of size 1x1. The 2D channels of its input feature map are divided into D groups in one-to-one correspondence with the D 1x1 convolution kernels, and each 1x1 convolution kernel is used for performing convolution processing on the feature maps in its corresponding group, whereby D convolution results are obtained. Based on the D convolution results, the initial cost volume, i.e., the three-dimensional matching cost feature, can be formed. Specifically, the second packet convolutional network may also process the D convolution results based on the normalization layer and the activation function layer.
For example, as shown in fig. 4, a plurality of second packet convolutional networks connected in sequence may be provided; each second packet convolutional network performs the above convolution processing, and the output result of the last second packet convolutional network is taken as the initial cost volume, i.e., the three-dimensional matching cost feature. For example, two second packet convolutional networks connected in sequence may be provided.
In the cost aggregation stage, a two-dimensional hourglass network is adopted to perform cost aggregation processing based on F_L, F_L^(1/2), F_L^(1/4) and the cost volume, obtaining the output target matching cost feature. Then, in the parallax prediction stage, the corresponding disparity map can be predicted based on the target matching cost feature.
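The text does not fix the form of this final prediction step; a common choice, assumed here purely for illustration, is softmax-weighted disparity regression over the D parallax levels of the aggregated cost feature:

```python
import torch
import torch.nn.functional as F

def regress_disparity(cost: torch.Tensor) -> torch.Tensor:
    """Softmax-weighted disparity regression (an assumed regression form).
    `cost` has shape (B, D, H, W); returns a disparity map of shape (B, H, W)."""
    d = cost.shape[1]
    prob = F.softmax(cost, dim=1)  # per-pixel distribution over parallax levels
    levels = torch.arange(d, dtype=cost.dtype,
                          device=cost.device).view(1, d, 1, 1)
    return (prob * levels).sum(dim=1)  # expected disparity value per pixel
```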
This embodiment builds the three-dimensional cost volume in a fully convolutional manner, realizing implicit construction of the cost volume and avoiding traversal of every parallax level with the dense memory-access operations that traversal entails, thereby improving binocular image processing efficiency. Moreover, because the obtained cost volume is still a three-dimensional tensor, the whole model can be completed using only 2D convolution while retaining complete information, avoiding the limitations of 3D convolution operations; the model is therefore easier to deploy on various NPU chips and achieves a speed advantage without a loss in accuracy, making it suitable for scenarios with high real-time requirements.
Corresponding to the binocular image processing method provided in the above embodiments, the present invention also provides a binocular image processing apparatus. Since the apparatus provided in the embodiment of the present invention corresponds to the method provided in the above embodiments, the implementations of the binocular image processing method described above are also applicable to the apparatus provided in this embodiment and will not be described in detail here.
Referring to fig. 5, a schematic structural diagram of a binocular image processing apparatus according to an embodiment of the present invention is shown, where the apparatus has a function of implementing the binocular image processing method in the foregoing method embodiment, and the function may be implemented by hardware or implemented by executing corresponding software by hardware. As shown in fig. 5, the binocular image processing apparatus 500 may include:
An image acquisition module 510, configured to acquire a binocular image to be processed; the binocular image to be processed comprises a left-eye image and a right-eye image;
the feature extraction module 520 is configured to perform feature extraction on the left-eye image and the right-eye image, so as to obtain a left-eye feature map and a right-eye feature map;
the feature stitching module 530 is configured to stitch the left-eye feature map and the right-eye feature map to obtain a first feature map;
the three-dimensional matching cost determining module 540 is configured to perform a grouping convolution process on the first feature map based on a preset parallax layer level to obtain a three-dimensional matching cost feature; the three-dimensional matching cost characteristics represent the matching cost of the pixel points of the binocular image to be processed at each parallax layer level;
the cost aggregation module 550 is configured to perform cost aggregation processing on the three-dimensional matching cost feature based on a two-dimensional convolution network, so as to obtain a target matching cost feature;
and the disparity map prediction module 560 is configured to predict a disparity map corresponding to the binocular image to be processed based on the target matching cost feature.
In an exemplary embodiment, the three-dimensional matching cost determining module 540 includes:
the first convolution processing module is used for carrying out convolution processing on the feature images of the corresponding channels based on first convolution cores corresponding to the channels of the first feature images respectively to obtain a plurality of channel convolution feature images;
The second convolution processing module is used for carrying out convolution processing of space dimensions on the plurality of channel convolution feature graphs based on a first number of second convolution kernels to obtain a first number of first convolution feature graphs; the first number is twice the number of preset parallax layer stages;
the feature stacking module is used for stacking the first convolution feature images of the first number to obtain a second feature image;
the third convolution processing module is used for respectively carrying out convolution processing on the second feature images based on a second number of third convolution kernels to obtain the second number of second convolution feature images; the second number is the preset parallax layer number;
wherein the second number of second convolution feature maps constitute the three-dimensional matching cost feature.
In an exemplary embodiment, the third convolution processing module includes:
a fourth convolution module, configured to take the second feature map as a current feature map, and sequentially perform depth convolution and point-by-point convolution on the current feature map to obtain a third feature map;
the updating module is used for updating the current feature map into the third feature map, and executing the steps of sequentially carrying out deep convolution and point-by-point convolution on the current feature map until the updating times reach preset times and stopping updating;
And a fifth convolution module, configured to perform convolution processing on the third feature graphs when the update is stopped based on the second number of third convolution kernels, so as to obtain the second number of second convolution feature graphs.
In an exemplary embodiment, the fifth convolution module is specifically configured to: dividing the channels of the third feature map when the updating is stopped into the channel feature map groups of the second number; and respectively carrying out convolution processing on the second number of channel characteristic image groups based on the second number of third convolution kernels, and obtaining the second number of second convolution characteristic images based on the result of the convolution processing.
In an exemplary embodiment, the feature extraction module 520 includes:
the first feature extraction submodule is used for carrying out feature extraction on the left-eye image to obtain a left-eye feature image of a first scale;
the second feature extraction sub-module is used for extracting features of the right-eye image to obtain a right-eye feature map of the first scale;
the downsampling module is used for downsampling the left eye feature map of the first scale to obtain a downsampled left eye feature map of at least one scale;
Correspondingly, the feature stitching module 530 is specifically configured to: perform splicing processing on the left-eye feature map of the first scale and the right-eye feature map of the first scale to obtain the first feature map.
In an exemplary embodiment, the cost aggregation module 550 is specifically configured to: respectively inputting the left eye feature map of the first scale, the downsampled left eye feature map of each scale and the three-dimensional matching cost feature as input features into a two-dimensional hourglass network, and carrying out convolution processing on each input feature by the two-dimensional hourglass network to obtain a convolution result corresponding to each input feature; and fusing convolution results corresponding to the input features to obtain target matching cost features.
In an exemplary embodiment, the apparatus further comprises a training module comprising:
the sample image acquisition module is used for acquiring a sample binocular image and a standard parallax image corresponding to the sample binocular image; the sample binocular image comprises a sample left-eye image and a sample right-eye image;
the sample feature extraction module is used for respectively carrying out feature extraction on the sample left-eye image and the sample right-eye image based on a feature extraction network of the binocular image processing model to obtain a sample left-eye feature image and a sample right-eye feature image;
The sample characteristic splicing module is used for carrying out splicing treatment on the sample left-eye characteristic diagram and the sample right-eye characteristic diagram to obtain a first sample characteristic diagram;
the sample three-dimensional matching cost determining module is used for inputting the first sample feature map into a grouping convolution network of the binocular image processing model, and the grouping convolution network performs grouping convolution processing on the first sample feature map based on a preset parallax layer level number to obtain sample three-dimensional matching cost features; the three-dimensional matching cost characteristics of the sample represent the matching cost of the pixel points of the binocular image of the sample at each parallax layer level;
the sample cost aggregation module is used for carrying out cost aggregation processing on the sample three-dimensional matching cost characteristics based on the two-dimensional convolution network of the binocular image processing model to obtain sample target matching cost characteristics;
and the training sub-module is used for predicting a sample parallax image corresponding to the sample binocular image based on the sample target matching cost characteristic, and adjusting model parameters of the binocular image processing model based on the difference between the sample parallax image and the standard parallax image until a preset training ending condition is reached.
In one exemplary embodiment, the packet convolutional network includes a first packet convolutional network and a second packet convolutional network; the first packet convolutional network comprises a first convolutional network and a second convolutional network, the number of first convolutional kernels in the first convolutional network is consistent with the channel number of the first sample feature map, the number of second convolutional kernels in the second convolutional network is twice the number of the preset parallax layer stages, and the number of third convolutional kernels in the second packet convolutional network is the number of the preset parallax layer stages;
The sample three-dimensional matching cost determining module is specifically configured to, when executing the grouping convolution processing on the first sample feature map by the grouping convolution network based on the preset number of parallax layer levels to obtain the sample three-dimensional matching cost feature: perform convolution processing on the feature maps of the corresponding channels based on the first convolution kernels in the first convolution network that respectively correspond to the channels of the first sample feature map, to obtain a plurality of sample channel convolution feature maps; perform convolution processing of the spatial dimension on the plurality of sample channel convolution feature maps based on each second convolution kernel in the second convolution network, to obtain a first sample convolution feature map corresponding to each second convolution kernel; stack the first sample convolution feature maps corresponding to the second convolution kernels to obtain a second sample feature map; and perform convolution processing on the second sample feature map based on each third convolution kernel in the second packet convolution network, to obtain the second sample convolution feature maps of the preset parallax layer levels; wherein the second sample convolution feature maps of the preset parallax layer levels form the sample three-dimensional matching cost feature.
In an exemplary embodiment, the packet convolution network further includes a plurality of cascaded third packet convolution networks, and the sample three-dimensional matching cost determining module is specifically configured to, when executing the convolution processing on the second sample feature map based on each third convolution kernel in the second packet convolution network to obtain the second sample convolution feature maps of the preset parallax layer levels: sequentially convolve the second sample feature map based on the plurality of cascaded third packet convolution networks, each third packet convolution network sequentially performing depth convolution and point-by-point convolution on its respective current input; and take the convolution processing result of the last third packet convolution network as the input of the second packet convolution network, each third convolution kernel in the second packet convolution network respectively performing convolution processing on the convolution processing result, to obtain the second sample convolution feature maps of the preset parallax layer levels.
In an exemplary embodiment, the sample three-dimensional matching cost determining module is specifically configured to, when executing the convolution processing of each third convolution kernel in the second packet convolution network on the convolution processing result to obtain the second sample convolution feature maps of the preset parallax layer levels: divide the plurality of channels of the convolution processing result based on the number of third convolution kernels in the second packet convolution network to obtain a plurality of sample channel feature map groups; and respectively input the plurality of sample channel feature map groups to the corresponding third convolution kernels, each third convolution kernel performing convolution processing on the sample channel feature map group input to it, and obtain the second sample convolution feature maps of the preset parallax layer levels based on the results of the convolution processing of the third convolution kernels.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
The embodiment of the application also provides an electronic device, which comprises a processor and a memory, wherein at least one instruction or at least one section of program is stored in the memory, and the at least one instruction or the at least one section of program is loaded and executed by the processor to realize any binocular image processing method provided by the embodiment of the method.
The memory may be used to store software programs and modules that the processor executes to perform various functional applications and data processing by executing the software programs and modules stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for functions, and the like; the storage data area may store data created according to the use of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory may also include a memory controller to provide access to the memory by the processor.
The method embodiments provided by the embodiments of the present application may be performed in a computer terminal, a server, or a similar computing device, i.e., the electronic device may include a computer terminal, a server, or a similar computing device. Fig. 6 is a block diagram of a hardware structure of an electronic device for running a binocular image processing method according to an embodiment of the present application, and as shown in fig. 6, the internal structure of the electronic device may include, but is not limited to: processor, network interface and memory. The processor, the network interface, and the memory in the electronic device may be connected by a bus or other means, and in fig. 6 in the embodiment of the present disclosure, the connection by the bus is exemplified.
Among them, a processor (or CPU (Central Processing Unit, central processing unit)) is a computing core and a control core of an electronic device. The network interface may optionally include a standard wired interface, a wireless interface (e.g., WI-FI, mobile communication interface, etc.). A Memory (Memory) is a Memory device in an electronic device for storing programs and data. It will be appreciated that the memory herein may be a high speed RAM memory device or a non-volatile memory device, such as at least one magnetic disk memory device; optionally, at least one memory device located remotely from the processor. The memory provides a storage space that stores an operating system of the electronic device, which may include, but is not limited to: windows (an operating system), linux (an operating system), android (an Android, a mobile operating system) system, IOS (a mobile operating system) system, etc., the application is not limited in this regard; also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. In the embodiment of the present disclosure, the processor loads and executes one or more instructions stored in the memory to implement the binocular image processing method provided in the above embodiment of the method.
Embodiments of the present application also provide a computer readable storage medium that may be disposed in an electronic device to store at least one instruction or at least one program for implementing a binocular image processing method, where the at least one instruction or the at least one program is loaded and executed by the processor to implement any of the binocular image processing methods provided in the above method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes the computer instructions to cause the electronic device to perform the binocular image processing method of any of the above aspects.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing describes only preferred embodiments of the invention and is not intended to limit the invention to the precise form disclosed; any modifications, equivalent substitutions, and improvements made within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (14)

1. A binocular image processing method, the method comprising:
obtaining a binocular image to be processed; the binocular image to be processed comprises a left-eye image and a right-eye image;
respectively extracting features of the left eye image and the right eye image to obtain a left eye feature image and a right eye feature image;
performing splicing treatment on the left eye feature map and the right eye feature map to obtain a first feature map;
performing grouping convolution processing on the first feature map based on a preset parallax layer level number to obtain a three-dimensional matching cost feature; the three-dimensional matching cost characteristics represent the matching cost of the pixel points of the binocular image to be processed at each parallax layer level;
performing cost aggregation processing on the three-dimensional matching cost features based on a two-dimensional convolution network to obtain target matching cost features;
and predicting a parallax image corresponding to the binocular image to be processed based on the target matching cost characteristics.
2. The method of claim 1, wherein the performing a group convolution process on the first feature map based on a preset number of parallax layer levels to obtain a three-dimensional matching cost feature includes:
performing convolution processing on the feature images of the corresponding channels based on first convolution cores respectively corresponding to the channels of the first feature images to obtain a plurality of channel convolution feature images;
performing space dimension convolution processing on the channel convolution feature graphs based on a first number of second convolution kernels to obtain a first number of first convolution feature graphs; the first number is twice the number of preset parallax layer stages;
stacking the first convolution feature images of the first quantity to obtain a second feature image;
respectively carrying out convolution processing on the second feature images based on a second number of third convolution kernels to obtain second convolution feature images of the second number; the second number is the preset parallax layer number;
wherein the second number of second convolution feature maps constitute the three-dimensional matching cost feature.
3. The method of claim 2, wherein the convolving the second feature map based on a second number of third convolution kernels, respectively, to obtain the second number of second convolution feature maps comprises:
Taking the second feature map as a current feature map, and sequentially carrying out depth convolution and point-by-point convolution on the current feature map to obtain a third feature map;
updating the current feature map to the third feature map, and executing the steps of sequentially carrying out depth convolution and point-by-point convolution on the current feature map until the updating times reach the preset times, and stopping updating;
and respectively carrying out convolution processing on the third feature images when the updating is stopped based on the second number of third convolution kernels to obtain the second number of second convolution feature images.
4. The method of claim 3, wherein the convolving the third feature map at the time of stopping the update based on the second number of third convolution kernels to obtain the second number of second convolved feature maps, respectively, comprises:
dividing the channels of the third feature map when the updating is stopped into the channel feature map groups of the second number;
and respectively carrying out convolution processing on the second number of channel characteristic image groups based on the second number of third convolution kernels, and obtaining the second number of second convolution characteristic images based on the result of the convolution processing.
5. The method according to claim 1, wherein the feature extraction is performed on the left-eye image and the right-eye image to obtain a left-eye feature map and a right-eye feature map, respectively, including:
extracting features of the left eye image to obtain a left eye feature image of a first scale;
extracting features of the right eye image to obtain a right eye feature image of the first scale;
performing downsampling treatment on the left eye feature map of the first scale to obtain a downsampled left eye feature map of at least one scale;
the step of performing a stitching process on the left eye feature map and the right eye feature map to obtain a first feature map includes:
and performing splicing treatment on the left eye feature map of the first scale and the right eye feature map of the first scale to obtain a first feature map.
6. The method of claim 5, wherein the cost-aggregating the three-dimensional matching cost features based on the two-dimensional convolution network to obtain a target matching cost feature comprises:
respectively inputting the left eye feature map of the first scale, the downsampled left eye feature map of each scale and the three-dimensional matching cost feature as input features into a two-dimensional hourglass network, and carrying out convolution processing on each input feature by the two-dimensional hourglass network to obtain a convolution result corresponding to each input feature;
And fusing convolution results corresponding to the input features to obtain target matching cost features.
7. The method according to any one of claims 1-6, wherein the method is implemented based on a binocular image processing model, the training of the binocular image processing model comprising:
acquiring a sample binocular image and a standard parallax image corresponding to the sample binocular image; the sample binocular image comprises a sample left-eye image and a sample right-eye image;
respectively carrying out feature extraction on the sample left-eye image and the sample right-eye image based on a feature extraction network of the binocular image processing model to obtain a sample left-eye feature map and a sample right-eye feature map;
splicing the sample left-eye feature map and the sample right-eye feature map to obtain a first sample feature map;
inputting the first sample feature map into a grouping convolution network of the binocular image processing model, and performing grouping convolution processing on the first sample feature map by the grouping convolution network based on a preset parallax layer level number to obtain sample three-dimensional matching cost features; the three-dimensional matching cost characteristics of the sample represent the matching cost of the pixel points of the binocular image of the sample at each parallax layer level;
Performing cost aggregation processing on the three-dimensional matching cost features of the sample based on a two-dimensional convolution network of the binocular image processing model to obtain target matching cost features of the sample;
and predicting a sample parallax image corresponding to the sample binocular image based on the sample target matching cost characteristic, and adjusting model parameters of the binocular image processing model based on the difference between the sample parallax image and the standard parallax image until a preset training ending condition is reached.
8. The method of claim 7, wherein the packet convolutional network comprises a first packet convolutional network and a second packet convolutional network; the first packet convolutional network comprises a first convolutional network and a second convolutional network, the number of first convolutional kernels in the first convolutional network is consistent with the channel number of the first sample feature map, the number of second convolutional kernels in the second convolutional network is twice the number of the preset parallax layer stages, and the number of third convolutional kernels in the second packet convolutional network is the number of the preset parallax layer stages;
the grouping convolution network performs grouping convolution processing on the first sample feature map based on a preset parallax layer level number, and the obtaining of the sample three-dimensional matching cost feature comprises the following steps:
Performing convolution processing on the feature images of the corresponding channels based on first convolution cores in the first convolution network, which correspond to the channels of the first sample feature images respectively, so as to obtain a plurality of sample channel convolution feature images;
performing space dimension convolution processing on the plurality of sample channel convolution feature graphs based on each second convolution kernel in the second convolution network to obtain a first sample convolution feature graph corresponding to each second convolution kernel;
stacking the first sample convolution feature graphs corresponding to the second convolution kernels to obtain second sample feature graphs;
respectively carrying out convolution processing on the second sample feature images based on each third convolution kernel in the second grouping convolution network to obtain second sample convolution feature images of the preset parallax layer series;
and the second sample convolution feature map of the preset parallax layer series forms the sample three-dimensional matching cost feature.
9. The method of claim 8, wherein the packet convolutional network further comprises a plurality of cascaded third packet convolutional networks, the convolving the second sample feature map based on each of the third convolution kernels in the second packet convolutional network to obtain a second sample convolving feature map of the predetermined number of parallax layer stages, comprising:
sequentially convolving the second sample feature map based on the plurality of cascaded third packet convolutional networks; each third grouping convolution network sequentially executes depth convolution and point-by-point convolution on respective current input;
and taking the convolution processing result of the last third packet convolution network as the input of the second packet convolution network, and respectively carrying out convolution processing on the convolution processing result by each third convolution kernel in the second packet convolution network to obtain a second sample convolution characteristic diagram of the preset parallax layer series.
10. The method according to claim 9, wherein the convolving result by each third convolution kernel in the second packet convolution network to obtain a second sample convolution feature map of the preset number of parallax layers, includes:
dividing a plurality of channels of the convolution processing result based on the number of third convolution kernels in the second grouping convolution network to obtain a plurality of sample channel characteristic diagram groups;
and respectively inputting the plurality of sample channel feature graphs to corresponding third convolution kernels, carrying out convolution processing on the sample channel feature graphs input by the corresponding third convolution kernels, and obtaining a second sample convolution feature graph of the preset parallax layer series based on the result of the convolution processing of each third convolution kernel.
11. A binocular image processing apparatus, the apparatus comprising:
the image acquisition module is used for acquiring a binocular image to be processed; the binocular image to be processed comprises a left-eye image and a right-eye image;
the feature extraction module is used for extracting features of the left-eye image and the right-eye image respectively to obtain a left-eye feature image and a right-eye feature image;
the characteristic splicing module is used for carrying out splicing treatment on the left-eye characteristic diagram and the right-eye characteristic diagram to obtain a first characteristic diagram;
the three-dimensional matching cost determining module is used for carrying out grouping convolution processing on the first feature map based on the preset parallax layer level number to obtain three-dimensional matching cost features; the three-dimensional matching cost characteristics represent the matching cost of the pixel points of the binocular image to be processed at each parallax layer level;
the cost aggregation module is used for conducting cost aggregation processing on the three-dimensional matching cost features based on a two-dimensional convolution network to obtain target matching cost features;
and the disparity map predicting module is used for predicting the disparity map corresponding to the binocular image to be processed based on the target matching cost characteristics.
12. An electronic device, comprising a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or the at least one program is loaded and executed by the processor to implement the binocular image processing method of any of claims 1-10.
13. A computer readable storage medium, wherein at least one instruction or at least one program is stored in the computer readable storage medium, and the at least one instruction or the at least one program is loaded and executed by a processor to implement the binocular image processing method of any one of claims 1-10.
14. A computer program, characterized in that the computer program, when being executed by a processor, implements the binocular image processing method of any one of claims 1 to 10.