CN112949504A - Stereo matching method, device, equipment and storage medium - Google Patents
- Publication number
- CN112949504A (application CN202110244418.7A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- pixel point
- target
- stereo matching
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/10—Terrestrial scenes (G Physics › G06 Computing; calculating or counting › G06V Image or video recognition or understanding › G06V20/00 Scenes; scene-specific elements)
- G06F18/22—Matching criteria, e.g. proximity measures (G06F Electric digital data processing › G06F18/00 Pattern recognition › G06F18/20 Analysing)
- G06N3/045—Combinations of networks (G06N Computing arrangements based on specific computational models › G06N3/00 Computing arrangements based on biological models › G06N3/02 Neural networks › G06N3/04 Architecture, e.g. interconnection topology)
- G06N3/084—Backpropagation, e.g. using gradient descent (G06N3/00 › G06N3/02 Neural networks › G06N3/08 Learning methods)
- G06V10/40—Extraction of image or video features (G06V10/00 Arrangements for image or video recognition or understanding)
Abstract
The invention discloses a stereo matching method, device, equipment and storage medium. The method comprises the following steps: acquiring an original image pair captured by a binocular camera, the pair comprising a left image and a right image; extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively; inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a disparity map from the target left feature map and the target right feature map, thereby achieving more accurate matching while improving matching efficiency.
Description
Technical Field
The invention relates to the technical field of computer stereoscopic vision, and in particular to a stereo matching method, device, equipment and storage medium.
Background
With the rapid development of artificial intelligence and computer technology, replacing human eyes with machine vision for measurement and judgment has become a research focus. On automated production lines, machine vision can improve production flexibility and the degree of automation, and it is particularly suitable for dangerous working environments unsuited to manual operation and for situations where human vision cannot meet the requirements. As an important branch of machine vision, Binocular Stereo Vision has the advantages of high efficiency, suitable accuracy, simple system structure and low cost, and it has great application value in many areas such as virtual reality, robot navigation and non-contact measurement.
Binocular stereo vision perceives the real world by simulating the human visual system. It mainly consists of four stages: offline camera calibration, which obtains the camera's intrinsic and extrinsic parameters, distortion coefficients, and so on; rectification, which removes the effects of optical distortion and converts the binocular camera into a standard form; stereo matching, which produces a disparity map; and 3D distance calculation, which computes the actual depth of objects from the disparity map.
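The final stage, 3D distance calculation, follows directly from triangulation on a rectified pair: depth equals focal length times baseline divided by disparity. A minimal sketch of this relation (the focal length, baseline and disparity values below are illustrative, not taken from the patent):

```python
def depth_from_disparity(disparity_px: float, focal_px: float, baseline_m: float) -> float:
    """Triangulate depth Z = f * B / d for a rectified stereo pair.

    disparity_px: horizontal pixel offset between the left and right image points
    focal_px:     focal length in pixels (from offline calibration)
    baseline_m:   distance between the two camera centers, in meters
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / disparity_px

# A point with 40 px disparity, a 720 px focal length and a 0.12 m baseline:
z = depth_from_disparity(40.0, 720.0, 0.12)  # 2.16 m
```

The formula also shows why accurate stereo matching matters: any error in the disparity propagates inversely into the recovered depth.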
Stereo matching is a focus and a difficulty of binocular stereo vision, and it is currently an active research topic both at home and abroad. The input to stereo matching is a pair of rectified left and right images that differ only in the horizontal direction. More specifically, if a point on a real object is imaged at position (x, y) in the left image and at position (a, b) in the right image, then x ≤ a and y = b. Conventional convolutional neural network algorithms for stereo matching usually perform poorly at object edges, where mismatching occurs, which affects the accuracy and efficiency of stereo matching.
Accordingly, there is a need for improvements and developments in the art.
Disclosure of Invention
Aiming at the deficiencies of the prior art, the present invention provides a stereo matching method, apparatus, device and storage medium, so as to solve the technical problem that convolutional neural network algorithms for stereo matching in the prior art usually perform poorly at object edges, where mismatching occurs, affecting the accuracy and efficiency of stereo matching.
In order to achieve the purpose, the invention adopts the following technical scheme:
in a first aspect, an embodiment of the present invention provides a stereo matching method, where the method includes:
acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image;
respectively extracting a first left feature map and a first right feature map which respectively correspond to the left image and the right image;
inputting the first left feature map and the first right feature map to a convolutional neural network module respectively to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and obtaining a disparity map according to the target left feature map and the target right feature map.
As a further improved technical solution, the convolutional neural network module includes a first module and a second module, where the first module and the second module each include a first unit, a second unit, a weight calculation unit, a multiplication unit, and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit, and the addition unit are connected in sequence.
As a further improved technical solution, the respectively inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map specifically includes:
inputting the first left feature map and the first right feature map into the first unit respectively to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
inputting the second left feature map and the second right feature map into the second unit respectively to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map into the weight calculation unit to respectively obtain a fourth left feature map and a fourth right feature map;
inputting the fourth left feature map and the second left feature map into the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map into the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map into the adding unit to obtain a target left feature map;
inputting the fifth right feature map and the first right feature map into the adding unit to obtain a target right feature map.
As a further improvement, the first unit includes two third units, and each third unit includes one 2D convolutional layer and one activation function.
As a further improved technical solution, the convolution kernel size of the 2D convolutional layer is 3 × 3, and the activation function is ReLU.
As a further improved technical solution, the inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively specifically includes:
for each pixel point on the third left feature map, comparing the pixel point with the pixel points that are in the same row of the third right feature map and within a specified range, so as to obtain a first pixel point whose difference from the pixel point is minimal, and calculating the weight of the pixel point according to the first pixel point, to obtain a fourth left feature map;
and for each pixel point on the third right feature map, comparing the pixel point with the pixel points that are in the same row of the third left feature map and within the specified range, so as to obtain a second pixel point whose difference from the pixel point is minimal, and calculating the weight of the pixel point according to the second pixel point, to obtain a fourth right feature map.
As a further improved technical solution, the weight is calculated as:
1 - sigmoid(M), where M is the minimum difference value.
In a second aspect, an embodiment of the present invention provides a stereo matching apparatus, where the apparatus includes:
the system comprises an acquisition module, a display module and a display module, wherein the acquisition module is used for acquiring an original image pair obtained by a binocular camera, and the image pair comprises a left image and a right image;
the extraction module is used for respectively extracting a first left feature map and a first right feature map which respectively correspond to the left image and the right image;
the data module is used for respectively inputting the first left feature map and the first right feature map into a convolutional neural network module so as to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and the matching module is used for obtaining a disparity map according to the target left feature map and the target right feature map.
In a third aspect, an embodiment of the present invention provides a stereo matching apparatus, where the apparatus includes: a processor and memory and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the stereo matching method as described in any one of the above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the steps in the stereo matching method described in any one of the above.
Advantageous effects: compared with the prior art, the invention provides a stereo matching method, device, equipment and storage medium. The method comprises the following steps: acquiring an original image pair captured by a binocular camera, the pair comprising a left image and a right image; extracting a first left feature map and a first right feature map corresponding to the left image and the right image respectively; inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a disparity map from the target left feature map and the target right feature map, thereby achieving more accurate matching while improving matching efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a stereo matching method according to a preferred embodiment of the present invention;
FIG. 2 is a flowchart illustrating the whole implementation process of the preferred embodiment of the stereo matching method according to the present invention;
FIG. 3 is a schematic diagram of a convolutional neural network module in the stereo matching method provided in the present invention;
fig. 4 is a flowchart of a preferred embodiment of step S300 in the stereo matching method provided by the present invention;
FIG. 5 is a schematic structural diagram of a preferred embodiment of a stereo matching device according to the present invention;
fig. 6 is a schematic structural diagram of a preferred embodiment of the stereo matching apparatus provided in the present invention.
Detailed Description
The present invention provides a stereo matching method, device, equipment and storage medium. In order to make the objects, technical solutions and effects of the present invention clearer and more explicit, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The invention will be further explained by the description of the embodiments with reference to the drawings.
The present embodiment provides a stereo matching method, as shown in fig. 1 and fig. 2, the method includes:
s100, acquiring an original image pair obtained through a binocular camera, wherein the image pair comprises a left image and a right image.
In this embodiment, an original image pair is acquired by a binocular camera, the pair comprising a left image and a right image. The invention performs the corresponding calculation and processing on the left image and the right image through the following steps so as to finally obtain a disparity map.
S200, respectively extracting a first left feature map and a first right feature map corresponding to the left image and the right image.
In the embodiment of the invention, two residual convolutional neural networks are used to extract image features from the left image and the right image respectively; the two networks have the same structure and share network parameters. The residual convolutional neural network comprises a plurality of convolutional layers, each followed by a normalization layer and a nonlinear activation function layer. After the left and right images are input into the residual convolutional neural network, they are preprocessed by several convolutional layers at the front end, which reduce the height and width of the images to one half of their original values; the convolutional layers at the end of the residual convolutional neural network use dilated ("hole") convolutions.
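The halving of height and width at the front end is typically achieved with strided convolutions, while the dilated convolutions at the end enlarge the receptive field without shrinking the map further. A sketch of the standard output-size arithmetic; the specific kernel sizes, strides and paddings below are illustrative assumptions, not values stated in the patent:

```python
def conv_out_size(size: int, kernel: int, stride: int = 1,
                  padding: int = 0, dilation: int = 1) -> int:
    """Spatial output size of a convolution (the usual floor formula)."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1

# A 3x3 convolution with stride 2 and padding 1 halves a 256-pixel side:
half = conv_out_size(256, kernel=3, stride=2, padding=1)      # 128
# A dilated 3x3 convolution keeps the size but widens the receptive field:
same = conv_out_size(128, kernel=3, stride=1, padding=2, dilation=2)  # 128
```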
S300, respectively inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map.
In this embodiment, as shown in fig. 3, the convolutional neural network module includes a first module and a second module, where the first module and the second module each include a first unit, a second unit, a weight calculation unit, a multiplication unit, and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit, and the addition unit are sequentially connected.
Further, the respectively inputting the first left feature map and the first right feature map into a convolutional neural network module to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map specifically includes:
inputting the first left feature map and the first right feature map into the first unit respectively to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
inputting the second left feature map and the second right feature map into the second unit respectively to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map into the weight calculation unit to respectively obtain a fourth left feature map and a fourth right feature map;
inputting the fourth left feature map and the second left feature map into the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map into the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map into the adding unit to obtain a target left feature map;
inputting the fifth right feature map and the first right feature map into the adding unit to obtain a target right feature map.
Specifically, the first unit includes two third units; each third unit includes one 2D convolutional layer and one activation function, where the convolution kernel size of the 2D convolutional layer is 3 × 3, the number of output channels is 16, and the activation function is ReLU. In this embodiment, the first left feature map and the first right feature map are aggregated by these two 2D convolutional layers (kernel size 3 × 3, 16 output channels) and activation functions to obtain the second left feature map and the second right feature map. In practical application, the input is the pair of left and right feature maps (the first left and first right feature maps) of any intermediate layer in the feature extraction stage of the convolutional neural network; their size is C × H × W, where C, H and W denote the number of channels, the height and the width of the feature maps respectively. The second left and second right feature maps then aggregate information through one further 2D convolutional layer, after which each passes through a special convolutional layer with kernel size 1 and a single output channel, reducing the channel count to 1 and yielding the two corresponding intermediate feature maps (the third left and third right feature maps). Because of the ReLU activation function, the values of all points on these two output feature maps are greater than or equal to 0.
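The channel-reduction step, a convolution with kernel size 1 and one output channel followed by ReLU so that all values are non-negative, can be sketched with NumPy. This assumes the 1 × 1 convolution is a plain per-channel weighted sum without bias; the weights below are random placeholders, whereas in the network they are learned:

```python
import numpy as np

def reduce_channels_1x1(feat: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """1x1 convolution from C channels down to 1 channel, then ReLU.

    feat:    feature map of shape (C, H, W)
    weights: per-channel weights of shape (C,), standing in for learned 1x1 kernels
    Returns a (1, H, W) map whose values are all >= 0 due to ReLU.
    """
    out = np.tensordot(weights, feat, axes=([0], [0]))  # (H, W) weighted channel sum
    return np.maximum(out, 0.0)[np.newaxis]             # ReLU, restore channel axis

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))   # C=16 matches the 3x3 layers described above
w = rng.standard_normal(16)
single = reduce_channels_1x1(feat, w)    # shape (1, 8, 8), all entries >= 0
```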
The two intermediate feature maps are used to calculate the weight matrices. Because their channel count is 1, the weight matrices are computed much faster than they would be from the left and right feature maps whose channel count has not been reduced to 1.
In practical application, if the convolutional neural network module replaces the first 3 layers of the feature extraction network of PSMNet, the number of output channels may be 32; if it replaces the last 3 layers of the feature extraction network of PSMNet, the number of output channels may be 128. In a preferred embodiment, the convolutional neural network module of the present application replaces the last 6 layers of the feature extraction network of PSMNet, in which case the number of output channels is 128. It should be noted that the number of output channels may be changed according to actual requirements, and the present invention is not limited thereto.
Further, please refer to fig. 4 for a detailed process, which is a flowchart of step S300 in the stereo matching method according to the present invention.
As shown in fig. 4, the inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively specifically includes:
S301, for each pixel point on the third left feature map, comparing the pixel point with the pixel points that are in the same row of the third right feature map and within a specified range, so as to obtain a first pixel point whose difference from the pixel point is minimal, and calculating the weight of the pixel point according to the first pixel point, to obtain a fourth left feature map;
S302, for each pixel point on the third right feature map, comparing the pixel point with the pixel points that are in the same row of the third left feature map and within the specified range, so as to obtain a second pixel point whose difference from the pixel point is minimal, and calculating the weight of the pixel point according to the second pixel point, to obtain a fourth right feature map.
In this embodiment, the weight is calculated as: 1 - sigmoid(M), where M is the minimum difference value.
Specifically, the present invention calculates the left and right weight matrices from the two intermediate feature maps (the third left and third right feature maps) as follows: for every point on each intermediate feature map, within an artificially specified range in the same row of the other intermediate feature map, the point whose value has the smallest absolute difference from that point is found; this is the point that most "resembles" it. The weight of the point is then calculated from this minimum value: assuming the minimum value is M, the weight of the point is 1 - sigmoid(M), so the larger M is, the smaller the weight. Optionally, the specified range is initially 192, i.e. the initial search covers 192 pixel points, and the range is adjusted synchronously with the down-sampling of the network. The initial range is a hyper-parameter and can be adjusted according to actual requirements.
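The per-pixel search described above (minimum absolute difference within a horizontal range in the same row, then weight 1 - sigmoid(M)) can be sketched as follows. The handling of image borders and the symmetric search window are assumptions where the text is not explicit:

```python
import numpy as np

def weight_map(own: np.ndarray, other: np.ndarray, search_range: int = 192) -> np.ndarray:
    """For each pixel of `own` (shape (H, W)), find the minimum absolute
    difference M against pixels in the same row of `other` within
    `search_range` columns, and return the weight 1 - sigmoid(M).
    A larger M yields a smaller weight, suppressing features near
    disparity edges."""
    h, w = own.shape
    weights = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - search_range), min(w, x + search_range + 1)
            m = np.abs(other[y, lo:hi] - own[y, x]).min()
            weights[y, x] = 1.0 - 1.0 / (1.0 + np.exp(-m))  # 1 - sigmoid(M)
    return weights

left = np.array([[0.0, 1.0, 5.0]])
right = np.array([[0.0, 1.0, 1.0]])
w = weight_map(left, right, search_range=2)
# An exact match gives M = 0, hence weight 1 - sigmoid(0) = 0.5;
# the mismatched pixel (value 5.0) gets a much smaller weight.
```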
Further, in this embodiment, the fourth left feature map and the second left feature map are input into the multiplication unit, which performs a pixel-by-pixel dot product of the two maps to obtain a preliminary output feature map. The computed fourth left feature map has the same size as any single channel of the second left feature map; assuming the second left feature map has C channels, the dot-product operation copies the fourth left feature map C times along the channel dimension so that the two maps have the same size, and then directly multiplies the elements at corresponding positions, thereby suppressing the features of the input feature maps that contain disparity edges. Similarly, the fourth right feature map and the second right feature map are input into the multiplication unit, which performs a pixel-by-pixel dot product in the same manner to obtain the other preliminary output feature map.
Further, the final output feature maps (the target left and target right feature maps) are obtained by adding the two preliminary output feature maps (the fifth left and fifth right feature maps) to the original values of the two input feature maps (the first left and first right feature maps). Since each preliminary output feature map is exactly the same size as the corresponding input feature map, the operation directly adds the points at corresponding positions. This preserves the residual structure of the conventional convolutional neural network module, which has the advantage of facilitating back-propagation through the deep neural network.
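The multiplication and residual-addition steps described above can be expressed with NumPy broadcasting, which makes the "copy C times along the channel dimension" step implicit (a sketch under the shapes stated in the text, not the patent's implementation):

```python
import numpy as np

def apply_weights_with_residual(weights: np.ndarray, feat: np.ndarray,
                                module_input: np.ndarray) -> np.ndarray:
    """weights: (1, H, W); feat: (C, H, W); module_input: (C, H, W).
    Broadcasting replicates `weights` across all C channels, so no explicit
    copy is needed. The residual add keeps the module easy to backpropagate
    through, as in standard residual blocks."""
    suppressed = weights * feat          # (1,H,W) broadcasts over (C,H,W)
    return suppressed + module_input     # residual connection

rng = np.random.default_rng(1)
weights = rng.random((1, 4, 4))          # the weight map (e.g. a fourth feature map)
feat = rng.standard_normal((16, 4, 4))   # e.g. a second feature map with C = 16
x = rng.standard_normal((16, 4, 4))      # the module's original input
out = apply_weights_with_residual(weights, feat, x)  # shape (16, 4, 4)
```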
In this embodiment, adding the original values of the two input feature maps pixel by pixel constructs a residual structure, and the output of the residual structure is the output of the module. By suppressing features that contain disparity edges, the module improves the performance of the disparity estimation network at object edges. The module is built on a commonly used convolutional neural network structure; after the common module is replaced by the proposed one, the network's inference speed drops only very slightly while its accuracy improves considerably. Conventional convolutional neural network modules are widely applied across tasks; for the stereo matching task, the proposed module additionally introduces the computation and application of a weight matrix at extremely low time cost, without changing the structure of the conventional module. In stereo matching on the KITTI 2015 image set, the accuracy of PSMNet using the proposed module improves by 8.62% relative to the original, while the computation time increases by only 0.016 s.
S400, obtaining a perspective view according to the target left feature map and the target right feature map.
In this embodiment, a perspective view is finally obtained according to the target left feature map and the target right feature map. It should be noted that this step belongs to the prior art and is not described here again.
In summary, compared with the prior art, the embodiment of the invention has the following advantages:
the invention discloses a stereo matching method, which comprises the following steps: acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image; respectively extracting a first left feature map and a first right feature map which respectively correspond to the left image and the right image; inputting the first left feature map and the first right feature map to a convolutional neural network module respectively to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map; and obtaining a stereogram according to the target left characteristic diagram and the target right characteristic diagram, thereby achieving more accurate matching and simultaneously improving the matching efficiency.
Based on the stereo matching method, the invention also provides a stereo matching device, as shown in fig. 5, the device comprises:
an acquisition module 41, configured to acquire an original image pair obtained by a binocular camera, where the image pair includes a left image and a right image;
an extracting module 42, configured to extract a first left feature map and a first right feature map corresponding to the left image and the right image, respectively;
a data module 43, configured to input the first left feature map and the first right feature map to a convolutional neural network module, respectively, so as to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and the matching module 44 is configured to obtain a perspective view according to the target left feature map and the target right feature map.
It should be noted that, as those skilled in the art will clearly understand, the specific implementation of the stereo matching apparatus and of each of its modules may refer to the corresponding descriptions in the foregoing stereo matching method embodiments; for convenience and conciseness of description, they are not repeated here.
The stereo matching apparatus may be implemented in the form of a computer program that can be run on the stereo matching device shown in fig. 6.
Based on the stereo matching method, the invention further provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the steps in the stereo matching method according to the above embodiment.
Based on the stereo matching method, the present invention also provides a stereo matching device, as shown in fig. 6, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory)22, and may further include a communication Interface (Communications Interface)23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional application and data processing, i.e. implements the method in the above-described embodiments, by executing the software program, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high-speed random access memory and may also include a non-volatile memory. For example, it may be any of a variety of media that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk; it may also be a transient storage medium.
In addition, the specific processes by which the processor loads and executes the instructions in the storage medium and the device are described in detail in the method embodiments above and are not repeated here.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A stereo matching method, characterized in that the method comprises:
acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image;
respectively extracting a first left feature map and a first right feature map which respectively correspond to the left image and the right image;
inputting the first left feature map and the first right feature map to a convolutional neural network module respectively to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and obtaining a perspective view according to the target left feature map and the target right feature map.
2. The stereo matching method according to claim 1, wherein the convolutional neural network module comprises a first module and a second module, each of the first module and the second module comprises a first unit, a second unit, a weight calculation unit, a multiplication unit and an addition unit, and the first unit, the second unit, the weight calculation unit, the multiplication unit and the addition unit are connected in sequence.
3. The stereo matching method according to claim 2, wherein the step of inputting the first left feature map and the first right feature map into a convolutional neural network module, respectively, to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map specifically includes:
inputting the first left feature map and the first right feature map into the first unit respectively to obtain a second left feature map corresponding to the first left feature map and a second right feature map corresponding to the first right feature map;
inputting the second left feature map and the second right feature map into the second unit respectively to obtain a third left feature map corresponding to the second left feature map and a third right feature map corresponding to the second right feature map;
inputting the third left feature map and the third right feature map into the weight calculation unit to respectively obtain a fourth left feature map and a fourth right feature map;
inputting the fourth left feature map and the second left feature map into the multiplication unit to obtain a fifth left feature map;
inputting the fourth right feature map and the second right feature map into the multiplication unit to obtain a fifth right feature map;
inputting the fifth left feature map and the first left feature map into the adding unit to obtain a target left feature map;
inputting the fifth right feature map and the first right feature map into the adding unit to obtain a target right feature map.
4. The stereo matching method according to claim 1, wherein the first unit comprises two third units, each third unit comprising one 2D convolutional layer and one activation function.
5. The stereo matching method according to claim 4, wherein a convolution kernel size of the 2D convolutional layer is 3 × 3, and the activation function is ReLU.
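A single-channel sketch of one such third unit (a 3 × 3 convolution followed by ReLU, per claims 4 and 5) is given below; zero padding and stride 1 are assumptions, since the claims do not specify them:

```python
import numpy as np

def conv3x3_relu(x, kernel):
    """One 'third unit' pass: 3 x 3 2D convolution (zero padding,
    stride 1 assumed) followed by ReLU, on a single-channel map."""
    h, w = x.shape
    p = np.pad(x, 1)                 # zero-pad borders by one pixel
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)
    return np.maximum(out, 0.0)      # ReLU activation
```

With an identity kernel (single 1 at the center) the unit passes positive inputs through unchanged, while ReLU clamps any negative responses to zero.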
6. The stereo matching method according to claim 1, wherein the inputting the third left feature map and the third right feature map into the weight calculation unit to obtain a fourth left feature map and a fourth right feature map respectively specifically includes:
for each pixel point on the third left feature map, comparing the pixel point with the pixel points that are located in the same row on the third right feature map and fall within a specified range, so as to obtain a first pixel point having the minimum difference value from the pixel point, and calculating the weight of the pixel point according to the first pixel point, so as to obtain the fourth left feature map;
and for each pixel point on the third right feature map, comparing the pixel point with the pixel points that are located in the same row on the third left feature map and fall within a specified range, so as to obtain a second pixel point having the minimum difference value from the pixel point, and calculating the weight of the pixel point according to the second pixel point, so as to obtain the fourth right feature map.
7. The stereo matching method according to claim 4, wherein the weight is calculated by the formula:
weight = 1 - sigmoid(M), wherein M is the minimum difference value obtained for the pixel point.
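Claims 6 and 7 together suggest the following single-channel sketch of the weight-calculation unit; the search-range parameter `max_disp` and the use of the absolute element-wise difference are assumptions, since the claims do not fix them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weight_map(feat_a, feat_b, max_disp=3):
    """For each pixel of feat_a (H, W), find the same-row pixel of
    feat_b within max_disp columns whose value differs least, then
    set weight = 1 - sigmoid(M), where M is that minimum absolute
    difference (claims 6 and 7, single-channel sketch)."""
    h, w = feat_a.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            lo, hi = max(0, j - max_disp), min(w, j + max_disp + 1)
            m = np.min(np.abs(feat_b[i, lo:hi] - feat_a[i, j]))
            out[i, j] = 1.0 - sigmoid(m)
    return out
```

A well-matched pixel yields M near 0 and a weight near 0.5, while a pixel with no close match in the search range (typically near a parallax edge) yields a large M and a weight near 0, which is what lets the multiplication unit suppress edge features.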
8. A stereo matching apparatus, characterized in that the apparatus comprises:
an acquisition module, used for acquiring an original image pair obtained by a binocular camera, wherein the image pair comprises a left image and a right image;
the extraction module is used for respectively extracting a first left feature map and a first right feature map which respectively correspond to the left image and the right image;
the data module is used for respectively inputting the first left feature map and the first right feature map into a convolutional neural network module so as to obtain a target left feature map corresponding to the first left feature map and a target right feature map corresponding to the first right feature map;
and the matching module is used for obtaining a perspective view according to the target left feature map and the target right feature map.
9. A stereo matching apparatus, characterized in that the apparatus comprises: a processor and memory and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the stereo matching method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the stereo matching method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110244418.7A CN112949504B (en) | 2021-03-05 | 2021-03-05 | Stereo matching method, device, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110244418.7A CN112949504B (en) | 2021-03-05 | 2021-03-05 | Stereo matching method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949504A true CN112949504A (en) | 2021-06-11 |
CN112949504B CN112949504B (en) | 2024-03-19 |
Family
ID=76247880
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110244418.7A Active CN112949504B (en) | 2021-03-05 | 2021-03-05 | Stereo matching method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949504B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117078984A (en) * | 2023-10-17 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Binocular image processing method and device, electronic equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109522833A (en) * | 2018-11-06 | 2019-03-26 | 深圳市爱培科技术股份有限公司 | A kind of binocular vision solid matching method and system for Road Detection |
CN109978936A (en) * | 2019-03-28 | 2019-07-05 | 腾讯科技(深圳)有限公司 | Parallax picture capturing method, device, storage medium and equipment |
WO2019184888A1 (en) * | 2018-03-28 | 2019-10-03 | 华为技术有限公司 | Image processing method and apparatus based on convolutional neural network |
CN110647930A (en) * | 2019-09-20 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Image processing method and device and electronic equipment |
US20200226777A1 (en) * | 2018-02-01 | 2020-07-16 | Shenzhen Sensetime Technology Co., Ltd. | Depth estimation method and apparatus, electronic device, program, and medium |
CN111696148A (en) * | 2020-06-17 | 2020-09-22 | 中国科学技术大学 | End-to-end stereo matching method based on convolutional neural network |
CN111915660A (en) * | 2020-06-28 | 2020-11-10 | 华南理工大学 | Binocular disparity matching method and system based on shared features and attention up-sampling |
CN112150521A (en) * | 2020-08-24 | 2020-12-29 | 江苏大学 | PSmNet optimization-based image stereo matching method |
CN112270701A (en) * | 2020-10-26 | 2021-01-26 | 湖北汽车工业学院 | Packet distance network-based parallax prediction method, system and storage medium |
US20210065393A1 (en) * | 2019-08-28 | 2021-03-04 | Research & Business Foundation Sungkyunkwan University | Method for stereo matching using end-to-end convolutional neural network |
- 2021-03-05: CN202110244418.7A (CN), granted as CN112949504B, status Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200226777A1 (en) * | 2018-02-01 | 2020-07-16 | Shenzhen Sensetime Technology Co., Ltd. | Depth estimation method and apparatus, electronic device, program, and medium |
WO2019184888A1 (en) * | 2018-03-28 | 2019-10-03 | 华为技术有限公司 | Image processing method and apparatus based on convolutional neural network |
CN109522833A (en) * | 2018-11-06 | 2019-03-26 | 深圳市爱培科技术股份有限公司 | A kind of binocular vision solid matching method and system for Road Detection |
CN109978936A (en) * | 2019-03-28 | 2019-07-05 | 腾讯科技(深圳)有限公司 | Parallax picture capturing method, device, storage medium and equipment |
US20210065393A1 (en) * | 2019-08-28 | 2021-03-04 | Research & Business Foundation Sungkyunkwan University | Method for stereo matching using end-to-end convolutional neural network |
CN110647930A (en) * | 2019-09-20 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Image processing method and device and electronic equipment |
CN111696148A (en) * | 2020-06-17 | 2020-09-22 | 中国科学技术大学 | End-to-end stereo matching method based on convolutional neural network |
CN111915660A (en) * | 2020-06-28 | 2020-11-10 | 华南理工大学 | Binocular disparity matching method and system based on shared features and attention up-sampling |
CN112150521A (en) * | 2020-08-24 | 2020-12-29 | 江苏大学 | PSmNet optimization-based image stereo matching method |
CN112270701A (en) * | 2020-10-26 | 2021-01-26 | 湖北汽车工业学院 | Packet distance network-based parallax prediction method, system and storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117078984A (en) * | 2023-10-17 | 2023-11-17 | 腾讯科技(深圳)有限公司 | Binocular image processing method and device, electronic equipment and storage medium |
CN117078984B (en) * | 2023-10-17 | 2024-02-02 | 腾讯科技(深圳)有限公司 | Binocular image processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112949504B (en) | 2024-03-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109308719B (en) | Binocular parallax estimation method based on three-dimensional convolution | |
CN110189400B (en) | Three-dimensional reconstruction method, three-dimensional reconstruction system, mobile terminal and storage device | |
CN113077476B (en) | Height measurement method, terminal device and computer storage medium | |
CN113762358B (en) | Semi-supervised learning three-dimensional reconstruction method based on relative depth training | |
CN103778598B (en) | Disparity map ameliorative way and device | |
WO2020237492A1 (en) | Three-dimensional reconstruction method, device, apparatus, and storage medium | |
CN110070610B (en) | Feature point matching method, and feature point matching method and device in three-dimensional reconstruction process | |
CN110782412B (en) | Image processing method and device, processor, electronic device and storage medium | |
CN115222889A (en) | 3D reconstruction method and device based on multi-view image and related equipment | |
CN111105451B (en) | Driving scene binocular depth estimation method for overcoming occlusion effect | |
CN108305281A (en) | Calibration method, device, storage medium, program product and the electronic equipment of image | |
CN112150518A (en) | Attention mechanism-based image stereo matching method and binocular device | |
CN112949504B (en) | Stereo matching method, device, equipment and storage medium | |
CN113313740B (en) | Disparity map and surface normal vector joint learning method based on plane continuity | |
CN111429571B (en) | Rapid stereo matching method based on spatio-temporal image information joint correlation | |
CN111179325B (en) | Binocular depth estimation method and binocular depth estimation device | |
CN112233149A (en) | Scene flow determination method and device, storage medium and electronic device | |
CN112270701A (en) | Packet distance network-based parallax prediction method, system and storage medium | |
CN109218706B (en) | Method for generating stereoscopic vision image from single image | |
CN116012449A (en) | Image rendering method and device based on depth information | |
CN115239559A (en) | Depth map super-resolution method and system for fusion view synthesis | |
CN112802079A (en) | Disparity map acquisition method, device, terminal and storage medium | |
CN113132706A (en) | Controllable position virtual viewpoint generation method and device based on reverse mapping | |
Liu et al. | Binocular depth estimation using convolutional neural network with Siamese branches | |
CN112381721A (en) | Human face three-dimensional reconstruction method based on binocular vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right |
Denomination of invention: Stereo matching methods, devices, equipment, and storage media Granted publication date: 20240319 Pledgee: Shenzhen small and medium sized small loan Co.,Ltd. Pledgor: SHENZHEN APICAL TECHNOLOGY CO.,LTD. Registration number: Y2024980016140 |