CN110335228B - Method, device and system for determining image parallax

Method, device and system for determining image parallax

Info

Publication number
CN110335228B
CN110335228B (application CN201810276957.7A)
Authority
CN
China
Prior art keywords
parallax
layer
images
processed
image
Prior art date
Legal status
Active
Application number
CN201810276957.7A
Other languages
Chinese (zh)
Other versions
CN110335228A
Inventor
张奎
熊江
杨平
谢迪
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201810276957.7A
Publication of CN110335228A
Application granted
Publication of CN110335228B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/97 Determining parameters from multiple pictures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20228 Disparity calculation for image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the invention provide a method, a device, and a system for determining image parallax. An unsupervised neural network is used to determine the parallax among a plurality of images. The network is trained with a loss function containing one or more error parameters, so no ground-truth parallax is needed as supervision; during training, the error parameters gradually decrease, i.e., the accuracy of the determined parallax increases. The embodiments therefore determine parallax with high accuracy.

Description

Method, device and system for determining image parallax
Technical Field
The invention relates to the technical field of computer vision, in particular to a method, a device and a system for determining image parallax.
Background
The multi-view camera can simultaneously acquire a plurality of images of the same scene, and the visual angle range is enlarged. The binocular camera in the multi-view camera can also simulate the binocular vision of human eyes, and provides a better visual effect. Typically, it is necessary to calculate the parallax between the multiple images captured by the multi-view camera.
A typical scheme for computing disparity is as follows: the real (ground-truth) parallax is used as supervision information, a plurality of images collected by the multi-view camera are used as input, a neural network is trained, and the parallax between the images collected by the multi-view camera is then calculated with the trained network. In this scheme, the real parallax must be obtained in advance; however, obtaining it is difficult, and the real parallax that can typically be obtained has low accuracy, so the accuracy of the calculated parallax is also low.
Disclosure of Invention
The embodiment of the invention aims to provide a method, a device and a system for determining image parallax so as to improve parallax accuracy.
In order to achieve the above object, an embodiment of the present invention provides a method for determining image parallax, including:
acquiring a plurality of images to be processed;
inputting the multiple images to be processed into an unsupervised neural network obtained by pre-training; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters;
extracting the features of the multiple images to be processed by using the feature extraction layer;
superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features;
coding the superposed features by using the feature coding layer to obtain coded features;
and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
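For concreteness, the flow of these steps can be sketched as follows (a minimal PyTorch-style illustration only; the module names `feature_extraction`, `feature_encoding`, and `parallax_recovery` are hypothetical stand-ins for the layers named above, not names from the patent):

```python
import torch

def determine_parallax(net, images):
    # images: list of images to be processed, each a (B, 3, H, W) tensor
    features = [net.feature_extraction(img) for img in images]  # each (B, F, H/x, W/x)
    stacked = torch.cat(features, dim=1)                        # feature superposition
    encoded = net.feature_encoding(stacked)                     # (B, C, H/y, W/y)
    disparities = net.parallax_recovery(encoded)                # deconvolution; disp1..disp4
    return disparities[0]                                       # finest-scale disparity map
```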
Optionally, the feature tensor dimension of each image to be processed is W × H × 3, where W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed;
the extracting, by using the feature extraction layer, of the features of the images to be processed may include:
for each image to be processed, convolving the image with the feature extraction layer to obtain features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$, where F represents the number of output channels of the feature extraction layer and x denotes a first preset downsampling multiple.
Optionally, the plurality of images to be processed comprise N pairs of images to be processed, where N is a positive integer; the step of superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features comprises:
superposing the two features, each of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$, corresponding to each pair of images to be processed, to obtain superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$.
Optionally, the encoding, by using the feature encoding layer, of the superposed features to obtain encoded features may include:
encoding the superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$ with the feature encoding layer to obtain encoded features of tensor dimension $\frac{W}{y} \times \frac{H}{y} \times C$, where C represents the number of output channels of the feature encoding layer, y represents a second preset downsampling multiple, and y is greater than x.
Optionally, the parallax recovery layer includes a plurality of active two-dimensional deconvolution layers; the performing deconvolution operation on the encoded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed may include:
and in each activated two-dimensional deconvolution layer, obtaining the parallax under one scale by using a preset activation function.
Optionally, the preset loss function is a sum of loss values of a plurality of parallaxes obtained in the parallax recovery layer; the loss value comprises one or more of the following error parameters: image matching error parameters, parallax image smoothness error parameters and parallax image left-right consistency error parameters.
Optionally, the loss value of the parallax at one scale is: the image matching error parameter multiplied by a first weight, plus the disparity map smoothness error parameter multiplied by a second weight, plus the disparity map left-right consistency error parameter multiplied by a third weight, where the first weight, the second weight, and the third weight are preset.
Optionally, after the unsupervised neural network is obtained by training a plurality of groups of sample images by using a preset loss function, the method may further include:
determining an abnormal region in the resulting disparity of the unsupervised neural network output;
setting a new error parameter aiming at the abnormal area;
adding the new error parameter to the preset loss function to obtain a new loss function;
adjusting the obtained unsupervised neural network by using the new loss function and the determined abnormal area to obtain an adjusted unsupervised neural network;
the inputting the plurality of images to be processed into an unsupervised neural network obtained by pre-training comprises the following steps:
and inputting the plurality of images to be processed into the adjusted unsupervised neural network.
Optionally, the determining an abnormal region in the parallax of the obtained unsupervised neural network output may include:
determining an abnormal region in the obtained disparity output by the unsupervised neural network and a comparison region which is positioned on the same plane with the abnormal region;
the setting of the new error parameter for the abnormal region may include:
and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
Optionally, the acquiring the multiple images to be processed may include: acquiring a plurality of images acquired by a multi-view camera as images to be processed;
after obtaining the parallaxes of the multiple images to be processed, the method may further include:
and calculating the depth information of the multi-view camera according to the obtained parallax.
In order to achieve the above object, an embodiment of the present invention further provides an apparatus for determining image parallax, including:
the acquisition module is used for acquiring a plurality of images to be processed;
the input module is used for inputting the images to be processed into an unsupervised neural network obtained by pre-training; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters;
the extraction module is used for extracting the features of the images to be processed by utilizing the feature extraction layer;
the superposition module is used for superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features;
the coding module is used for coding the superposed features by utilizing the feature coding layer to obtain coded features;
and the parallax recovery module is used for performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
Optionally, the feature tensor dimension of each image to be processed is W × H × 3, where W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed;
the extraction module may be specifically configured to: for each image to be processed, convolve the image with the feature extraction layer to obtain features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$, where F denotes the number of output channels of the feature extraction layer and x denotes a first preset downsampling multiple.
Optionally, the plurality of images to be processed comprise N pairs of images to be processed, where N is a positive integer; the superimposing module may be specifically configured to:
superpose the two features, each of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$, corresponding to each pair of images to be processed, to obtain superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$.
Optionally, the encoding module may be specifically configured to:
encode the superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$ with the feature encoding layer to obtain encoded features of tensor dimension $\frac{W}{y} \times \frac{H}{y} \times C$, where C represents the number of output channels of the feature encoding layer, y represents a second preset downsampling multiple, and y is greater than x.
Optionally, the parallax recovery layer includes a plurality of active two-dimensional deconvolution layers; the parallax recovery module may be specifically configured to:
and in each activated two-dimensional deconvolution layer, obtaining the parallax under one scale by using a preset activation function.
Optionally, the preset loss function is a sum of loss values of a plurality of parallaxes obtained in the parallax recovery layer; the loss value comprises one or more of the following error parameters: image matching error parameters, parallax image smoothness error parameters and parallax image left-right consistency error parameters.
Optionally, the loss value of the parallax at one scale is: the image matching error parameter multiplied by a first weight, plus the disparity map smoothness error parameter multiplied by a second weight, plus the disparity map left-right consistency error parameter multiplied by a third weight, where the first weight, the second weight, and the third weight are preset.
Optionally, the apparatus may further include:
a determining module for determining an abnormal region in the obtained disparity of the unsupervised neural network output;
the setting module is used for setting a new error parameter aiming at the abnormal area;
the adding module is used for adding the new error parameter into the preset loss function to obtain a new loss function;
the adjusting module is used for adjusting the obtained unsupervised neural network by utilizing the new loss function and the determined abnormal area to obtain an adjusted unsupervised neural network;
the input module is specifically configured to: and inputting the plurality of images to be processed into the adjusted unsupervised neural network.
Optionally, the determining module may be specifically configured to: determining an abnormal region in the obtained disparity output by the unsupervised neural network and a comparison region which is positioned on the same plane with the abnormal region;
the setting module may be specifically configured to: and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
Optionally, the obtaining module may be specifically configured to:
acquiring a plurality of images acquired by a multi-view camera as images to be processed;
the apparatus may further include:
and the calculating module is used for calculating the depth information of the multi-view camera according to the obtained parallax.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor and a memory,
a memory for storing a computer program;
and a processor for implementing any one of the above-described methods for determining image parallax when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any one of the above image parallax determining methods.
In order to achieve the above object, an embodiment of the present invention further provides a system for determining image parallax, including: a multi-view camera and a processing device, wherein,
the multi-view camera is used for acquiring a plurality of images and sending the images to the processing equipment;
the processing device is used for receiving the plurality of images as a plurality of images to be processed; inputting the multiple images to be processed into an unsupervised neural network obtained by pre-training; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters; extracting the features of the multiple images to be processed by using the feature extraction layer; superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features; coding the superposed features by using the feature coding layer to obtain coded features; and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
By applying the embodiment of the invention, an unsupervised neural network is used to determine the parallax between a plurality of images. The network is trained with a loss function and requires no ground-truth parallax as supervision; the loss function contains one or more error parameters, which gradually decrease during training, i.e., the accuracy of the determined parallax increases. The scheme therefore determines parallax with high accuracy.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart illustrating a method for determining image parallax according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a feature extraction layer in an unsupervised neural network according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a feature encoding layer and a disparity recovering layer in an unsupervised neural network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a part of a parallax recovery layer according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a parallax error scene according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an apparatus for determining image parallax according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a system for determining image parallax according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the above technical problem, embodiments of the present invention provide a method, an apparatus, and a system for determining image parallax. The method and apparatus may be applied to various electronic devices with an image processing function, such as a mobile phone, a computer, and the like, or may also be applied to a camera with an image processing function, which is not limited specifically.
First, a method for determining image parallax according to an embodiment of the present invention will be described in detail.
Fig. 1 is a schematic flowchart of a method for determining image parallax according to an embodiment of the present invention, including:
s101: and acquiring a plurality of images to be processed.
The images to be processed are a plurality of images of which the parallax needs to be determined. For example, a plurality of images collected by the multi-view camera may be acquired as the image to be processed. The execution subject of the embodiment of the present invention may be the multi-view camera, or may be an electronic device communicatively connected to the multi-view camera.
As an implementation manner, the multi-view camera may be a binocular camera, the binocular camera may be a horizontal binocular camera, a vertical binocular camera, a pinhole binocular camera, a fisheye binocular camera, or the like, or the multi-view camera may be a three-view camera or a camera with more than three views, which is not limited specifically. And taking the left image and the right image acquired by the binocular camera as images to be processed.
S102: and inputting the multiple images to be processed into an unsupervised neural network obtained by pre-training, wherein the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer. The unsupervised neural network is as follows: the method comprises the steps that a plurality of groups of sample images are trained by utilizing a preset loss function, each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters.
The structure of the unsupervised neural network is preset, and the network parameters of the unsupervised neural network are obtained through training. Specifically, the network parameters of the unsupervised neural network can be initialized by using an Xavier parameter initialization method; and then, taking a plurality of groups of sample images as input, and training the network parameters of the unsupervised neural network by using a preset loss function to finally obtain the unsupervised neural network after training. Each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters. The training process is a process in which the loss function gradually decreases, that is, a process in which the error parameter gradually decreases, that is, a process in which the output parallax is gradually accurate.
If the unsupervised neural network is only used to determine the disparity of the images captured by the binocular cameras, each set of sample images may include two images, which may be the left and right images captured by the same binocular camera.
For example, the unsupervised neural network may be trained with the Adam parameter optimization method, or with other optimization algorithms, which is not specifically limited. During training, the initial learning rate may be set to 10⁻⁴. The batch size is related to the image resolution and the graphics memory: for example, when an NVIDIA TITAN X graphics card processes images with a resolution of 640 × 480, the total batch size may be the number of GPUs (Graphics Processing Units) × 8; in other scenarios the batch size may take other values.
Assuming that 40,000 sets of sample images are acquired and the number of training epochs is set to 50, the learning rate can be reduced to one half of the initial rate at the 30th epoch and to one quarter of the initial rate at the 40th epoch. Those skilled in the art will appreciate that when training approaches the optimal value, a large learning rate makes the training result unstable; reducing the learning rate during training therefore improves the stability of the result.
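As an illustration of this setup only (PyTorch is an assumed framework here; the patent names Adam and Xavier initialization but prescribes no library):

```python
import torch
import torch.nn as nn

def configure_training(model, initial_lr=1e-4):
    # Xavier initialization of the network parameters.
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.xavier_uniform_(m.weight)
    optimizer = torch.optim.Adam(model.parameters(), lr=initial_lr)
    # Halve the rate at epoch 30 and quarter it at epoch 40 (50 epochs total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 40], gamma=0.5)
    return optimizer, scheduler
```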
Taking a group of sample images as an example, the training process of the unsupervised neural network is introduced below; supposing that the group of sample images are a left image and a right image acquired by the same binocular camera, the unsupervised neural network structurally comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer:
step one, inputting a left image and a right image into a feature extraction layer, wherein the feature tensor dimensions of the left image and the right image are W x H3, W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed, namely the number of RGB (Red, Green, Blue, Red, Green and Blue) channels.
Respectively convolving the left image and the right image in the feature extraction layer to obtain two feature tensor dimensions of
Figure BDA0001613834340000091
And (c) characterizing, wherein F represents the number of output channels of the feature extraction layer, and x represents a first preset down-sampling multiple.
For example, the feature extraction layer may be a 5-layer convolutional neural network, such as that shown in fig. 2: conv_f1-conv_f5 are the 5 two-dimensional convolutional layers, and ⊕ denotes adding its two inputs and then performing the BN (Batch Normalization) and ELU (Exponential Linear Unit) operations, i.e., an addition followed by a BN layer and an ELU layer. To reduce the loss of features, the second ⊕ in fig. 2 may omit the BN and ELU operations.
In fig. 2, conv _ f1 may be a two-dimensional convolution layer with convolution kernel of 5 × 5 and step size of 2, and the resolution of the image to be processed may be reduced by this layer, that is, the image to be processed may be downsampled by this layer, so that memory occupation and calculation overhead may be reduced, and the receptive field of the convolution kernel during feature extraction may be increased, thereby better extracting the global features. Specifically, the first down-sampling multiple x may be preset, for example, x may be 2, that is, the resolution of the image to be processed is reduced by one half. To distinguish from the sampling multiple in the following, the sampling multiple in the feature extraction layer is referred to as a first sampling multiple x here.
The four two-dimensional convolutional layers conv _ f2-conv _ f5 may each be a two-dimensional convolutional layer with a convolution kernel of 3 × 3 and a step size of 1, and each of conv _ f2 and conv _ f4 may be followed by a BN (Batch Normalization) layer and an ELU activation layer. In fig. 2, the number of output channels of the 5 two-dimensional convolution layers is the same, or the number of output channels of each layer in the feature extraction layer is the same, and is denoted as F. F may be 32, or may be other, and is not particularly limited.
If x is 2 and F is 32, the feature tensor dimension output by the feature extraction layer is $\frac{W}{2} \times \frac{H}{2} \times 32$. The feature extraction layer extracts features from the left image and the right image to obtain two features, each of tensor dimension $\frac{W}{2} \times \frac{H}{2} \times 32$.
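A minimal sketch of such a five-layer extractor (assuming x = 2 and F = 32; the exact wiring of the residual additions marked ⊕ in fig. 2 is an assumption):

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """conv_f1 downsamples by 2 (5x5, stride 2); conv_f2-conv_f5 keep
    resolution (3x3, stride 1); BN + ELU follow most layers."""
    def __init__(self, out_channels=32):
        super().__init__()
        self.conv_f1 = nn.Sequential(
            nn.Conv2d(3, out_channels, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(out_channels), nn.ELU())
        def block():
            return nn.Sequential(
                nn.Conv2d(out_channels, out_channels, 3, 1, 1),
                nn.BatchNorm2d(out_channels), nn.ELU())
        self.conv_f2, self.conv_f3 = block(), block()
        self.conv_f4, self.conv_f5 = block(), block()

    def forward(self, image):                     # image: (B, 3, H, W)
        f1 = self.conv_f1(image)                  # (B, F, H/2, W/2)
        f3 = self.conv_f3(self.conv_f2(f1)) + f1  # first residual addition
        f5 = self.conv_f5(self.conv_f4(f3)) + f3  # second addition, no BN/ELU after
        return f5
```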
Step two: superpose the two features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$ extracted by the feature extraction layer to obtain the superposed features, whose feature tensor dimension is $\frac{W}{x} \times \frac{H}{x} \times 2F$. If x is 2 and F is 32, the feature tensor dimension of the superposed features is $\frac{W}{2} \times \frac{H}{2} \times 64$.
Assuming that there are M images in a group of sample images, where the M images include N pairs of binocular images (a left image and a right image each), M is a positive integer greater than 1 and N is a positive integer, the feature tensor dimension of the superposed features is $\frac{W}{x} \times \frac{H}{x} \times 2NF$; the formula above is the N = 1 case.
Step three: input the superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$ into the feature encoding layer, and encode them with that layer to obtain encoded features of tensor dimension $\frac{W}{y} \times \frac{H}{y} \times C$, where C represents the number of output channels of the feature encoding layer, y represents a second preset downsampling multiple, and y is greater than x.
For example, the feature encoding layer may be the upper half of fig. 3, comprising the 6 boxes conv1-conv6. Each box contains two-dimensional convolutional layers with 3 × 3 convolution kernels: the layer in conv1 has a step size of 1, while in each of conv2-conv6 the first two-dimensional convolutional layer has a step size of 1 and the second has a step size of 2. Each two-dimensional convolutional layer in the feature encoding layer may be followed by a BN (Batch Normalization) layer and an ELU activation layer.
This layer downsamples the superposed features, so that the global features of the image can be obtained. Specifically, the second downsampling multiple y may be set in advance: in fig. 3, each of conv2-conv6 downsamples once, and since the feature extraction layer also downsamples once, y may be 2⁶ = 64. To distinguish it from the sampling multiple above, the sampling multiple of the feature encoding layer is referred to here as the second sampling multiple y.
The number of output channels of each stage in the feature encoding layer doubles with each downsampling. For example, the input of the feature encoding layer has 64 channels and an input feature tensor dimension of $\frac{W}{2} \times \frac{H}{2} \times 64$; after the 5 downsamplings of conv2-conv6, the number of output channels of the feature encoding layer is 64 × 2⁵ = 2048, and the output feature tensor dimension is $\frac{W}{64} \times \frac{H}{64} \times 2048$.
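A compact sketch of such an encoder (the channel counts and the placement of the stride-2 layers are inferred from the description and may differ from the patent's figure):

```python
import torch.nn as nn

def enc_stage(c_in, c_out, downsample):
    """One conv box: a stride-1 3x3 conv, optionally followed by a
    stride-2 3x3 conv; every conv is followed by BN + ELU."""
    layers = [nn.Conv2d(c_in, c_out, 3, 1, 1), nn.BatchNorm2d(c_out), nn.ELU()]
    if downsample:
        layers += [nn.Conv2d(c_out, c_out, 3, 2, 1), nn.BatchNorm2d(c_out), nn.ELU()]
    return nn.Sequential(*layers)

# conv1 keeps resolution; conv2-conv6 each halve it and double the channels:
# 64 -> 128 -> 256 -> 512 -> 1024 -> 2048 output channels.
encoder = nn.Sequential(
    enc_stage(64, 64, downsample=False),
    *[enc_stage(64 * 2**i, 64 * 2**(i + 1), downsample=True) for i in range(5)])
```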
And step four, inputting the coded features into a parallax recovery layer, and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
For example, the parallax recovery layer may be the lower half of fig. 3, comprising upconv, upconv5-upconv1, and conv. Here, upconv is a two-dimensional deconvolution layer with a 3 × 3 convolution kernel and a step size of 2; each of the upconv5 and upconv4 boxes contains two two-dimensional deconvolution layers with 3 × 3 kernels, the first with step size 1 and the second with step size 2; the structure of each of the upconv3, upconv2, and upconv1 boxes can be as shown in fig. 4. In fig. 3, conv is an activated two-dimensional deconvolution layer with a 3 × 3 kernel and a step size of 1; in conv, a disparity at one scale, which can be a disparity map, is obtained by using a sigmoid activation function.
Conv in FIG. 4 is the same as conv in FIG. 3, and is an activated two-dimensional deconvolution layer; in fig. 4, iconv1 is a two-dimensional deconvolution layer having a convolution kernel of 3 × 3 and a step size of 1, and iconv2 is a two-dimensional deconvolution layer having a convolution kernel of 3 × 3 and a step size of 2.
In fig. 3, each frame in the upconv3-upconv1 contains one conv, and after the upconv1, there is one conv, that is, the parallax recovery layer contains 4 convs, so that 4 parallax maps can be obtained: disp1-disp 4. The 4 disparity maps are disparity maps at different scales, and the resolutions of the 4 disparity maps are different, wherein the high-resolution disparity map can better retain detailed information in an image, and the low-resolution disparity map can better recover global structure information of the disparity map.
For example, in fig. 3, except for the last convolution layer, other convolution layers may be followed by a BN layer and an ELU layer.
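The activated deconvolution can be sketched as follows (a minimal sketch; the `DispHead` name and the `max_disp_fraction` output scaling are assumptions, since the text specifies only a sigmoid activation):

```python
import torch
import torch.nn as nn

class DispHead(nn.Module):
    """The activated two-dimensional deconvolution ('conv' in figs. 3 and 4):
    predicts a one-channel disparity map at the current scale."""
    def __init__(self, in_channels, max_disp_fraction=0.3):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, 1, kernel_size=3,
                                         stride=1, padding=1)
        self.max_disp_fraction = max_disp_fraction  # assumed output scaling

    def forward(self, x):
        # Sigmoid yields values in (0, 1); scaling them to a disparity range
        # is a common convention and an assumption here.
        return self.max_disp_fraction * torch.sigmoid(self.deconv(x))
```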
Step five: calculate the sum of the loss values of the 4 disparity maps with the preset loss function. The loss function can comprise three error parameters: an image matching error parameter, a disparity map smoothness error parameter, and a disparity map left-right consistency error parameter. The loss value of one disparity map is the image matching error parameter multiplied by a first weight, plus the disparity map smoothness error parameter multiplied by a second weight, plus the disparity map left-right consistency error parameter multiplied by a third weight, where the first, second, and third weights are preset.
The loss value of a disparity map may be:

$$L_n = w_{im}\left(C^{l}_{im} + C^{r}_{im}\right) + w_{ds}\left(C^{l}_{ds} + C^{r}_{ds}\right) + w_{lr}\left(C^{l}_{lr} + C^{r}_{lr}\right)$$

where $L_n$ denotes the loss of the nth disparity map (n can be 1, 2, 3, 4 in the above example), $w_{im}$ represents the first weight, $w_{ds}$ represents the second weight, and $w_{lr}$ represents the third weight; for example, the first weight may be 1.0, the second weight 0.85, and the third weight 1.0. $C^{l}_{im}$ and $C^{r}_{im}$ represent the image matching error parameters of the left and right images, $C^{l}_{ds}$ and $C^{r}_{ds}$ represent the disparity map smoothness error parameters of the left and right images, and $C^{l}_{lr}$ and $C^{r}_{lr}$ represent the disparity map left-right consistency error parameters of the left and right images.
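Expressed as code, the per-scale combination might look like this (a sketch under the assumption that the three C terms are computed by helpers defined elsewhere):

```python
def disparity_map_loss(C_im, C_ds, C_lr, w_im=1.0, w_ds=0.85, w_lr=1.0):
    """Loss of one disparity map:
    L_n = w_im(C_im^l + C_im^r) + w_ds(C_ds^l + C_ds^r) + w_lr(C_lr^l + C_lr^r)."""
    return (w_im * (C_im['left'] + C_im['right'])
            + w_ds * (C_ds['left'] + C_ds['right'])
            + w_lr * (C_lr['left'] + C_lr['right']))

def preset_loss(per_scale_terms):
    # The preset loss function: sum of the loss values of the 4 disparity maps.
    return sum(disparity_map_loss(**terms) for terms in per_scale_terms)
```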
It is to be understood that if a set of sample images includes three images, each bracket of the preset loss function includes three terms, the first bracket includes a matching error parameter of the three images, the second bracket includes a disparity map smoothness error parameter of the three images, and the third bracket includes a disparity map left-right consistency error parameter of the three images. Similarly, the number of images in a set of sample images may be other, and is not listed.
Assuming that there are M images in a set of sample images and the M images include N pairs of binocular images, the loss value of one disparity map sums the corresponding error terms over the N pairs and may additionally include a cross-pair disparity consistency term $D_{ld}$ (and likewise $D_{rd}$), the absolute value of the difference between the disparity maps corresponding to any two of the N pairs. That is, if the left disparity maps of two pairs of binocular images are $D_{l1}$ and $D_{l2}$ respectively, then $D_{ld} = |D_{l1} - T_{21} D_{l2}|$, where $T_{21}$ denotes the transformation parameters from disparity map $D_{l2}$ to $D_{l1}$, which can be obtained by calibration.
Assuming that a set of sample images includes a left image and a right image, the left image matching error parameter may be, for one embodiment:

$$C^{l}_{im} = \frac{1}{N}\sum_{ij}\left[\alpha\,\frac{1 - \mathrm{SSIM}\!\left(I^{l}_{ij}, \tilde{I}^{l}_{ij}\right)}{2} + (1-\alpha)\left\|I^{l}_{ij} - \tilde{I}^{l}_{ij}\right\|\right]$$

where N represents the number of pixel points in the left image, ij represents the coordinate of one pixel point, and α represents the weight between the SSIM (Structural Similarity Index) loss value of the left image and the first-order image difference loss value; $I^{l}$ denotes the left image, $\tilde{I}^{l}$ denotes the reconstructed left image generated from the right image and the disparity map of the left image, and $\left\|I^{l}_{ij} - \tilde{I}^{l}_{ij}\right\|$ is the first-order color value difference between the left image and the reconstructed left image.
From the right image and the disparity map of the left image, the reconstructed left image can be generated as:

$$\tilde{I}^{l}_{ij} = I^{r}_{i,\,j + d^{l}_{ij}}$$

where $I^{r}$ denotes the right image and $d^{l}_{ij}$ denotes the disparity value. The disparity value can be floating-point data, and a bilinear interpolation method can be used for the image reconstruction.
The calculation process of the right image matching error parameter is similar to that of the left image matching error parameter, and is not repeated.
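A sketch of the bilinear reconstruction using grid sampling (a standard way to realize the warp; the patent names bilinear interpolation but no specific operator, and the sign convention below is an assumption):

```python
import torch
import torch.nn.functional as F

def reconstruct_left(right, disp_left):
    """Warp the right image (B, C, H, W) with the left disparity map
    (B, 1, H, W), in pixels, to synthesize the left view."""
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    xs = xs.to(right) - disp_left.squeeze(1)  # shift sample column by disparity
    ys = ys.to(right).expand(b, -1, -1)
    # Normalize coordinates to [-1, 1], grid_sample's convention.
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    return F.grid_sample(right, grid, mode='bilinear', align_corners=True)
```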
As described above, a group of sample images may include multiple images. If there are M images in the group and the M images include N pairs of binocular images, the matching error parameter of the left views over the N pairs may be:

$$C^{l}_{im} = \sum_{k=1}^{N} C^{l,k}_{im}$$

i.e., the sum of the left-view image reconstruction errors over the N pairs of binocular images.
Assuming that a set of sample images includes a left image and a right image, as an embodiment, the disparity map smoothness error parameter of the left image may be:

$$C^{l}_{ds} = \frac{1}{N}\sum_{ij}\left(\left|\partial^{2}_{x} d^{l}_{ij}\right| e^{-\left\|\partial^{2}_{x} I^{l}_{ij}\right\|} + \left|\partial^{2}_{y} d^{l}_{ij}\right| e^{-\left\|\partial^{2}_{y} I^{l}_{ij}\right\|}\right)$$

where N represents the number of pixel points in the left image, ij represents the coordinate of one pixel point, $\partial^{2}_{x}$ represents a horizontal second-order gradient, and $\partial^{2}_{y}$ represents a vertical second-order gradient.
The purpose of setting the disparity map smoothness error parameter in the loss function is to make the disparity map as smooth as possible, that is, to minimize the sum of the gradients of the disparity map. However, because disparity is discontinuous at object edges in the image, i.e., the disparity jumps there, the smoothness error parameter weights the disparity gradients with the image gradients of the left and right images. The calculation process of the disparity map smoothness error parameter of the right image is similar to that of the left image and is not repeated.
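A sketch of an edge-weighted second-order smoothness term following the description above (the exact form of the gradient weighting is an assumption):

```python
import torch

def smoothness_loss(disp, image):
    """Penalize second-order disparity gradients, down-weighted at image
    edges where disparity legitimately jumps."""
    def second_grad(t, dim):
        return t.diff(dim=dim).diff(dim=dim)  # second-order difference
    dx2_d = second_grad(disp, dim=3).abs()
    dy2_d = second_grad(disp, dim=2).abs()
    wx = torch.exp(-second_grad(image, dim=3).abs().mean(1, keepdim=True))
    wy = torch.exp(-second_grad(image, dim=2).abs().mean(1, keepdim=True))
    return (dx2_d * wx).mean() + (dy2_d * wy).mean()
```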
As described above, a group of sample images may include multiple images. If there are M images in the group and the M images include N pairs of binocular images, the disparity map smoothness error parameter of the left views over the N pairs may be:

$$C^{l}_{ds} = \sum_{k=1}^{N} C^{l,k}_{ds}$$
assuming that a set of sample images includes a left image and a right image, as an embodiment, the left-right disparity map left-right consistency error parameter of the left image may be:
Figure BDA0001613834340000146
wherein N represents the number of all pixel points in the left image, ij represents the coordinate of one pixel point,
Figure BDA0001613834340000147
a disparity map representing the left image is shown,
Figure BDA0001613834340000148
a disparity map of the right image is shown. The calculation process of the left and right consistency error parameters of the disparity map of the right image is similar to that of the left and right consistency error parameters of the disparity map of the left image, and the calculation process is not repeated.
The purpose of setting the disparity map left-right consistency error parameter in the loss function is to make the disparity maps of the left and right images output by the unsupervised neural network as consistent as possible. As described above, a group of sample images may include multiple images; if there are M images in the group and the M images include N pairs of binocular images, the disparity map left-right consistency error parameter of the left views over the N pairs may be:

$$C^{l}_{lr} = \sum_{k=1}^{N} C^{l,k}_{lr}$$
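In code, the per-image term might look like this (a sketch reusing the grid-sampling warp from above; projecting the right disparity map with the left disparities is an assumed realization of the comparison):

```python
def lr_consistency_loss(disp_left, disp_right):
    # Sample the right disparity map at the positions given by the left
    # disparities, then take the mean absolute difference against the
    # left disparity map.
    projected = reconstruct_left(disp_right, disp_left)
    return (disp_left - projected).abs().mean()
```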
various loss values in the loss function can be derived, and network parameters of the unsupervised neural network are updated iteratively by using an Adam parameter optimization method. During the training process, the value of the loss function becomes smaller and smaller. For example, the number of times of training may be set, and when the number of times of training is reached, the training is completed. Then, the trained unsupervised neural network can be used for processing the image to be processed. It should be noted that the unsupervised neural network is trained in advance before S101.
S103: and extracting the features of the multiple images to be processed by utilizing the feature extraction layer.
In general, the feature tensor dimension of an RGB image is W × H × 3, where W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of RGB channels; as an embodiment, S103 may include:
for each image to be processed, convolving the image with the feature extraction layer to obtain features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$, where F denotes the number of output channels of the feature extraction layer and x denotes the first preset downsampling multiple.
For example, the feature extraction layer may be the 5-layer convolutional neural network shown in fig. 2: conv_f1-conv_f5 are the 5 two-dimensional convolutional layers, and ⊕ denotes adding its two inputs and then performing the BN (Batch Normalization) and ELU (Exponential Linear Unit) operations, i.e., an addition followed by a BN layer and an ELU layer. To reduce the loss of features, the second ⊕ in fig. 2 may omit the BN and ELU operations.
In fig. 2, conv _ f1 may be a two-dimensional convolution layer with convolution kernel of 5 × 5 and step size of 2, and the resolution of the image to be processed may be reduced by this layer, that is, the image to be processed may be down-sampled by this layer, so that the memory usage and calculation overhead may be reduced, and the receptive field of feature extraction may be increased, thereby better extracting the global features. Specifically, the first down-sampling multiple x may be preset, for example, x may be 2, that is, the resolution of the image to be processed is reduced by one half. To distinguish from the sampling multiple in the following, the sampling multiple in the feature extraction layer is referred to as a first sampling multiple x here.
The four two-dimensional convolutional layers conv _ f2-conv _ f5 may each be a two-dimensional convolutional layer with a convolution kernel of 3 × 3 and a step size of 1, and each of conv _ f2 and conv _ f4 may be followed by a BN (Batch Normalization) layer and an ELU activation layer. In fig. 2, the number of channels of the 5 two-dimensional convolution layers is the same, or the number of output channels of each layer in the feature extraction layer is the same, and is denoted as F. F may be 32, or may be other, and is not particularly limited.
If x is 2 and F is 32, the feature tensor dimension output by the feature extraction layer is $\frac{W}{2} \times \frac{H}{2} \times 32$. The feature extraction layer performs feature extraction on the multiple images to be processed to obtain multiple features, each of tensor dimension $\frac{W}{2} \times \frac{H}{2} \times 32$; for example, the extracted features may be feature images.
S103: and overlapping the features extracted by the feature extraction layer by using the feature overlapping layer to obtain the overlapped features.
Continuing the above embodiment, multiple features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$ are obtained. Assuming that the images to be processed comprise N pairs of images, where N is a positive integer, S104 may include: superposing the two features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times F$ corresponding to each pair of images to be processed to obtain superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$.
If x is 2 and F is 32, and the images to be processed acquired in S101 are the left and right images of a binocular camera (that is, N is 1), the feature tensor dimension of the superposed features is $\frac{W}{2} \times \frac{H}{2} \times 64$.
S104: and coding the superposed features by using the feature coding layer to obtain coded features.
Feature encoding fuses the features. Continuing the above embodiment, superposed features of tensor dimension $\frac{W}{x} \times \frac{H}{x} \times 2NF$ are obtained; in this case, S105 may include: encoding the superposed features with the feature encoding layer to obtain encoded features of tensor dimension $\frac{W}{y} \times \frac{H}{y} \times C$, where C represents the number of output channels of the feature encoding layer, y represents the second preset downsampling multiple, and y is greater than x.
For example, the feature encoding layer may be the upper half of fig. 3, comprising the 6 boxes conv1-conv6. Each box contains two-dimensional convolutional layers with 3 × 3 convolution kernels: the layer in conv1 has a step size of 1, while in each of conv2-conv6 the first two-dimensional convolutional layer has a step size of 1 and the second has a step size of 2. Each two-dimensional convolutional layer in the feature encoding layer may be followed by a BN (Batch Normalization) layer and an ELU activation layer.
This layer downsamples the superposed features, so that the global features of the image can be obtained. Specifically, the second downsampling multiple y may be set in advance: in fig. 3, each of conv2-conv6 downsamples once, and since the feature extraction layer also downsamples once, y may be 2⁶ = 64. To distinguish it from the sampling multiple above, the sampling multiple of the feature encoding layer is referred to here as the second sampling multiple y.
The number of output channels of each stage in the feature encoding layer doubles with each downsampling. For example, the input of the feature encoding layer has 64 channels and an input feature tensor dimension of $\frac{W}{2} \times \frac{H}{2} \times 64$; after the 5 downsamplings of conv2-conv6, the number of output channels of the feature encoding layer is 64 × 2⁵ = 2048 and the output feature tensor dimension is $\frac{W}{64} \times \frac{H}{64} \times 2048$.
S106: and performing deconvolution operation on the coded features by using a parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
As an embodiment, the parallax recovery layer comprises a plurality of active two-dimensional deconvolution layers; s106 may include: and in each activated two-dimensional deconvolution layer, obtaining the parallax under one scale by using a preset activation function.
For example, the parallax recovery layer may be the lower half of fig. 3, comprising upconv, upconv5-upconv1, and conv. Here, upconv is a two-dimensional deconvolution layer with a 3 × 3 convolution kernel and a step size of 2; each of the upconv5 and upconv4 boxes contains two two-dimensional deconvolution layers with 3 × 3 kernels, the first with step size 1 and the second with step size 2; the structure of each of the upconv3, upconv2, and upconv1 boxes can be as shown in fig. 4. conv is an activated two-dimensional deconvolution layer with a 3 × 3 kernel and a step size of 1; in conv, a disparity at one scale, which can be a disparity map, is obtained by using a sigmoid activation function.
Conv in FIG. 4 is the same as conv in FIG. 3, and is an activated two-dimensional deconvolution layer; furthermore, iconv1 is a two-dimensional deconvolution layer with a convolution kernel of 3 × 3 and a step size of 1, and iconv2 is a two-dimensional deconvolution layer with a convolution kernel of 3 × 3 and a step size of 2. In fig. 3, except for the last convolutional layer, the other convolutional layers may be followed by a BN layer and an ELU layer.
In fig. 3, each frame in the upconv3-upconv1 contains one conv, and after the upconv1, there is one conv, that is, the parallax recovery layer contains 4 convs, so that 4 parallax maps can be obtained: disp1-disp 4. The resolutions of the 4 disparity maps are different, wherein the high-resolution disparity map can better retain the detail information in the image, and the low-resolution disparity map can better recover the global structure information of the disparity map.
The three disparity maps disp2-disp4 are intermediate quantities and need not be output by the unsupervised neural network; the network may output only disp1, which is the disparity of the multiple images to be processed, where each image to be processed corresponds to one disparity map at that scale.
As described above, the multiple images to be processed acquired in S101 may be multiple images acquired by a multi-view camera, and as an embodiment, after obtaining the parallaxes of the multiple images to be processed in S106, the depth information of the multi-view camera may be calculated according to the obtained parallaxes.
Specifically, the depth information of the binocular camera may be calculated using the following equation:

$$Z = \frac{B \cdot f}{d}$$

where Z represents the depth information, B is generally referred to as the baseline of the binocular camera, f represents the focal length of the binocular camera, and d represents the parallax. It will be appreciated that a multi-view camera may be regarded as a plurality of binocular cameras, so its depth information may likewise be determined.
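In code, the conversion is a one-liner (Z and B in the same length unit, f and d in pixels; the epsilon guarding against zero disparity is an added safeguard, not part of the text):

```python
def disparity_to_depth(disparity, baseline, focal_length, eps=1e-6):
    # Z = B * f / d
    return baseline * focal_length / (disparity + eps)
```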
With the embodiment of the invention shown in fig. 1, the disparity between a plurality of images is determined by an unsupervised neural network. First, the network is trained with a loss function and needs no ground-truth disparity as supervision; the loss function contains one or more error parameters, which gradually decrease during training, i.e., the accuracy of the determined disparity increases, so the scheme determines disparity with high accuracy. Second, the unsupervised neural network in this embodiment is end-to-end: its input is the images to be processed and its output is their disparity, so the processing efficiency is high.
In some scenarios, parallax errors may result due to the effects of environmental factors. For example, as shown in fig. 5, the area a and the area B both belong to the ground area, but due to light reflection, the parallax difference between the area a and the area B is large, and the parallax of the area a is normal, and the parallax of the area B is abnormal.
For this situation, the network parameters of the trained unsupervised neural network may be adjusted to further improve the accuracy of determining the parallax. As an embodiment, after an unsupervised neural network is obtained by training a plurality of groups of sample images by using a preset loss function, determining an abnormal region in parallax output by the obtained unsupervised neural network; setting a new error parameter aiming at the abnormal area; adding the new error parameter to the preset loss function to obtain a new loss function; and adjusting the obtained unsupervised neural network by using the new loss function and the determined abnormal area to obtain the adjusted unsupervised neural network.
Specifically, an abnormal region in the parallax of the obtained unsupervised neural network output and a control region located on the same plane as the abnormal region may be determined; and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
The abnormal region may be the B region in fig. 5, and its control region may be the A region: the two regions lie on the same plane and the A region is not abnormal. Since the A region and the B region are on the same plane, the planar distance between them should theoretically be 0. For example, the three-dimensional coordinates of the pixel points in the A region and the B region can be obtained from the depth information calculated above. The plane of the A region is determined from the three-dimensional coordinates of its pixel points; the mean distance from each pixel point in the B region to that plane is then computed as the planar distance between the A region and the B region. Because of errors, this distance mean is not 0.
The distance mean is added to the original loss function to obtain a new loss function; the term is included so that training minimizes the planar distance between the B region and the A region. The new loss function is still the sum of the loss values of the multiple parallaxes obtained in the parallax recovery layer, where the loss value of the parallax at one scale may be the sum of four weighted error terms: the image matching error parameter multiplied by the first weight, the disparity map smoothness error parameter multiplied by the second weight, the disparity map left-right consistency error parameter multiplied by the third weight, and the plane-distance error parameter multiplied by a fourth weight. For example, the fourth weight may be 0.1.
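A sketch of the plane-distance term: fit a plane to the control region's 3-D points, then average the distances of the abnormal region's points to it (least-squares fitting is an assumed choice; the text only requires the distance mean):

```python
import torch

def plane_distance_error(ref_points, abnormal_points):
    """ref_points, abnormal_points: (N, 3) 3-D coordinates recovered from
    depth. Fit z = a*x + b*y + c to the control (A) region, then return the
    mean point-to-plane distance of the abnormal (B) region."""
    A = torch.cat([ref_points[:, :2], torch.ones_like(ref_points[:, :1])], dim=1)
    coeffs = torch.linalg.lstsq(A, ref_points[:, 2:]).solution  # (3, 1): a, b, c
    a, b, c = coeffs.flatten()
    num = (a * abnormal_points[:, 0] + b * abnormal_points[:, 1]
           - abnormal_points[:, 2] + c).abs()
    return (num / torch.sqrt(a**2 + b**2 + 1)).mean()
```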
In this case, S102 is: and inputting the plurality of images to be processed into the adjusted unsupervised neural network.
In this embodiment, after the abnormal region is determined, part of the pixels in the abnormal region may be selected for processing. For example, 1000 groups of pixels can be selected as calibration data, where a group of pixels includes one pixel of each image in a group of sample images, and a group of pixels corresponds to the same point in the real space.
In the process of adjusting the unsupervised neural network with the new loss function, the learning rate may be kept no greater than the initial learning rate: if the initial learning rate is 10⁻⁴, the learning rate during adjustment may be 5 × 10⁻⁵, or another value, which is not specifically limited. In addition, the batch size during adjustment can be set smaller, such as 4, and the number of training epochs can be set smaller, such as 10-15. The adjustment process in this embodiment may be understood as a weakly supervised training process.
For some scenes with abnormal parallax, the accuracy of determining the parallax can be further improved by applying the embodiment.
Corresponding to the above method embodiment, an embodiment of the present invention further provides an apparatus for determining image parallax, as shown in fig. 6, including:
an obtaining module 601, configured to obtain multiple images to be processed;
an input module 602, configured to input the multiple images to be processed into an unsupervised neural network obtained through pre-training; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters;
an extracting module 603, configured to extract features of the multiple images to be processed by using the feature extraction layer;
a superposition module 604, configured to utilize the feature superposition layer to superpose the features extracted by the feature extraction layer, so as to obtain superposed features;
an encoding module 605, configured to encode the superimposed features by using the feature encoding layer to obtain encoded features;
a disparity recovering module 606, configured to perform deconvolution operation on the encoded features by using the disparity recovering layer, so as to obtain the disparities of the multiple images to be processed.
As an embodiment, the feature tensor dimension of each image to be processed is W × H × 3, where W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed;
the extracting module 603 may be specifically configured to: for each image to be processed, the feature extraction layer is utilized to carry out convolution on the image to be processed to obtain the feature tensor dimensionality of
Figure BDA0001613834340000211
Wherein F denotes the number of output channels of the feature extraction layer, and x denotes a first preset downsampling multiple.
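For illustration, a minimal PyTorch sketch of such a feature extraction layer, assuming a single stride-x convolution (x = 2 and F = 32 are illustrative values, not fixed by the text):

```python
import torch
import torch.nn as nn

x_mult, F = 2, 32  # first downsampling multiple and output channels (assumed values)
extract = nn.Conv2d(in_channels=3, out_channels=F,
                    kernel_size=3, stride=x_mult, padding=1)

img = torch.randn(1, 3, 480, 640)   # one W x H x 3 image, here H=480, W=640
feat = extract(img)
print(feat.shape)                   # torch.Size([1, 32, 240, 320]) -> (W/x) x (H/x) x F
```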
In one embodiment, the plurality of images to be processed includes N images to be processed, where N is a positive integer; the superposition module 604 may be specifically configured to:
superpose the two feature tensors of dimension (W/x) × (H/x) × F corresponding to each pair of images to be processed, so as to obtain a superposed feature tensor of dimension (W/x) × (H/x) × 2F.
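The superposition can be sketched as a channel-wise concatenation of the two feature tensors (PyTorch; the tensor names are illustrative):

```python
import torch

feat_left = torch.randn(1, 32, 240, 320)   # (W/x) x (H/x) x F, left image
feat_right = torch.randn(1, 32, 240, 320)  # same shape, right image
stacked = torch.cat([feat_left, feat_right], dim=1)  # concatenate along channels
print(stacked.shape)                       # torch.Size([1, 64, 240, 320]) -> 2F channels
```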
As an embodiment, the encoding module 605 may be specifically configured to:
encode, by using the feature coding layer, the superposed feature tensor of dimension (W/x) × (H/x) × 2F to obtain an encoded feature tensor of dimension (W/y) × (H/y) × C, where C denotes the number of output channels of the feature coding layer, y denotes a second preset downsampling multiple, and y is greater than x.
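A sketch of the feature coding step, assuming two further stride-2 convolutions so that the second downsampling multiple is y = 8 > x = 2 (C = 128 is an illustrative value):

```python
import torch
import torch.nn as nn

C = 128
encode = nn.Sequential(   # (W/2) x (H/2) x 2F  ->  (W/8) x (H/8) x C
    nn.Conv2d(64, C, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
)
stacked = torch.randn(1, 64, 240, 320)  # the superposed features from above
coded = encode(stacked)
print(coded.shape)                      # torch.Size([1, 128, 60, 80])
```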
As an embodiment, the parallax recovery layer comprises a plurality of activated two-dimensional deconvolution layers; the disparity recovering module 606 may be specifically configured to:
in each activated two-dimensional deconvolution layer, obtain the parallax at one scale by using a preset activation function.
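One such activated two-dimensional deconvolution layer might look as follows; the sigmoid scaled by a maximum disparity is an assumed choice of activation function, and stacking several such layers yields the parallax at successively finer scales:

```python
import torch
import torch.nn as nn

max_disp = 192.0   # assumed maximum disparity in pixels
deconv = nn.ConvTranspose2d(128, 1, kernel_size=4, stride=2, padding=1)

coded = torch.randn(1, 128, 60, 80)              # encoded features
disp = torch.sigmoid(deconv(coded)) * max_disp   # parallax at one scale
print(disp.shape)                                # torch.Size([1, 1, 120, 160])
```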
As an embodiment, the preset loss function is the sum of the loss values of the multiple parallaxes obtained in the parallax recovery layer; the loss value comprises one or more of the following error parameters: an image matching error parameter, a parallax map smoothness error parameter, and a parallax map left-right consistency error parameter.
In one embodiment, the loss value of the parallax at one scale is: image matching error parameter × first weight + parallax map smoothness error parameter × second weight + parallax map left-right consistency error parameter × third weight, where the first weight, the second weight, and the third weight are preset.
As an embodiment, the apparatus may further include: a determination module, a setting module, an adding module and an adjusting module (not shown in the figure), wherein,
a determining module, configured to determine an abnormal region in the parallax output by the obtained unsupervised neural network;
the setting module is used for setting a new error parameter aiming at the abnormal area;
the adding module is used for adding the new error parameter into the preset loss function to obtain a new loss function;
the adjusting module is used for adjusting the obtained unsupervised neural network by utilizing the new loss function and the determined abnormal area to obtain an adjusted unsupervised neural network;
the input module 602 may be specifically configured to: and inputting the plurality of images to be processed into the adjusted unsupervised neural network.
As an embodiment, the determining module may be specifically configured to: determining an abnormal region in the obtained disparity output by the unsupervised neural network and a comparison region which is positioned on the same plane with the abnormal region;
the setting module may be specifically configured to: and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
As an embodiment, the obtaining module 601 may be specifically configured to:
acquiring a plurality of images acquired by a multi-view camera as images to be processed;
the apparatus may further include:
and a calculating module (not shown in the figure) for calculating the depth information of the multi-view camera according to the obtained parallax.
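For a horizontal binocular camera, the depth can be recovered from the parallax by the standard triangulation relation Z = f·B/d. A minimal sketch (the focal length and baseline values are illustrative):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px=700.0, baseline_m=0.12):
    """Depth Z = f * B / d per pixel; disparity d is in pixels,
    the baseline B in meters, the focal length f in pixels."""
    return focal_px * baseline_m / np.maximum(disparity, 1e-6)
```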
With the embodiment of the invention shown in fig. 6, the disparity between a plurality of images is determined by using an unsupervised neural network. In the first aspect, the unsupervised neural network is trained with a loss function and does not require the real parallax as supervision; the loss function contains one or more error parameters, and during training these error parameters gradually decrease, that is, the accuracy of determining the parallax increases, so the accuracy of determining the parallax with this scheme is high. In the second aspect, the unsupervised neural network in this embodiment is an end-to-end neural network: its input is the images to be processed and its output is their parallax, so the processing efficiency is high.
An embodiment of the present invention further provides an electronic device, as shown in fig. 7, including a processor 701 and a memory 702,
a memory 702 for storing a computer program;
the processor 701 is configured to implement any one of the above-described methods for determining image parallax when executing the program stored in the memory 702.
The electronic device may be a mobile phone, a computer, or other devices, or may also be a multi-view camera, and is not limited specifically.
The memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program is executed by a processor, the method for determining image parallax is implemented.
An embodiment of the present invention further provides a system for determining image parallax, as shown in fig. 8, including: a multi-view camera and a processing device, wherein,
the multi-view camera is used for acquiring a plurality of images and sending the images to the processing equipment;
the processing device is used for receiving the plurality of images as a plurality of images to be processed; inputting the multiple images to be processed into an unsupervised neural network obtained by pre-training; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters; extracting the features of the multiple images to be processed by using the feature extraction layer; superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features; coding the superposed features by using the feature coding layer to obtain coded features; and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
The multi-view camera may be a binocular camera, and the binocular camera may be a horizontal binocular camera, a vertical binocular camera, a pinhole binocular camera, a fisheye binocular camera, and so on; this is not specifically limited. Alternatively, the multi-view camera may be a trinocular camera or a camera with more than three views.
The processing device may perform any of the image disparity determination methods described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the embodiment of the apparatus for determining image parallax shown in fig. 6, the embodiment of the electronic device shown in fig. 7, the embodiment of the system for determining image parallax shown in fig. 8, and the above-mentioned computer-readable storage medium, since they are substantially similar to the embodiment of the method for determining image parallax shown in fig. 1 to 5, the description is relatively simple, and relevant points can be found by referring to the partial description of the embodiment of the method for determining image parallax shown in fig. 1 to 5.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (19)

1. A method for determining image parallax, comprising:
acquiring a plurality of images to be processed;
inputting the plurality of images to be processed into the adjusted unsupervised neural network; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters; the determination mode of the unsupervised neural network after adjustment is as follows: determining an abnormal region in the parallax output by the unsupervised neural network obtained by training in advance; setting a new error parameter aiming at the abnormal area; adding the new error parameter to the preset loss function to obtain a new loss function; adjusting the unsupervised neural network obtained by pre-training by using the new loss function and the determined abnormal area to obtain an adjusted unsupervised neural network;
extracting the features of the multiple images to be processed by using the feature extraction layer;
superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features;
coding the superposed features by using the feature coding layer to obtain coded features;
and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
2. The method according to claim 1, wherein the feature tensor dimension of each image to be processed is W x H x 3, wherein W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed;
the extracting the features of the image to be processed by using the feature extraction layer comprises the following steps:
for each image to be processed, convolving the image to be processed by using the feature extraction layer to obtain a feature tensor of dimension (W/x) × (H/x) × F, wherein F denotes the number of output channels of the feature extraction layer and x denotes a first preset downsampling multiple.
3. The method according to claim 2, wherein the plurality of images to be processed includes N images to be processed, where N is a positive integer; the step of superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features comprises the following steps:
superposing the two feature tensors of dimension (W/x) × (H/x) × F corresponding to each pair of images to be processed, so as to obtain a superposed feature tensor of dimension (W/x) × (H/x) × 2F.
4. The method according to claim 3, wherein said encoding the superimposed features by using the feature encoding layer to obtain encoded features comprises:
encoding, by using the feature coding layer, the superposed feature tensor of dimension (W/x) × (H/x) × 2F to obtain an encoded feature tensor of dimension (W/y) × (H/y) × C, wherein C denotes the number of output channels of the feature coding layer, y denotes a second preset downsampling multiple, and y is greater than x.
5. The method of claim 1, wherein the parallax recovery layer comprises a plurality of activated two-dimensional deconvolution layers; the performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed includes:
in each activated two-dimensional deconvolution layer, obtaining the parallax at one scale by using a preset activation function.
6. The method according to claim 5, wherein the preset loss function is the sum of the loss values of the multiple parallaxes obtained in the parallax recovery layer; the loss value comprises one or more of the following error parameters: an image matching error parameter, a parallax map smoothness error parameter, and a parallax map left-right consistency error parameter.
7. The method according to claim 6, wherein the loss value of the parallax at one scale is: image matching error parameter × first weight + parallax map smoothness error parameter × second weight + parallax map left-right consistency error parameter × third weight, and the first weight, the second weight, and the third weight are preset.
8. The method of claim 1, wherein determining an abnormal region in disparity of the resulting unsupervised neural network output comprises:
determining an abnormal region in the obtained disparity output by the unsupervised neural network and a comparison region which is positioned on the same plane with the abnormal region;
setting new error parameters aiming at the abnormal area, wherein the setting comprises the following steps:
and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
9. The method of claim 1, wherein said acquiring a plurality of images to be processed comprises: acquiring a plurality of images acquired by a multi-view camera as images to be processed;
after obtaining the parallaxes of the multiple images to be processed, the method further comprises the following steps:
and calculating the depth information of the multi-view camera according to the obtained parallax.
10. An apparatus for determining image parallax, comprising:
the acquisition module is used for acquiring a plurality of images to be processed;
the input module is used for inputting the plurality of images to be processed into the adjusted unsupervised neural network; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters;
the extraction module is used for extracting the features of the images to be processed by utilizing the feature extraction layer;
the superposition module is used for superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features;
the coding module is used for coding the superposed features by utilizing the feature coding layer to obtain coded features;
the parallax recovery module is used for performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed;
the determining module is used for determining an abnormal region in the parallax output by the unsupervised neural network obtained through pre-training;
the setting module is used for setting a new error parameter aiming at the abnormal area;
the adding module is used for adding the new error parameter into the preset loss function to obtain a new loss function;
and the adjusting module is used for adjusting the unsupervised neural network obtained by pre-training by utilizing the new loss function and the determined abnormal area to obtain the adjusted unsupervised neural network.
11. The apparatus according to claim 10, wherein the feature tensor dimension of each image to be processed is W × H × 3, where W is the width of the image to be processed, H is the height of the image to be processed, and 3 represents the number of color channels of the image to be processed;
the extraction module is specifically configured to: for each image to be processed, the feature extraction layer is utilized to carry out convolution on the image to be processed to obtain the feature tensor dimensionality of
Figure FDA0003062074420000041
Wherein F denotes the number of output channels of the feature extraction layer, and x denotes a first preset downsampling multiple.
12. The apparatus according to claim 11, wherein the plurality of images to be processed includes N images to be processed, where N is a positive integer; the superposition module is specifically configured to:
superpose the two feature tensors of dimension (W/x) × (H/x) × F corresponding to each pair of images to be processed, so as to obtain a superposed feature tensor of dimension (W/x) × (H/x) × 2F.
13. The apparatus according to claim 12, wherein the encoding module is specifically configured to:
encode, by using the feature coding layer, the superposed feature tensor of dimension (W/x) × (H/x) × 2F to obtain an encoded feature tensor of dimension (W/y) × (H/y) × C, wherein C denotes the number of output channels of the feature coding layer, y denotes a second preset downsampling multiple, and y is greater than x.
14. The apparatus of claim 10, wherein the parallax recovery layer comprises a plurality of activated two-dimensional deconvolution layers; the parallax recovery module is specifically configured to:
obtain, in each activated two-dimensional deconvolution layer, the parallax at one scale by using a preset activation function.
15. The apparatus of claim 14, wherein the preset loss function is the sum of the loss values of the multiple parallaxes obtained in the parallax recovery layer; the loss value comprises one or more of the following error parameters: an image matching error parameter, a parallax map smoothness error parameter, and a parallax map left-right consistency error parameter.
16. The apparatus according to claim 15, wherein the loss value of the parallax at one scale is: image matching error parameter × first weight + parallax map smoothness error parameter × second weight + parallax map left-right consistency error parameter × third weight, and the first weight, the second weight, and the third weight are preset.
17. The apparatus of claim 10, wherein the determining module is specifically configured to: determining an abnormal region in the obtained disparity output by the unsupervised neural network and a comparison region which is positioned on the same plane with the abnormal region;
the setting module is specifically configured to: and calculating the plane distance between the comparison area and the abnormal area as a new error parameter.
18. The apparatus of claim 10, wherein the obtaining module is specifically configured to:
acquiring a plurality of images acquired by a multi-view camera as images to be processed;
the device further comprises:
and the calculating module is used for calculating the depth information of the multi-view camera according to the obtained parallax.
19. An image parallax determination system, comprising: a multi-view camera and a processing device, wherein,
the multi-view camera is used for acquiring a plurality of images and sending the images to the processing equipment;
the processing device is used for receiving the plurality of images as a plurality of images to be processed; inputting the plurality of images to be processed into the adjusted unsupervised neural network; the unsupervised neural network comprises a feature extraction layer, a feature superposition layer, a feature coding layer and a parallax recovery layer; the unsupervised neural network is as follows: training a plurality of groups of sample images by using a preset loss function, wherein each group of sample images comprises a plurality of images with parallax, and the preset loss function comprises one or more error parameters; the determination mode of the unsupervised neural network after adjustment is as follows: determining an abnormal region in the parallax output by the unsupervised neural network obtained by training in advance; setting a new error parameter aiming at the abnormal area; adding the new error parameter to the preset loss function to obtain a new loss function; adjusting the unsupervised neural network obtained by pre-training by using the new loss function and the determined abnormal area to obtain an adjusted unsupervised neural network; extracting the features of the multiple images to be processed by using the feature extraction layer; superposing the features extracted by the feature extraction layer by using the feature superposition layer to obtain superposed features; coding the superposed features by using the feature coding layer to obtain coded features; and performing deconvolution operation on the coded features by using the parallax recovery layer to obtain the parallaxes of the multiple images to be processed.
CN201810276957.7A 2018-03-30 2018-03-30 Method, device and system for determining image parallax Active CN110335228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810276957.7A CN110335228B (en) 2018-03-30 2018-03-30 Method, device and system for determining image parallax

Publications (2)

Publication Number Publication Date
CN110335228A CN110335228A (en) 2019-10-15
CN110335228B true CN110335228B (en) 2021-06-25

Family

ID=68139956




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant